Distributions are the numbers of today: From histogram data to distributional data. Javier Arroyo Gallardo, Universidad Complutense de Madrid

1 Distributions are the numbers of today: From histogram data to distributional data. Javier Arroyo Gallardo, Universidad Complutense de Madrid

2 Introduction

3 Symbolic data
Symbolic data was introduced by Edwin Diday in 1987 to represent variability. Symbolic variables make it possible to describe groups of individuals and concepts. They include list-of-values variables (with or without weights), interval variables, and histogram variables. Symbolic representations can also include internal structure (hierarchies) and logical dependencies (rules).

4 Interval data
Interval data is by far the most popular symbolic representation. It arises naturally in several contexts, for example meteorology and finance, and it is easier to propose methods for interval variables than for other symbolic variables. An interval is represented by two values: minimum and maximum, or center and radius. The truly symbolic methods are those that operate on the interval representation itself, not those that treat it as two classical variables.
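The two interval encodings mentioned above carry the same information. The following is a minimal sketch of converting between them; the `Interval` class and the temperature values are hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lower, upper], stored by its end points (illustrative class)."""
    lower: float
    upper: float

    def __post_init__(self):
        if self.lower > self.upper:
            raise ValueError("lower must not exceed upper")

    @classmethod
    def from_center_radius(cls, center, radius):
        """Build the interval from the alternative (center, radius) encoding."""
        if radius < 0:
            raise ValueError("radius must be non-negative")
        return cls(center - radius, center + radius)

    @property
    def center(self):
        return (self.lower + self.upper) / 2

    @property
    def radius(self):
        return (self.upper - self.lower) / 2


# Hypothetical daily temperature range, in degrees Celsius.
temperature = Interval(lower=12.0, upper=21.0)
same_interval = Interval.from_center_radius(temperature.center, temperature.radius)
```

Keeping the interval as a single object, rather than as two unrelated columns, is in the spirit of the "truly symbolic" methods described on the slide.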

5 Histogram data
Histogram data is the younger brother of interval data. A histogram is a statistical tool: histogram data does not arise naturally, and building a histogram requires choosing parameters. Fewer methods have been developed for histogram data because the histogram representation is considerably more sophisticated than the interval one. However, the situation has been changing in recent years, and histogram (and distributional) data is developing its potential. Distributions are no longer the numbers of the future (Schweizer, 1984): they are the numbers of the present!

6 From histogram data to distributional data

7 From histogram data to distributional data
Distributional variables make it possible to describe each unit of interest by its observed data distribution. A distribution does not summarize the data through statistics such as the mean, variance, minimum, or maximum: it is the data! The distribution has to be represented in some way, and the practitioner can focus on the representation that best fits the problem. There are two types of representation: binned density estimators, such as histograms, and smooth, continuous density estimators.

8 Binned density estimators
Binned density estimators offer the analyst a great variety of choices. Histograms with a fixed bin width are the classical choice; if an accurate representation is needed, there are rules to determine the optimal bin width (Wand, 1997). Equifrequency histograms (such as boxplots) are another option. Histograms can also be defined on a given partition of the range of the variable of interest; Fink et al. (2009) propose a method to accurately estimate histograms with variable-width bars. Histograms can be defined as a sequence of quantiles of specific interest. Finally, there is the frequency polygon; Delicado and del Río (2003) propose a generalization of frequency polygons to accurately estimate distributions.
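To make the first two choices concrete, here is a minimal sketch of a fixed-width histogram and an equifrequency histogram using only the Python standard library. The function names are illustrative, the number of bins is taken as given (no optimal bin-width rule is applied), and the quantile interpolation method is an assumption.

```python
import statistics


def fixed_width_histogram(data, n_bins):
    """Classical histogram: n_bins bins of equal width over [min, max]."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must contain at least two distinct values")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum falls on the last edge, so clamp it into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins)] + [hi]
    freqs = [c / len(data) for c in counts]
    return edges, freqs


def equifrequency_histogram(data, n_bins):
    """Histogram whose bin edges are sample quantiles.

    Each bin is assigned relative frequency 1 / n_bins by construction
    (approximately true for the raw data when there are ties).
    """
    cuts = statistics.quantiles(data, n=n_bins, method="inclusive")
    edges = [min(data)] + cuts + [max(data)]
    freqs = [1 / n_bins] * n_bins
    return edges, freqs
```

In the first function the edges are fixed and the frequencies vary; in the second the frequencies are fixed and the edges vary, which is the form that maps directly onto a quantile function.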

9 Binned density estimators
Binned density estimators have some problems too: they are not smooth, and they depend on the end points and on the width of the bins.

10 Smooth estimators
If the analyst wants a smooth representation and the underlying distribution is known, parametric methods can be used. Kernel density estimation methods (Simonoff, 1998) require two choices. The first is a kernel: uniform, triangular, Epanechnikov, normal, etc. The second is the bandwidth of the kernel, which should be as small as the data allow and involves a trade-off between the bias of the estimator and its variance. The value usually chosen is the one that minimizes the AMISE (Asymptotic Mean Integrated Squared Error). Jones et al. (1996) survey methods to estimate the bandwidth.
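As an illustration of the two choices, the sketch below builds a kernel density estimate with a normal kernel. The default bandwidth is the normal reference rule of thumb, h = 1.06 · s · n^(-1/5), which is AMISE-optimal only when the data are themselves Gaussian; it is an assumption made here for simplicity, not one of the selectors surveyed by Jones et al. (1996). The function name is hypothetical.

```python
import math
import statistics


def gaussian_kde(data, bandwidth=None):
    """Return a density function estimated with a normal (Gaussian) kernel.

    If no bandwidth is given, the normal reference rule of thumb is used
    (an assumption of this sketch; it needs at least two data points).
    """
    n = len(data)
    if bandwidth is None:
        bandwidth = 1.06 * statistics.stdev(data) * n ** (-1 / 5)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Average of Gaussian bumps of width `bandwidth` centered on the data.
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density
```

A smaller bandwidth follows the data more closely (less bias, more variance); a larger one gives a smoother but more biased estimate.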

11 The Quantile Function

12 The quantile function
The quantile function (QF) is the inverse of the cumulative distribution function.
[Figure panels: density function, cumulative distribution function, quantile function]
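For a histogram, the inversion can be written down directly. The sketch below assumes that the data are uniformly distributed within each bin, so the cumulative distribution function is piecewise linear and so is its inverse; the function name and the example histogram are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def histogram_quantile_function(edges, freqs):
    """Return Q(t), the inverse CDF of a histogram, for t in [0, 1].

    Assumes a uniform distribution inside each bin (piecewise-linear CDF).
    """
    if len(edges) != len(freqs) + 1:
        raise ValueError("need exactly one more edge than bins")
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("relative frequencies must sum to 1")
    cum = [0.0] + list(accumulate(freqs))
    cum[-1] = 1.0  # absorb floating-point rounding in the running sum

    def quantile(t):
        if not 0.0 <= t <= 1.0:
            raise ValueError("t must lie in [0, 1]")
        # Bin i satisfies cum[i] < t <= cum[i + 1] (i = 0 when t == 0).
        i = max(bisect_left(cum, t) - 1, 0)
        mass = cum[i + 1] - cum[i]
        if mass == 0.0:  # only possible for t == 0 with an empty first bin
            return edges[i]
        frac = (t - cum[i]) / mass
        return edges[i] + frac * (edges[i + 1] - edges[i])

    return quantile


# Hypothetical histogram with bins [0,1), [1,2), [2,4].
q = histogram_quantile_function([0, 1, 2, 4], [0.2, 0.5, 0.3])
median = q(0.5)
```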

13 The quantile function
The QF provides a common framework to represent data described by histograms, intervals, nominal multi-value types, etc. (Ichino, 2011), and it is the conceptual underpinning of many of the methods proposed for histogram data. The QF has some interesting features. Its X-axis has a fixed range, [0,1]. The Wasserstein metrics for QFs make it possible to compute distribution-valued central tendency moments of distributional data and their associated dispersion measures. These distribution-valued moments are the basis of many methods for distributional data, e.g. methods based on the concept of average.
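For one-dimensional distributions, the l2 Wasserstein distance equals the L2 distance between the two quantile functions on [0,1], and the corresponding mean distribution has as its quantile function the pointwise average of the quantile functions. The sketch below approximates the integral with a midpoint rule; the function names and the grid size are assumptions of the example, and normal distributions are used only because their quantile functions are available in the standard library.

```python
import math
from statistics import NormalDist


def l2_wasserstein(q1, q2, n_grid=1000):
    """Approximate sqrt( integral_0^1 (q1(t) - q2(t))^2 dt ) by the midpoint rule."""
    total = 0.0
    for k in range(n_grid):
        t = (k + 0.5) / n_grid  # midpoints avoid t = 0 and t = 1
        total += (q1(t) - q2(t)) ** 2
    return math.sqrt(total / n_grid)


def mean_quantile_function(quantile_functions):
    """Quantile function of the l2 Wasserstein mean: the pointwise average."""
    def mean_q(t):
        return sum(q(t) for q in quantile_functions) / len(quantile_functions)
    return mean_q


qf_a = NormalDist(mu=0.0, sigma=1.0).inv_cdf
qf_b = NormalDist(mu=4.0, sigma=1.0).inv_cdf
distance = l2_wasserstein(qf_a, qf_b)
average = mean_quantile_function([qf_a, qf_b])
```

Averaging quantile functions shifts and rescales the shape, whereas averaging densities pointwise would produce a mixture; this is the sense in which the "concept of average" depends on the chosen metric.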

14 The catalogue of methods

15 Methods proposed for distributional data
Brito (2012) provides a very nice survey on the topic. The catalogue of methods for distributional data includes descriptive statistics, regression, clustering, dimensionality reduction, time series forecasting, and visualization methods.

16 Methods proposed for distributional data
Descriptive statistics: Bertrand and Goupil (2000) propose real-valued univariate and bivariate statistics, extended by Billard and Diday (2006). Irpino and Verde (2014) propose a new set of real-valued statistics based on the l2 Wasserstein distance. Rivoli et al. (2012) propose central tendency moments in distribution form based on the family of Wasserstein metrics.
Regression: Billard and Diday (2006) proposed the first model, based on real-valued univariate and bivariate statistics. Irpino and Verde (2012) and Dias and Brito (2013) propose regression models based on the QF and the l2 Wasserstein distance.

17 Methods proposed for distributional data
Clustering: there are many approaches to clustering this kind of data. The simple hierarchical models just need an appropriate distance. Irpino and Verde (2006) propose a dynamic clustering method that is based on the l2 Wasserstein distance and that averaged histograms for the first time. Brito and Ichino (2011) propose hierarchical clustering methods based on quantile representations of the different types of data. Brito and Chavent (2012) propose a divisive algorithm that works with interval-valued and/or histogram-valued variables using appropriate distances.

18 Methods proposed for distributional data
Dimensionality reduction: Rodriguez et al. (2000) propose a Principal Component Analysis (PCA) for histograms, where histograms are a succession of nested intervals. Nagabhushan and Pradeep Kumar (2007) propose a histogram arithmetic to extend PCA to a simplified version of histogram data (closer to compositional data). Delicado (2011) extends PCA to density functions using functional data methods and tools from compositional data (Egozcue et al., 2006). Ichino (2011) adapts PCA to symbolic data using the quantile representation.
Time series forecasting: Arroyo and Maté (2009) adapt the k-nearest neighbours (k-NN) method, and Arroyo et al. (2011) adapt exponential smoothing methods. All these methods are based on Wasserstein metrics.
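The cited forecasting methods are not reproduced here. As a rough illustration of why the quantile representation is convenient for them, the sketch below applies simple exponential smoothing level by level to a series of quantile vectors. It assumes that every distribution in the series is stored as a nondecreasing vector of quantiles on one common grid of probability levels; the function name and the toy series are hypothetical. A convex combination of two nondecreasing vectors is again nondecreasing, so each smoothed vector is still a valid quantile function (in one dimension it is a weighted l2 Wasserstein mean of the two distributions).

```python
def smooth_quantile_series(series, alpha):
    """Simple exponential smoothing applied to quantile vectors.

    series: list of quantile vectors, all on the same probability grid.
    alpha:  smoothing weight in (0, 1] given to the newest observation.
    Returns the smoothed vectors; the last one can serve as a
    one-step-ahead forecast.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    smoothed = [list(series[0])]
    for observed in series[1:]:
        previous = smoothed[-1]
        smoothed.append(
            [alpha * o + (1.0 - alpha) * p for o, p in zip(observed, previous)]
        )
    return smoothed


# Toy series of three distributions, each described by three quantiles.
toy_series = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
forecast = smooth_quantile_series(toy_series, alpha=0.5)[-1]
```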

19 Methods proposed for distributional data
Visualization methods: Sopan et al. (2013) propose methods to intuitively visualize distributional data sets (distributions of distributions).

20 How do methods work with both representations?
Methods based on quantile functions and/or distances between quantile functions need no adaptation to work with both histograms and smooth distributions. Methods based on bin representations are meant to deal with histogram representations, but they can be adapted to work with smooth representations in two steps: first, the smooth representation is estimated; second, it is transformed into a histogram representation with a sufficiently large number of bins.

21 How do methods work with both representations?
The smooth representation can be transformed into a histogram with a sufficiently large number of bins.
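A minimal sketch of this transformation follows, assuming the smooth representation is available through its quantile function. It produces an equiprobable (quantile) histogram; the function name and the `tail` parameter, which trims a tiny probability mass so that unbounded distributions get finite end points, are assumptions of the example.

```python
from statistics import NormalDist


def smooth_to_quantile_histogram(inv_cdf, n_bins, tail=1e-6):
    """Discretize a smooth distribution into n_bins equiprobable bins.

    inv_cdf: quantile function of the smooth distribution.
    tail:    probability trimmed at each end so the outer edges are finite;
             the remaining mass is renormalized to 1 (sketch assumption).
    """
    levels = [tail + (1.0 - 2.0 * tail) * k / n_bins for k in range(n_bins + 1)]
    edges = [inv_cdf(p) for p in levels]
    freqs = [1.0 / n_bins] * n_bins
    return edges, freqs


# Example with a normal distribution standing in for the smooth estimate.
edges, freqs = smooth_to_quantile_histogram(NormalDist(1.0, 1.0).inv_cdf, n_bins=100)
```

The more bins are used, the closer the piecewise-linear quantile function of the histogram is to the smooth one, at the cost of longer vectors to store and process.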

22 How about the performance of this trick?
Setting: l2 Wasserstein distance for histogram data, computed in Matlab 7 R2010 on a laptop PC (1.8 GHz, 2 cores, 8 GB RAM). Two sets of 10^6 data are simulated from N(1,1) and N(3,1); the two quantile histograms are estimated and the l2 Wasserstein distance between them is measured (1000 times).
[Table columns: # bins, time (sec), distance. Only the last row survives: Smooth, -, 2.]
The Matlab code could still be optimized.
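The experiment can be imitated on a small scale. The sketch below is not the Matlab code from the slide: it uses Python, a much smaller sample (20,000 instead of 10^6 points, an assumption made to keep it fast), and illustrative function names. Because N(1,1) and N(3,1) differ only by a shift, their quantile functions differ by the constant 2, so the exact l2 Wasserstein distance is 2, which is the value in the surviving "Smooth" row. Both quantile histograms use the same equiprobable bins, so the difference of their piecewise-linear quantile functions is linear within each bin and its square can be integrated exactly.

```python
import math
import random
import statistics


def quantile_histogram_edges(sample, n_bins):
    """Edges of an equiprobable histogram: sample quantiles plus min and max."""
    cuts = statistics.quantiles(sample, n=n_bins, method="inclusive")
    return [min(sample)] + cuts + [max(sample)]


def l2_wasserstein_quantile_histograms(edges_a, edges_b):
    """l2 Wasserstein distance between two histograms with the same
    equiprobable bins, assuming uniform mass inside each bin.

    Within a bin the quantile difference runs linearly from a to b, and the
    integral of its square over the bin is width * (a^2 + a*b + b^2) / 3.
    """
    if len(edges_a) != len(edges_b):
        raise ValueError("both histograms need the same number of bins")
    n_bins = len(edges_a) - 1
    total = 0.0
    for i in range(n_bins):
        a = edges_a[i] - edges_b[i]
        b = edges_a[i + 1] - edges_b[i + 1]
        total += (a * a + a * b + b * b) / 3.0
    return math.sqrt(total / n_bins)


rng = random.Random(0)  # fixed seed so the run is repeatable
sample_a = [rng.gauss(1.0, 1.0) for _ in range(20_000)]
sample_b = [rng.gauss(3.0, 1.0) for _ in range(20_000)]

for n_bins in (10, 100, 1000):
    d = l2_wasserstein_quantile_histograms(
        quantile_histogram_edges(sample_a, n_bins),
        quantile_histogram_edges(sample_b, n_bins),
    )
    print(n_bins, round(d, 3))
```

Wrapping the distance call in a timer such as `time.perf_counter` would reproduce the time column of the table for this implementation; no timings are claimed here.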

23 Some contributions from other fields

24 Distributions as seen from other fields
Compositional data (Aitchison, 1986) is a related field: non-negative data with constant sum that provide a quantitative description of the parts of some whole. Egozcue et al. (2006) say that density functions are infinite-dimensional compositional data. Distributional data can also be considered a particular case of functional data (Ramsay and Silverman, 1997). However, extending functional methods to distributional data is not straightforward: pointwise operations for functional data, such as linear combination, do not work, so alternative operations are needed, for example the operations and distances of Aitchison geometry (Aitchison, 1986) or convolution operators.
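For finite compositions, the basic operations of Aitchison geometry are short to write down. The sketch below shows closure, perturbation (the analogue of addition), powering (the analogue of scalar multiplication), and the Aitchison distance computed through the centered log-ratio transform. Function names are illustrative, and all parts are assumed strictly positive because of the logarithm.

```python
import math


def closure(parts):
    """Rescale positive parts so that they sum to 1."""
    total = sum(parts)
    return [p / total for p in parts]


def perturb(x, y):
    """Perturbation: componentwise product followed by closure."""
    return closure([a * b for a, b in zip(x, y)])


def power(alpha, x):
    """Powering: componentwise power followed by closure."""
    return closure([a ** alpha for a in x])


def clr(x):
    """Centered log-ratio transform (requires strictly positive parts)."""
    logs = [math.log(a) for a in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of x and y."""
    return math.dist(clr(x), clr(y))


x = [0.2, 0.3, 0.5]
y = [0.5, 0.3, 0.2]
# A pointwise linear combination can leave the simplex: x - y has a negative part.
pointwise_difference = [a - b for a, b in zip(x, y)]
# Perturbation and powering always return a valid composition.
combined = perturb(x, power(2.0, y))
```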

25 Applications everywhere!

26 Applications everywhere!
Image analysis: images are represented by histograms, with applications in fields such as precision agriculture or medicine (Sharma et al., 2013).
Sensor and radar data: spatial and temporal aggregation, for example of river level data and cloud sensor data; in wind power production, aggregation of the generators in a wind farm.
Finance and economics: contemporaneous and temporal aggregation of financial returns, with applications in portfolio management and Value at Risk (Arroyo et al., 2011).

27 Applications everywhere!
Official statistics (distributions of variables in the population): population pyramids (Delicado, 2011) and income distributions (Kneip and Utikal, 2001; Delicado, 2007).
Rating distributions: recommendation systems and rating websites (Sopan et al., 2013), and trust and reputation systems.
Opportunities in big data (SDA 2014 and COMPSTAT 2014 sessions!).
And many other fields, e.g. social network data and computer logs (web, search engines, etc.).
If a mean or a sample is used as an aggregation tool to make the data manageable, then a distribution can be used instead.

28 Conclusions
Distributional data is now a mature field in terms of theory, with a homogeneous conceptual underpinning, a diverse catalogue of methods, and high applicability to real-life problems. However, it is barely known and barely used outside the symbolic family. So we have a mission: spread the word, propose interesting applications, and explore possible synergies with other fields, such as functional data or compositional data.

29 How can I analyze my distributional data?
There is a plan to develop an open-source toolbox for distributional data in Matlab, Octave, and R, for both binned and smooth representations. It is to include distribution estimation and visualization, descriptive statistics, clustering, regression, time series forecasting and, eventually, all the methods proposed. People involved: Antonio Irpino, Antonio Balzanella, Sonia Dias, and Javier Arroyo.

30 References

31 References
Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman and Hall.
Arroyo, J., Maté, C. (2009). Forecasting histogram time series with k-nearest neighbours methods. International Journal of Forecasting 25 (1),
Arroyo, J., Maté, C., Muñoz San Roque, A. (2011). Smoothing Methods for Histogram-valued Time Series. An application to Value-at-Risk. Statistical Analysis and Data Mining 4,
Bertrand, P., Goupil, F. (2000). Descriptive statistics for symbolic data. In: Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer, pp
Billard, L., Diday, E. (2006). Symbolic data analysis: conceptual statistics and data mining. Wiley.
Brito, P. (2012). Beyond summaries of individual data: Analyzing distributions. In Symposium on Learning and Data Science,

32 References
Brito, P., Chavent, M. (2012). Divisive Monothetic Clustering for Interval and Histogram-Valued Data. In: Proc. ICPRAM st International Conference on Pattern Recognition Applications and Methods.
Brito, P., Ichino, M. (2011). Conceptual Clustering of Symbolic Data Using a Quantile Representation: Discrete and Continuous Approaches. In: Proc. Workshop on Theory and Application of High-dimensional Complex and Symbolic Data Analysis in Economics and Management Science.
Delicado, P. (2007). Functional k-sample problem when data are density functions. Computational Statistics 22 (3),
Delicado, P. (2011). Dimensionality reduction when data are density functions. Computational Statistics and Data Analysis 55 (1),
Delicado, P., del Río, M. (2003). A Generalization of Histogram Type Estimators. Journal of Nonparametric Statistics 15 (1),

33 References
Dias, S., Brito, P. (2013). Distribution and Symmetric Distribution Regression Model for Histogram-Valued Variables. arXiv:
Egozcue, J.J., Díaz-Barrero, J.L., Pawlowsky-Glahn, V. (2006). Hilbert space of probability density functions based on Aitchison geometry. Acta Mathematica Sinica 22,
Fink, E., Sarin, A., Carbonell, J. (2009). Analysis of uncertain data: Smoothing of histograms. Proc. of the IEEE International Conference on Systems, Man and Cybernetics, pp
Ichino, M. (2011). The quantile method for symbolic principal component analysis. Statistical Analysis and Data Mining 4,
Irpino, A., Verde, R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In Data Science and Classification, pp

34 References
Irpino, A., Verde, R. (2012). Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance. arXiv v3
Irpino, A., Verde, R. (2014). Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification.
Jones, M.C., Marron, J.S., Sheather, S.J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91 (433),
Kneip, A., Utikal, K. (2001). Inference for density families using functional principal component analysis. Journal of the American Statistical Association 96,
Nagabhushan, P., Pradeep Kumar, R. (2007). Histogram PCA. In Proc. of the 4th International Symposium on Neural Networks, pp
Ramsay, J.O., Silverman, B.W. (1997). Functional Data Analysis. Springer.
Rivoli, L., Irpino, A., Verde, R. (2012). The median of a set of histogram data. In XLVI meeting of the Italian Statistical Society.

35 References
Rodriguez, O., Diday, E., Winsberg, S. (2000). Generalization of the Principal Components Analysis to Histogram Data. In: Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Data Bases.
Schweizer, B. (1984). Distributions are the numbers of the future. In Proceedings of the mathematics of fuzzy systems meeting (Naples, Italy), pp
Sharma, A. et al. (2013). Spatiotemporal modeling of discrete-time distribution-valued data applied to DTI tract evolution in infant neurodevelopment. In Proc. IEEE International Symposium Biomedical Imaging, pp
Simonoff, J.S. (1998). Smoothing Methods in Statistics. Springer.
Sopan, A., Freire, M., Taieb-Maimon, M., Plaisant, C., Golbeck, J., Shneiderman, B. (2013). Exploring data distributions: Visual design and evaluation. International Journal of Human-Computer Interaction 29 (2),
Wand, M.P. (1997). Data-based choice of histogram bin width. The American Statistician 51 (1),
