Histogram data analysis based on Wasserstein distance

Size: px

Start display at page:

Download "Histogram data analysis based on Wasserstein distance"

Felicity Patrick
6 years ago
Views:

1 Histogram data analysis based on Wasserstein distance Rosanna Verde Antonio Irpino Department of European and Mediterranean Studies Second University of Naples Caserta - ITALY SYMPOSIUM ON LEARNING AND DATA SCIENCE May 7, 8 & 9, 01 FLORENCE, Italy

2 Outline Histogram (distribution) data in the framework of the Symbolic Data Analysis New distances to compare histogram (distribution) data: Wasserstein Mallows l distance Some properties of this distance Methodological approaches based on l Wasserstein distance Basic Statistics for histogram data (including intervals and distributions as particular cases) Clustering methods: DCA and hierarchical (Ward criterion) Linear regression model: l as metric for Sum Square Errors in OLS estimation method

3 Histogram symbolic variable [Bock and Diday (000) ] A Modal variable is a multi-valued variable with a frequency, probability, or weight pi associated to each of its specific values U(i). I.e., the modal variable Y is a mapping Y(i) = {U(i), pi} for i W where pi is a nonnegative measure or a distribution on the domain Y of possible observation values and U(i) Y is the support of i. A Histogram variable is particular case of modal variable when the support of Y is numerical and U(i) are continuous and non overlapped intervals U(i) ={I i1,, I ih,..., I ih }; (I ih I ih = Ø) with associated a measure pi Single observations of a histogram variable Y are histogram data

4 Histogram data as a particular case of modal symbolic descriptions [Bock and Diday (000) ] Histogram data is a model for representing the empirical distribution of a continuous variable Y partitioned into a set of contiguous I h intervals (bins) with associated p h weights. A histogram is represented by a set of H ordered pairs (I h,p h )} such that: Ihi yh; y h ; yh yh ; yh, yh h1,..., H ; max I min y y h h' I I hi h h h1,..., H h1,..., H h h' p 0 h h1,..., H p h 1 Y(i)=[([0-10),0.);([10-0),0.5);([0-30),0.1); ([30-40],0.)] 0,6 0,5 0,4 0,3 0, 0,1 0 0,5 0, 0, 0,

5 Main Sources of histogram data Results of summary/clustering procedures From surveys From large databases From sensors 0,6 0,5 0,5 Temperatures Pollutant concentration Network activity 0,4 0,3 0, 0,1 0, 0,1 0, Data streams 0 Description of time window data sequences Image analysis Color bandwidths 0,6 0,5 0,4 0,5 0,4 Confidentiality data Summary data non punctual 0,3 0, 0,1 0 0, 0,1 0, 0, 0,1 0,1

6 Comparison of two histogram data How do compare two units described by two histograms? A possibility is to use metrics developed for comparing probability distributions 0,5 0,4 0,4 0,45 0,3 0,3 0,3 0, 0,1 0, 0,1 0,15 0,

7 Metrics used for the evaluation of the convergence of two probability measures (Gibbs and Su, 00) Given: - a domain W on which is possible to define a Borel-algebra, - two measure, - the density functions f and g - the corresponding distribution functions F and G and - a subdominant measure, like: =(+)/, Gibbs and Su (00) present a review of the most used dissimilarities:

8 A suitable measure to compute the distance between histograms: Wasserstein-Kantorovich metric Wasserstein-Kantorovich metric: W 1 1 1, ( ) ( ) d F t G t dt 0 the derived l Mallow s distance between two quantile functions W, ( ) ( ) d F t G t dt 0 The main difficulties to compute this distance is the analytical definition of the quantile function But in our case we treat especially with histogram data), indeed

Histograms are locally uniform Having represented an histogram Y(i) as {(I 1,p 1 ),,(I h,p h ),, (I m,p H )} (where H is the number of intervals of the support), we may define: the empirical density

9 Histograms are locally uniform Having represented an histogram Y(i) as {(I 1,p 1 ),,(I h,p h ),, (I m,p H )} (where H is the number of intervals of the support), we may define: the empirical density function f(y) as: p h f( y) if y [ yh; yh) f y h h yh p 0 otherwise h the empirical distribution function w h h p k k 1 0 if h 0 if h1,..., m hence the empirical quantile function (The inverse of the distribution function) y h y wh wh 1 F( y) wh y y iff y y y h h y y F( y) h t h h h 1 h1 F t y y h h y w h h t wh 1 wh wh 1 t w ( ) where 0 1 F 1 () t

Geometric interpretation The comparison of two quantile functions associated with two histograms requires the partition of the support of the two qf s into m intervals with associated uniform density

10 Geometric interpretation The comparison of two quantile functions associated with two histograms requires the partition of the support of the two qf s into m intervals with associated uniform density 1,00 0,95 0,90 0,85 0,9 t=0.90 0,9 0,80 0,75 0,70 0,7 t=0.70 0,65 0,60 0,6 0,55 t=0.60 0,50 0,45 p 3 0,40 0,35 t=0.30 0,30 0,3 0,5 0,0 0,15 0,10 t=0.15 0,15 0,05 0, I 11 I 1 10 I 31 I0 41 I I I 1 70I I 3 80I 4 I 5 90 I

11 l p l Ili yli, y li I y, y lj lj lj d W(I il ;I jl ) p l d W(I il ;I jl ) [0; 5] [60,70] [ 5;10] [70;73.3] [10; 16.6] [73.3;80] [16.6;0] [80;83.3] [0;30] [83.3;90] [30;40] [90;100] d Y Y d I I c c r r m m (, ) (, ) ( ) ( ) W i j pl li lj pl li lj li lj l1 l1 3 With y y y y Ili yli, y li ; cli ; rli li li li li center and radius of I li Verde, Irpino, COMPSTAT006

12 Multivariate distance Assuming independence among p variables we propose the an extension of d W distance to the multivariate case p=

13 Average histogram based on distance d W Wasserstein The barycenter (average) histogram Y(b) of a set of histogram data Y(i) (i=1,, n) can be computed minimizing the Sum of Square n Distances f ( Y( b*)) dw ( Y( i), Y( b)) i1 (like for a cloud of points in classical data analysis) n m 1 f ( Y( b*)) p ( c c ) ( r r ) i1 l1 3 l li lb li lb That is minimized when the usual first order conditions are satisfied: n f p c c 0 clb f p r lb 3 l li lb n n i1 1 1 clb n li ; n c rlb n rli i1 i1 lrli rlb 0 i Y( b) c r ; c r, p ;...; c r ; c r, p ;...; c r ; c r, p b b b b lb lb lb lb l mb mb mb mb m

14 Variance of the set of histograms Y(i) Determined Y(b) through the minimization problem n * arg min W ( ( ), ( )) i1 b d Y i Y b The variance of the set of histogram data is: n m pl ( cli clb ) ( rli rlb ) i1 l1 3 1

QQ plot Mean, variance and correlation Mean and variance of the quantile function: 100 95 90 85 80 75 70 65 60 0 10 0 30 40 1 x x ( t) dt x p c i i i l il 0 l 1 1 x () i xi t dt xi

15 QQ plot Mean, variance and correlation Mean and variance of the quantile function: x x ( t) dt x p c i i i l il 0 l 1 1 x () i xi t dt xi x i l cil ril x i 3 0 l p Correlation between quantile functions (x i, x j ) 1 x ( t) x ( t) dt x x 1 p c c r r xx i j i j l il jl il jl i j 0 l 3 ( xi, xj ) ( xi, x j ) x x i j i j

16 Interpretative decomposition of the distance in three components related to the location size and shape parameters W ( i, j ) : Fi ( ) Fj ( ) 0 d x x t t dt x x (1 ( x, x )) i j i j i j Location Size Shape i j In general case of distributions If the two distributions have the same shape: d ( x, x ) : x x W i j i j i j Location Size If they have the same size and shape: d ( x, x ) : x x W i j i j Location

17 Clustering methods for Histogram data based on Wasserstein distance Dynamic Clustering Algorithm Hierarchical method (Ward criterion)

18 Dynamic Clustering Algorithm general schema of the Nuées Dynamiques (Diday 197) The algorithm aims to obtain a partition P of a set E of symbolic data in k clusters and a set L of k prototypes {G 1,,G i,,g k } that best represent the clusters {C 1,,C i,,c k } of the partition P The algorithm optimizes a criterion D of fitting between prototypes and clusters: * * D ( P, L ) Min D( P, L) P P, L k L k where : P k is the set of all the partitions of E into k clusters and L k is the set of k prototypes. The algorithm executes alternatively: an allocation step and a representation step In this case E is a set of histogram data

19 Wasserstein distance for Clustering data according to Dynamic Clustering algorithm classical schema (a) Initialization k prototypes Y(b 1 ),...,Y(b K ) of L are randomly chosen (b) Allocation step For each histogram Y(i) of E the allocation index l to the clusters is computed and Y(i) is assigned to the cluster C l where: (c) Representation step For each cluster C k is identified the prototype Y(b k ) of L that minimizes D ( C, Y( b )) d ( Y( i), Y( b )) k k W k ic k l arg min d ( Y( i), Y( b )) k1,...,k W k (b) and (c) are repeated until the convergence Wasserstein l distance Histogram prototype/ barycenter of histograms belonging to the cluster C k

20 Property of Wasserstein distance: Inertia decomposition Being the prototype the barycenter and Euclidean distance, then is a squared is the inertia for grouped data Then, according to the Huygens theorem TI can be decomposed in Within and Between inertia d W

Some results Application on US monthly temperatures data set We have considered a dataset constituted by the Monthly Average Temperatures recorded in the 48 states of US from 1895 to 004 (Hawaii and

21 Some results Application on US monthly temperatures data set We have considered a dataset constituted by the Monthly Average Temperatures recorded in the 48 states of US from 1895 to 004 (Hawaii and Alaska are not present in the dataset). The analysis consists of the following three steps: 1. Representation of the distributions of temperatures of each State for each month by means of histograms;. Computing of the distance matrix using d W; 3. DCA procedure to find the best partition P 4. Calinski Harabaz index is computed to compute the optimal number k of clusters 5. Hierarchical clustering procedure based on the Ward criterion The original dataset is freely available at the National Climatic Data Center website

22 Histogram barycenter (prototype) Montana February Florida Barycenter (prototype)

23 Experimental results Dynamic Clastering algorithm results: The first 5 columns represent the Quality Partition index (min, max, means, median) on 100 iterations the last one the best value of Calinski-Harabaz index

24 Hierarchical clustering according to the Ward criterion - based on Wasserstein distance For example, we can apply the Ward criterion for a hierarchical clustering of the set E of histogram data: the procedure joins those two classes which minimize

25 Hierarchical tree

26 Final results

27 Regression model for histogram variables based on Wasserstein distance

A Regression model for histogram data Data = Model Fit + Residual Linear regression is a general method for estimating/describing association between a continuous outcome variable (dependent) and one

28 A Regression model for histogram data Data = Model Fit + Residual Linear regression is a general method for estimating/describing association between a continuous outcome variable (dependent) and one or multiple predictors in one equation. Easy conceptual task with classic data But what does it means when dealing with histogram data? Billard, Diday, IFCS 006 Verde, Irpino, COMPSTAT 010; CLADAG 011 Dias, Brito, ISI 011

29 Regression with histograms variables: a proposal in SDA framework A solution was given by Billard and Diday (006) The model fit a linear regression line throught the mixture of the n bivariate distributions Given a punctual value of X it is possible to predict the punctual value of Y

30 Linear Regression Model for histogram data: our approach involves SDA as well as Functional Data Analysis (Verde, Irpino, 010) Given a histogram variable X, we search for a linear trasformation of X which allows us to predict the histogram variable Y For example: given the histogram of the temperatures observed in a region during a month, is it possible to predict the distribution of the temperature of another month using a linear transformation of the histogram variable? A histogram by a histogram

31 Multiple regression model for quantile functions Our concurrent multiple regression model is: in matrix notation: p y ( t) b b x ( t) e ( t) i 0 j ij i j1 Quantile functions associated to histogram/ distribution data Y( t) X ( t) b e( t) This formulation is analogous to the functional linear model (Ramsay, Silverman, 003) except for the constant b parameters and for the functions y i (t), x ij (t) which are quantile functions while each e i (t) is a residual function (distribution?) for all i=1,, n.

32 Parameters estimation - LS method using Wasserstein distance According to the nature of the variables, for the parameters estimation, we propose to extend the Least Squares principle to the functional case using a typical metric between quantile functions: W 0 i j i j d x,x F ( t ) F ( t ) dt Wasserstein l distance between two quantile functions

33 Interpretation of the Linear Regression model The regression model is here proposed to find the best linear transformation of the X j (t) s in order to predict Y(t) y i (t) = b 0 + b 1 x 1i (t) + + b p x pi (t) + e i (t) y i (t), x 1i (t),, x pi (t) are quantile functions and the estimated response variable yˆ i () t is again a quantile function = linear combination of quantile functions

34 Fitting linear regression model Find a linear transformation of the quantile functions of x ij (for j=1,,p) in order to predict the quantile function of y i i.e.: yˆ ( t) b b x ( t) t [0, 1] i 0 j ij j1 p The linear transformation is unique: the parameters b 0 and b j are estimated for all the x ij and y i distributions A first problem: Only if b j > 0 a quantile function yˆ i () t can be derived. In order to overcome this problem, we propose a solution based on the decomposition of the Wasserstein distance and on the NNLS algorithm.

35 Solution The quantile function can be decomposed as: x c ( t) x x ( t) where ij ij ij c x ( t) x ( t) x is the centered quantile function ij ij ij Then, we propose the following regression model: p p c i b0 b j ij g j ij i j1 j1 y ( t) x x ( t) ( t) 0 t 1 yˆ () t i Using the Wasserstein distance it is possible to set up a OLS method that returns the two sets of coefficients (b 0,b j ; g j ).

36 Least Squares parameters estimation n n arg min f ( b, ) ( ), ˆ j g j ei () t dw yi t yi ( t) bj, g j i1 i1 n 1 p p c c SSE f ( b j, g j ) yi yi ( t) b0 b jxij g ijxi ( t) dt i1 0 j1 j1 Matrix notation: 1 c c SSE Y Y ( t) X X ( t) dt 0

37 The estimated parameters x y ny x ( x, y ) ˆ b y ˆ b x; ˆ b ; ˆ g n Correlation between quantile functions x i i i i i xi y i1 i1 n 1 n xi nx xi i1 i1 y i n i It is easy to see that: ˆ b, ˆ b and ˆ g

38 Multiple regression estimated parameters According to the properties of the Wasserstein distance between 1 c c quantile functions, the elements of the product matrix X ( t)' X ( t) computed according to our definition of a inner product operator between two quantile functions x ij and x i k : Then, the parameters are estimated under the constraint j (j=1,, p) using the NNLS (Lawson, Hanson, 1974). 1 ˆ X ' X ' X Y ˆ X ( t)' X ( t) dt X ( t)' Y ( t) dt 1 0 x ( t) x ( t) dt ( x, x ) ij i' k ij i' k x x c c c c 0 0 X c otherwise ij X c i ' k 0 if g 0 j ˆ g 0 it is a transformed matrix by NNLS

39 Interpretation of the parameters Regression parameters for the distribution mean locations ˆ b ˆ,..., ˆ 0, b1 b p Shrinking factors for the variability ˆ 1,..., ˆp g g y > 1 (< 1) the ˆi histogram has a greater (smaller) variability than the x ij histogram.

40 Tools for the interpretation The sum of squares of Y is n n 1 ( ) W i ( ), ( ) i ( ) ( ) i1 i1 0 SS Y d y t y t y t y t dt In classical regression model, the SS(Y) is constituted by: SS Error +SS Regression

41 Decomposition of SS(Y) Being: we obtain p p c i b0 b j ij g j ij j1 j1 yˆ ( t) x x ( t) n n 1 ( ) W i ( ), ( ) i( ) i( ) i1 i1 0 ˆ SS Y d y t y t y t y t dt n 1 1 i i1 0 0 i1 Error y( t) yˆ ( t) dt n y( t) e ( t) dt SS Regression n 1 e ( t) ( y ( ) ˆ i t yi( t)) t [0,1] n SS Average error function Bias

42 The bias The bias is due to different shapes of distributions: p c bias n ˆ y g j( x j( t), y( t)) c x y j j1 Correlation between the average quantile functions bias=0 when all the histograms data have the same shape That represents the incapacity of the linear transformation of fitting distributions that are very different in shape

43 A measure of fitting Pseudo R Considering that n 1 1 n c c ˆ ˆ Regression i i i i1 i1 0 0 SS y y y ( t) y ( t) dt y( t) e( t) dt We propose the following pseudo R PseudoR SS min max 0;1 Error ;1. SS( Y)

An application on the Ozone concentration in 78 USA sites (http://java.epa.

44 An application on the Ozone concentration in 78 USA sites ( Hourly monitoring of Ozone ground-level concentration and other meteorological variables

45 Forecasting OZONE levels Ozone is a gas that can cause respiratory deseases. In the literature there exists studies that relates the OZONE level to the Temperature, the Wind speed and the Solar radiation. Given the distribution of TEMPERATURE (C ), the distribution of WIND SPEED (meters per second) and the distribution of SOLAR RADIATION (watt per square meter), the main objective is to predict the distribution of OZONE CONCENTRATION (particles per billion) using a linear model. We have chosen as period of observation the Summer seasons of 010 and the central hours of the days (10 a.m. 5 p.m.). The model is: OZONEt ( ) b b TEMP b WSPEED b SORAD g TEMP ( t) g WSPEED ( t) g SORAD ( t) e () t c c c 1 3

46 The estimated model OZONE() t Wasserstein based LS (WASS-LS) (the current proposal) In parentesis the 95% Bootstrap C.I TEMP WSPEED SORAD -1.4; ; ,8;,34 0,049;0,09 c c c TEMP ( t) WSPEED ( t) SORAD ( t) 0,49;1,36 1,08;3,0 0,0118;0,04 Billard (006) regression (SREG) OZONE 6, , 358TEMP 1, 555 WSPEED 0, 0086 SORAD [19.17;34.74] [ 0.03;0.51] [0.78;3.5] [0.005;0.0117]

47 Diagnostics Diagnostics WASS- LS SREG Rsquare NA 0.08 Pseudo Rsquare * W (Dias-Brito 011) 0.74 NA RMSE_W ,93* RMSE_L * RMSE. n 1 ˆ L n d Fi i i1 d F ( y) Fˆ ( y) dy i i ( y), F ( y) W n 1i n 1i W W d yˆ ( t), Y d y ( t), Y i i n 1 ˆ W W i t i i1 n 1 where Y n yi i1 RMSE n d y ( ), y ( t) * The Billard model does not allow to compute directly a distribution. Given a set of input distributions, a Montecarlo experiment is needed to compute an output distribution.

48 Main references BILLARD, L. and DIDAY, E. (006): Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley Series in Computational Statistics. John Wiley & Sons. BOCK, H.H. and DIDAY, E. (000): Analysis of Symbolic Data, Exploratory methods for extracting statistical information from complex data. Studies in Classification, Data Analysis and Knowledge Organisation, Springer-Verlag. CUESTA-ALBERTOS, J.A., MATRAN, C., TUERO-DIAZ, A. (1997): Optimal transportation plans and convergence in distribution. Journ. of Multiv. An., 60, GIBBS, A.L. and SU, F.E. (00): On choosing and bounding probability metrics. Intl. Stat. Rev. 7 (3), IRPINO, A., LECHEVALLIER, Y. and VERDE, R. (006): Dynamic clustering of histograms using Wasserstein metric. In: Rizzi, A., Vichi, M. (eds.) COMPSTAT 006. Physica-Verlag, Berlin, VERDE, R. and IRPINO, A.(008): Comparing Histogram data using a Mahalanobis Wasserstein distance. In: Brito, P. (eds.) COMPSTAT 008. Physica Verlag, Springer, Berlin, VERDE, R. and IRPINO, A.(010): Ordinary Least Squares for Histogram Data based on Wasserstein Distance COMPSTAT 010 DIAS, S. and BRITO P.,(011) A new linear regression model for histogram-valued variables, ISI 011

49 Thank you Rosanna Verde Antonio Irpino

Histogram data analysis based on Wasserstein distance

Histogram data analysis based on Wasserstein distance Rosanna Verde Antonio Irpino Department of European and Mediterranean Studies Second University of Naples Caserta - ITALY Aims Introduce: New distances