Histogram data analysis based on Wasserstein distance
1 Histogram data analysis based on Wasserstein distance. Rosanna Verde, Antonio Irpino. Department of European and Mediterranean Studies, Second University of Naples, Caserta, Italy. Symposium on Learning and Data Science, May 7, 8 & 9, 2012, Florence, Italy.
2 Outline
- Histogram (distribution) data in the framework of Symbolic Data Analysis
- New distances to compare histogram (distribution) data: the Wasserstein-Mallows $\ell_2$ distance
- Some properties of this distance
- Methodological approaches based on the $\ell_2$ Wasserstein distance:
  - basic statistics for histogram data (including intervals and distributions as particular cases)
  - clustering methods: DCA and hierarchical (Ward criterion)
  - linear regression model: $\ell_2$ as the metric for the sum of squared errors in the OLS estimation method
3 Histogram symbolic variable [Bock and Diday (2000)]
A modal variable is a multi-valued variable with a frequency, probability, or weight $\pi_i$ associated with each of its specific values U(i). That is, the modal variable Y is a mapping $Y(i) = \{U(i), \pi_i\}$ for $i \in \Omega$, where $\pi_i$ is a nonnegative measure or a distribution on the domain $\mathcal{Y}$ of possible observation values and $U(i) \subseteq \mathcal{Y}$ is the support of $\pi_i$.
A histogram variable is a particular case of a modal variable, in which the support of Y is numerical and the U(i) are contiguous, non-overlapping intervals $U(i) = \{I_{i1}, \dots, I_{ih}, \dots, I_{iH}\}$, with $I_{ih} \cap I_{ih'} = \emptyset$ for $h \ne h'$, with an associated measure $\pi_i$.
Single observations of a histogram variable Y are histogram data.
4 Histogram data as a particular case of modal symbolic descriptions [Bock and Diday (2000)]
Histogram data is a model for representing the empirical distribution of a continuous variable Y partitioned into a set of H contiguous intervals (bins) $I_h$ with associated weights $p_h$. A histogram is represented by a set of H ordered pairs $\{(I_h, p_h)\}$ such that:
$$I_h = [\underline{y}_h; \overline{y}_h), \quad \underline{y}_h \le \overline{y}_h, \quad h = 1, \dots, H$$
$$\bigcup_{h=1,\dots,H} I_h = \Big[\min_{h} \underline{y}_h; \max_{h} \overline{y}_h\Big], \quad I_h \cap I_{h'} = \emptyset \ \ (h \ne h')$$
$$p_h \ge 0 \ \ (h = 1, \dots, H), \quad \sum_{h=1,\dots,H} p_h = 1$$
Example: Y(i) = [([0-10), 0.2); ([10-20), 0.5); ([20-30), 0.1); ([30-40], 0.2)]
[Figure: bar chart of the example histogram Y(i).]
5 Main sources of histogram data
- Results of summary/clustering procedures: from surveys, from large databases
- From sensors: temperatures, pollutant concentration, network activity
- Data streams: description of time-window data sequences
- Image analysis: color bandwidths
- Confidentiality data: summary data, non punctual
[Figure: example histograms for each source.]
6 Comparison of two histogram data
How do we compare two units described by two histograms? One possibility is to use metrics developed for comparing probability distributions.
[Figure: two example histograms to be compared.]
7 Metrics used for the evaluation of the convergence of two probability measures (Gibbs and Su, 2002)
Given:
- a domain $\Omega$ on which it is possible to define a Borel $\sigma$-algebra,
- two measures $\mu$ and $\nu$,
- the density functions f and g,
- the corresponding distribution functions F and G, and
- a dominating measure, like $\lambda = (\mu + \nu)/2$,
Gibbs and Su (2002) present a review of the most used dissimilarities.
[Table of dissimilarities not reproduced in the transcription.]
8 A suitable measure to compute the distance between histograms: the Wasserstein-Kantorovich metric
Wasserstein-Kantorovich metric:
$$d_{W_1}(F, G) = \int_0^1 \left| F^{-1}(t) - G^{-1}(t) \right| dt$$
The derived $\ell_2$ Mallows distance between two quantile functions:
$$d_W(F, G) = \sqrt{\int_0^1 \left( F^{-1}(t) - G^{-1}(t) \right)^2 dt}$$
The main difficulty in computing this distance is the analytical definition of the quantile function. In our case, however, we deal specifically with histogram data.
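9 Histograms are locally uniform
Having represented a histogram Y(i) as $\{(I_1, p_1), \dots, (I_h, p_h), \dots, (I_H, p_H)\}$ (where H is the number of intervals of the support), we may define:
the empirical density function
$$f(y) = \begin{cases} \dfrac{p_h}{\overline{y}_h - \underline{y}_h} & \text{if } y \in [\underline{y}_h; \overline{y}_h) \\ 0 & \text{otherwise} \end{cases}$$
the empirical distribution function, through the cumulative weights
$$w_h = \begin{cases} 0 & \text{if } h = 0 \\ \sum_{k=1}^{h} p_k & \text{if } h = 1, \dots, H \end{cases}$$
$$F(y) = w_{h-1} + \frac{y - \underline{y}_h}{\overline{y}_h - \underline{y}_h} (w_h - w_{h-1}) \quad \text{iff } \underline{y}_h \le y < \overline{y}_h$$
hence the empirical quantile function (the inverse of the distribution function)
$$F^{-1}(t) = \underline{y}_h + \frac{t - w_{h-1}}{w_h - w_{h-1}} (\overline{y}_h - \underline{y}_h) \quad \text{where } w_{h-1} \le t < w_h, \ 0 \le t \le 1$$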
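As a minimal sketch of the quantile function defined on this slide: because the density is uniform inside each bin, $F^{-1}$ is piecewise linear in t, so linear interpolation over the cumulative weights reproduces it. The function name `histogram_quantile` and the array layout (bin edges plus weights) are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def histogram_quantile(bounds, weights, t):
    """Evaluate the quantile function F^{-1}(t) of a histogram.

    bounds  : H + 1 increasing bin edges
    weights : H positive bin weights p_h that sum to 1
    t       : probability level(s) in [0, 1]
    """
    bounds = np.asarray(bounds, dtype=float)
    # Cumulative weights w_0 = 0, w_h = p_1 + ... + p_h.
    w = np.concatenate(([0.0], np.cumsum(weights)))
    w[-1] = 1.0  # absorb floating-point drift in the last cumulative weight
    # F^{-1} is linear between (w_{h-1}, lower edge) and (w_h, upper edge),
    # so interpolating over the cumulative weights reproduces it.
    return np.interp(t, w, bounds)

# Illustrative histogram: weights 0.2, 0.5, 0.1, 0.2 on four bins of width 10.
print(histogram_quantile([0, 10, 20, 30, 40], [0.2, 0.5, 0.1, 0.2], [0.1, 0.45, 0.9]))
```

The sketch assumes strictly positive weights; empty bins would create ties in the cumulative weights and make the inverse ambiguous.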
10 Geometric interpretation
The comparison of two quantile functions associated with two histograms requires partitioning the domain [0, 1] of the two quantile functions into m intervals, on each of which both histograms have uniform density.
[Figure: the two quantile functions cut at the common cumulative levels t = 0.15, 0.30, 0.60, 0.70 and 0.90, which delimit the elementary intervals of each histogram.]
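11 Elementary intervals $I_{li} = [\underline{y}_{li}, \overline{y}_{li}]$ and $I_{lj} = [\underline{y}_{lj}, \overline{y}_{lj}]$ of the two histograms:
l | I_li | I_lj
1 | [0; 5] | [60; 70]
2 | [5; 10] | [70; 73.3]
3 | [10; 16.6] | [73.3; 80]
4 | [16.6; 20] | [80; 83.3]
5 | [20; 30] | [83.3; 90]
6 | [30; 40] | [90; 100]
The slide's table also has columns for $p_l$, $d_W^2(I_{li}; I_{lj})$ and $p_l \, d_W^2(I_{li}; I_{lj})$; their values are not in the transcription.
$$d_W^2(Y(i), Y(j)) = \sum_{l=1}^{m} p_l \, d_W^2(I_{li}, I_{lj}) = \sum_{l=1}^{m} p_l \left[ (c_{li} - c_{lj})^2 + \frac{1}{3} (r_{li} - r_{lj})^2 \right]$$
with $I_{li} = [\underline{y}_{li}, \overline{y}_{li}]$; $c_{li} = (\overline{y}_{li} + \underline{y}_{li})/2$ and $r_{li} = (\overline{y}_{li} - \underline{y}_{li})/2$ the center and radius of $I_{li}$. (Verde, Irpino, COMPSTAT 2006)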
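The center-and-radius formula on this slide can be sketched as follows. The code first cuts both quantile functions at the union of their cumulative weights, then sums the weighted squared differences of centers and radii. The function names and the second histogram used in the example are hypothetical, chosen only for illustration.

```python
import numpy as np

def _cum_weights(weights):
    # Cumulative weights w_0 = 0, ..., w_H = 1, rounded to merge near-duplicates.
    w = np.round(np.concatenate(([0.0], np.cumsum(weights))), 12)
    w[-1] = 1.0
    return w

def wasserstein2_sq(bounds_i, weights_i, bounds_j, weights_j):
    """Squared l2 Wasserstein distance between two histograms."""
    wi, wj = _cum_weights(weights_i), _cum_weights(weights_j)
    t = np.union1d(wi, wj)   # common cumulative levels
    p = np.diff(t)           # weights p_l of the elementary intervals
    # Quantile values of each histogram at the common levels.
    qi = np.interp(t, wi, np.asarray(bounds_i, dtype=float))
    qj = np.interp(t, wj, np.asarray(bounds_j, dtype=float))
    # Centers and radii of the elementary intervals I_li and I_lj.
    ci, ri = (qi[1:] + qi[:-1]) / 2, (qi[1:] - qi[:-1]) / 2
    cj, rj = (qj[1:] + qj[:-1]) / 2, (qj[1:] - qj[:-1]) / 2
    return float(np.sum(p * ((ci - cj) ** 2 + (ri - rj) ** 2 / 3)))

# Two single-bin (uniform) histograms: same radius, centers 60 apart.
print(wasserstein2_sq([0, 10], [1.0], [60, 70], [1.0]))  # 3600.0
```

Because both quantile functions are linear on every elementary interval, the sum is exact for histograms and no numerical integration is needed.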
12 Multivariate distance
Assuming independence among the p variables, we propose an extension of the $d_W$ distance to the multivariate case, in which the squared distances computed on each of the p variables are summed.
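13 Average histogram based on the Wasserstein distance $d_W$
The barycenter (average) histogram Y(b) of a set of histogram data Y(i) (i = 1, ..., n) can be computed by minimizing the sum of squared distances (as for a cloud of points in classical data analysis):
$$f(Y(b)) = \sum_{i=1}^{n} d_W^2(Y(i), Y(b)) = \sum_{i=1}^{n} \sum_{l=1}^{m} p_l \left[ (c_{li} - c_{lb})^2 + \frac{1}{3} (r_{li} - r_{lb})^2 \right]$$
This is minimized when the usual first-order conditions are satisfied:
$$\frac{\partial f}{\partial c_{lb}} = -2 p_l \sum_{i=1}^{n} (c_{li} - c_{lb}) = 0, \qquad \frac{\partial f}{\partial r_{lb}} = -\frac{2}{3} p_l \sum_{i=1}^{n} (r_{li} - r_{lb}) = 0$$
$$c_{lb} = n^{-1} \sum_{i=1}^{n} c_{li}, \qquad r_{lb} = n^{-1} \sum_{i=1}^{n} r_{li}$$
$$Y(b) = \{ ([c_{1b} - r_{1b}; c_{1b} + r_{1b}], p_1); \dots; ([c_{lb} - r_{lb}; c_{lb} + r_{lb}], p_l); \dots; ([c_{mb} - r_{mb}; c_{mb} + r_{mb}], p_m) \}$$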
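A minimal sketch of the barycenter computation, under the assumption that each histogram is given as a pair of bin edges and weights. Averaging the quantile values at the common cumulative levels is equivalent to averaging centers and radii, since each interval bound is $c \pm r$. The function name `barycenter` is hypothetical.

```python
import numpy as np

def barycenter(histograms):
    """Barycenter of histograms given as (bounds, weights) pairs.

    Returns the bin edges and weights of the average histogram Y(b).
    """
    cums = []
    for _, weights in histograms:
        w = np.round(np.concatenate(([0.0], np.cumsum(weights))), 12)
        w[-1] = 1.0
        cums.append(w)
    # Common cumulative levels shared by all histograms.
    t = np.unique(np.concatenate(cums))
    # Quantile values of every histogram at the common levels, shape (n, m + 1).
    q = np.array([
        np.interp(t, w, np.asarray(bounds, dtype=float))
        for w, (bounds, _) in zip(cums, histograms)
    ])
    # Mean of the bounds = mean center -/+ mean radius for each elementary interval.
    return q.mean(axis=0), np.diff(t)

edges, weights = barycenter([([0, 10], [1.0]), ([60, 70], [1.0])])
print(edges, weights)  # edges [30. 40.], weights [1.]
```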
14 Variance of the set of histograms Y(i)
Having determined Y(b) through the minimization problem
$$Y(b^*) = \arg\min_{b} \sum_{i=1}^{n} d_W^2(Y(i), Y(b))$$
the variance of the set of histogram data is:
$$\frac{1}{n} \sum_{i=1}^{n} \sum_{l=1}^{m} p_l \left[ (c_{li} - c_{lb})^2 + \frac{1}{3} (r_{li} - r_{lb})^2 \right]$$
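15 QQ plot: mean, variance and correlation
Mean and variance of the quantile function:
$$\bar{x}_i = \int_0^1 x_i(t) \, dt = \sum_{l} p_l c_{il}$$
$$\sigma^2_{x_i} = \int_0^1 (x_i(t))^2 dt - \bar{x}_i^2 = \sum_{l} p_l \left( c_{il}^2 + \frac{1}{3} r_{il}^2 \right) - \bar{x}_i^2$$
Correlation between quantile functions:
$$\rho(x_i, x_j) = \frac{\int_0^1 x_i(t) x_j(t) \, dt - \bar{x}_i \bar{x}_j}{\sigma_{x_i} \sigma_{x_j}} = \frac{\sum_{l} p_l \left( c_{il} c_{jl} + \frac{1}{3} r_{il} r_{jl} \right) - \bar{x}_i \bar{x}_j}{\sigma_{x_i} \sigma_{x_j}}$$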
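The three statistics above can be sketched directly from centers and radii. The code assumes each histogram is already described by its quantile values `q` at m + 1 common cumulative levels, with level weights `p`; the function names are hypothetical.

```python
import numpy as np

def centers_radii(q):
    # Centers and radii of the elementary intervals delimited by q.
    q = np.asarray(q, dtype=float)
    return (q[1:] + q[:-1]) / 2, (q[1:] - q[:-1]) / 2

def mean_std(q, p):
    """Mean and standard deviation of a histogram's quantile function."""
    p = np.asarray(p, dtype=float)
    c, r = centers_radii(q)
    mean = np.sum(p * c)
    var = np.sum(p * (c ** 2 + r ** 2 / 3)) - mean ** 2
    return mean, np.sqrt(var)

def correlation(qi, qj, p):
    """Correlation between two quantile functions on common levels."""
    p = np.asarray(p, dtype=float)
    (ci, ri), (cj, rj) = centers_radii(qi), centers_radii(qj)
    (mi, si), (mj, sj) = mean_std(qi, p), mean_std(qj, p)
    cross = np.sum(p * (ci * cj + ri * rj / 3))
    return (cross - mi * mj) / (si * sj)

# A uniform distribution on [0, 12] has mean 6 and variance 12.
print(mean_std([0, 12], [1.0]))
```

Two uniform histograms have identical shape, so their quantile functions have correlation 1 whatever their location and width.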
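16 Interpretative decomposition of the distance into three components related to the location, size and shape parameters
In the general case of distributions:
$$d_W^2(x_i, x_j) := \int_0^1 \left( F_i^{-1}(t) - F_j^{-1}(t) \right)^2 dt = \underbrace{(\bar{x}_i - \bar{x}_j)^2}_{\text{Location}} + \underbrace{(\sigma_i - \sigma_j)^2}_{\text{Size}} + \underbrace{2 \sigma_i \sigma_j (1 - \rho(x_i, x_j))}_{\text{Shape}}$$
If the two distributions have the same shape:
$$d_W^2(x_i, x_j) := \underbrace{(\bar{x}_i - \bar{x}_j)^2}_{\text{Location}} + \underbrace{(\sigma_i - \sigma_j)^2}_{\text{Size}}$$
If they have the same size and shape:
$$d_W^2(x_i, x_j) := \underbrace{(\bar{x}_i - \bar{x}_j)^2}_{\text{Location}}$$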
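A small numerical check of this decomposition, assuming both histograms are given by their quantile values at common cumulative levels. The function name `decompose_distance` and the example data are illustrative assumptions.

```python
import numpy as np

def decompose_distance(qi, qj, p):
    """Split the squared Wasserstein distance into location, size and shape."""
    qi, qj, p = (np.asarray(a, dtype=float) for a in (qi, qj, p))
    ci, ri = (qi[1:] + qi[:-1]) / 2, (qi[1:] - qi[:-1]) / 2
    cj, rj = (qj[1:] + qj[:-1]) / 2, (qj[1:] - qj[:-1]) / 2
    mi, mj = np.sum(p * ci), np.sum(p * cj)
    si = np.sqrt(np.sum(p * (ci ** 2 + ri ** 2 / 3)) - mi ** 2)
    sj = np.sqrt(np.sum(p * (cj ** 2 + rj ** 2 / 3)) - mj ** 2)
    rho = (np.sum(p * (ci * cj + ri * rj / 3)) - mi * mj) / (si * sj)
    return {
        "d2": np.sum(p * ((ci - cj) ** 2 + (ri - rj) ** 2 / 3)),
        "location": (mi - mj) ** 2,
        "size": (si - sj) ** 2,
        "shape": 2 * si * sj * (1 - rho),
    }

# Hypothetical pair of histograms evaluated at their common levels.
t = np.array([0, 0.15, 0.2, 0.6, 0.7, 0.8, 0.9, 1.0])
qi = np.interp(t, [0, 0.2, 0.7, 0.8, 1.0], [0, 10, 20, 30, 40])
qj = np.interp(t, [0, 0.15, 0.6, 0.9, 1.0], [60, 70, 80, 90, 100])
parts = decompose_distance(qi, qj, np.diff(t))
print(parts)
```

Shifting a histogram by a constant leaves size and shape untouched, so only the location term is non-zero in that case.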
17 Clustering methods for histogram data based on Wasserstein distance
- Dynamic Clustering Algorithm
- Hierarchical method (Ward criterion)
18 Dynamic Clustering Algorithm: general schema of the Nuées Dynamiques (Diday 1972)
The algorithm aims to obtain a partition P of a set E of symbolic data into k clusters and a set L of k prototypes $\{G_1, \dots, G_i, \dots, G_k\}$ that best represent the clusters $\{C_1, \dots, C_i, \dots, C_k\}$ of the partition P. The algorithm optimizes a criterion $\Delta$ of fit between prototypes and clusters:
$$\Delta(P^*, L^*) = \min \{ \Delta(P, L) \mid P \in P_k, \ L \in L_k \}$$
where $P_k$ is the set of all the partitions of E into k clusters and $L_k$ is the space of the sets of k prototypes. The algorithm alternates an allocation step and a representation step. In this case E is a set of histogram data.
19 Wasserstein distance for clustering data according to the classical schema of the Dynamic Clustering algorithm
(a) Initialization: k prototypes $Y(b_1), \dots, Y(b_K)$ of L are randomly chosen.
(b) Allocation step: for each histogram Y(i) of E, the allocation index l is computed and Y(i) is assigned to the cluster $C_l$, where
$$l = \arg\min_{k = 1, \dots, K} d_W^2(Y(i), Y(b_k))$$
(c) Representation step: for each cluster $C_k$, the prototype $Y(b_k)$ of L is identified that minimizes
$$\Delta(C_k, Y(b_k)) = \sum_{i \in C_k} d_W^2(Y(i), Y(b_k))$$
Steps (b) and (c) are repeated until convergence. Here $d_W$ is the $\ell_2$ Wasserstein distance, and the histogram prototype is the barycenter of the histograms belonging to the cluster $C_k$.
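Steps (a) to (c) can be sketched as below. This is a simplified illustration, not the authors' implementation: it assumes every histogram has already been cut at the same cumulative levels, so that it is described by a row of centers `C[i]` and radii `R[i]` with level weights `p`. All names are hypothetical.

```python
import numpy as np

def _sq_dist(C, R, Cb, Rb, p):
    # Squared Wasserstein distance of every histogram to every prototype, shape (n, k).
    return (p * ((C[:, None, :] - Cb[None]) ** 2
                 + (R[:, None, :] - Rb[None]) ** 2 / 3)).sum(axis=2)

def dynamic_clustering(C, R, p, k, max_iter=100, seed=0):
    C, R, p = (np.asarray(a, dtype=float) for a in (C, R, p))
    n = C.shape[0]
    rng = np.random.default_rng(seed)
    # (a) Initialization: k histograms drawn at random act as prototypes.
    start = rng.choice(n, size=k, replace=False)
    Cb, Rb = C[start].copy(), R[start].copy()
    labels = np.full(n, -1)
    for _ in range(max_iter):
        # (b) Allocation: assign each histogram to its closest prototype.
        new_labels = _sq_dist(C, R, Cb, Rb, p).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # the partition no longer changes
        labels = new_labels
        # (c) Representation: each prototype becomes the barycenter of its cluster,
        # i.e. the mean of centers and radii of its members.
        for j in range(k):
            members = labels == j
            if members.any():
                Cb[j], Rb[j] = C[members].mean(axis=0), R[members].mean(axis=0)
    criterion = _sq_dist(C, R, Cb, Rb, p)[np.arange(n), labels].sum()
    return labels, (Cb, Rb), criterion
```

As with k-means, the result depends on the random initialization, which is why several runs are usually compared on the final criterion value.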
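20 Property of Wasserstein distance: inertia decomposition
Since the prototype is the barycenter and $d_W^2$ is a squared Euclidean distance between quantile functions, the criterion $\Delta$ is the inertia for grouped data. Then, according to the Huygens theorem, the total inertia TI can be decomposed into within and between inertia:
$$TI = \sum_{i=1}^{n} d_W^2(Y(i), Y(b)) = \sum_{k} \sum_{i \in C_k} d_W^2(Y(i), Y(b_k)) + \sum_{k} |C_k| \, d_W^2(Y(b_k), Y(b))$$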
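The Huygens decomposition can be checked numerically. The sketch below assumes histograms expressed as centers `C` and radii `R` on common cumulative levels with weights `p`, and an arbitrary label vector; names are hypothetical.

```python
import numpy as np

def d2(C1, R1, C2, R2, p):
    # Squared Wasserstein distance from centers and radii on common levels.
    return np.sum(p * ((C1 - C2) ** 2 + (R1 - R2) ** 2 / 3), axis=-1)

def inertia_decomposition(C, R, p, labels):
    """Return (total, within, between) inertia for a given partition."""
    Cg, Rg = C.mean(axis=0), R.mean(axis=0)        # general barycenter Y(b)
    total = d2(C, R, Cg, Rg, p).sum()
    within = between = 0.0
    for k in np.unique(labels):
        members = labels == k
        Ck, Rk = C[members].mean(axis=0), R[members].mean(axis=0)  # cluster barycenter
        within += d2(C[members], R[members], Ck, Rk, p).sum()
        between += members.sum() * d2(Ck, Rk, Cg, Rg, p)
    return total, within, between

# Random but valid histograms: sorted quantile values at 5 common levels.
rng = np.random.default_rng(0)
Q = np.sort(rng.normal(size=(12, 5)), axis=1)
C, R = (Q[:, 1:] + Q[:, :-1]) / 2, (Q[:, 1:] - Q[:, :-1]) / 2
p = np.full(4, 0.25)
total, within, between = inertia_decomposition(C, R, p, np.arange(12) % 3)
print(total, within + between)
```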
21 Some results: application on the US monthly temperatures data set
We considered a dataset consisting of the monthly average temperatures recorded in the 48 states of the US from 1895 to 2004 (Hawaii and Alaska are not present in the dataset). The analysis consists of the following steps:
1. Representation of the distributions of temperatures of each state for each month by means of histograms;
2. Computation of the distance matrix using $d_W$;
3. DCA procedure to find the best partition P;
4. Computation of the Calinski-Harabasz index to choose the optimal number k of clusters;
5. Hierarchical clustering procedure based on the Ward criterion.
The original dataset is freely available at the National Climatic Data Center website.
22 Histogram barycenter (prototype)
[Figure: February temperature histograms for Montana and Florida, and their barycenter (prototype).]
23 Experimental results
Dynamic Clustering algorithm results: the first 5 columns represent the quality-of-partition index (min, max, mean, median) over 100 iterations; the last one gives the best value of the Calinski-Harabasz index.
24 Hierarchical clustering according to the Ward criterion, based on Wasserstein distance
For example, we can apply the Ward criterion for a hierarchical clustering of the set E of histogram data: at each step the procedure joins the two classes whose merger minimizes the increase in within-cluster inertia, computed with the squared Wasserstein distance.
25 Hierarchical tree
26 Final results
27 Regression model for histogram variables based on Wasserstein distance
28 A regression model for histogram data
Data = Model Fit + Residual. Linear regression is a general method for estimating/describing the association between a continuous outcome (dependent) variable and one or more predictors in one equation. This is an easy conceptual task with classic data, but what does it mean when dealing with histogram data? (Billard, Diday, IFCS 2006; Verde, Irpino, COMPSTAT 2010, CLADAG 2011; Dias, Brito, ISI 2011)
29 Regression with histogram variables: a proposal in the SDA framework
A solution was given by Billard and Diday (2006). The model fits a linear regression line through the mixture of the n bivariate distributions. Given a punctual value of X, it is possible to predict the punctual value of Y.
30 Linear regression model for histogram data: our approach involves SDA as well as Functional Data Analysis (Verde, Irpino, 2010)
Given a histogram variable X, we search for a linear transformation of X which allows us to predict the histogram variable Y. For example: given the histogram of the temperatures observed in a region during a month, is it possible to predict the distribution of the temperature of another month using a linear transformation of the histogram variable? A histogram is predicted by a histogram.
31 Multiple regression model for quantile functions
Our concurrent multiple regression model, written for the quantile functions associated with histogram/distribution data, is:
$$y_i(t) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}(t) + \varepsilon_i(t)$$
in matrix notation:
$$Y(t) = X(t) \beta + \varepsilon(t)$$
This formulation is analogous to the functional linear model (Ramsay, Silverman, 2003) except for the constant $\beta$ parameters and for the functions $y_i(t)$, $x_{ij}(t)$, which are quantile functions, while each $\varepsilon_i(t)$ is a residual function (distribution?) for all i = 1, ..., n.
32 Parameter estimation: LS method using Wasserstein distance
Given the nature of the variables, for the parameter estimation we propose to extend the Least Squares principle to the functional case, using a typical metric between quantile functions, the $\ell_2$ Wasserstein distance:
$$d_W^2(x_i, x_j) = \int_0^1 \left( F_i^{-1}(t) - F_j^{-1}(t) \right)^2 dt$$
33 Interpretation of the linear regression model
The regression model is proposed here to find the best linear transformation of the $X_j(t)$'s in order to predict Y(t):
$$y_i(t) = \beta_0 + \beta_1 x_{1i}(t) + \dots + \beta_p x_{pi}(t) + \varepsilon_i(t)$$
$y_i(t), x_{1i}(t), \dots, x_{pi}(t)$ are quantile functions, and the estimated response variable $\hat{y}_i(t)$ is again a quantile function, being a linear combination of quantile functions.
34 Fitting the linear regression model
Find a linear transformation of the quantile functions of $x_{ij}$ (for j = 1, ..., p) in order to predict the quantile function of $y_i$, i.e.:
$$\hat{y}_i(t) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}(t) \quad \forall t \in [0, 1]$$
The linear transformation is unique: the parameters $\beta_0$ and $\beta_j$ are estimated for all the $x_{ij}$ and $y_i$ distributions.
A first problem: a quantile function $\hat{y}_i(t)$ can be derived only if $\beta_j > 0$. To overcome this problem, we propose a solution based on the decomposition of the Wasserstein distance and on the NNLS algorithm.
35 Solution
The quantile function can be decomposed as:
$$x_{ij}(t) = \bar{x}_{ij} + x^c_{ij}(t), \quad \text{where } x^c_{ij}(t) = x_{ij}(t) - \bar{x}_{ij} \text{ is the centered quantile function.}$$
Then, we propose the following regression model:
$$y_i(t) = \underbrace{\beta_0 + \sum_{j=1}^{p} \beta_j \bar{x}_{ij} + \sum_{j=1}^{p} \gamma_j x^c_{ij}(t)}_{\hat{y}_i(t)} + \varepsilon_i(t), \quad 0 \le t \le 1$$
Using the Wasserstein distance it is possible to set up an OLS method that returns the two sets of coefficients $(\beta_0, \beta_j; \gamma_j)$.
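36 Least Squares parameter estimation
$$\arg\min_{\beta_j, \gamma_j} f(\beta_j, \gamma_j) = \sum_{i=1}^{n} \| \varepsilon_i(t) \|^2 = \sum_{i=1}^{n} d_W^2(y_i(t), \hat{y}_i(t))$$
$$SSE = f(\beta_j, \gamma_j) = \sum_{i=1}^{n} \int_0^1 \left[ \bar{y}_i + y^c_i(t) - \beta_0 - \sum_{j=1}^{p} \beta_j \bar{x}_{ij} - \sum_{j=1}^{p} \gamma_j x^c_{ij}(t) \right]^2 dt$$
Matrix notation:
$$SSE = \int_0^1 \left[ \bar{Y} + Y^c(t) - \bar{X} \beta - X^c(t) \gamma \right]' \left[ \bar{Y} + Y^c(t) - \bar{X} \beta - X^c(t) \gamma \right] dt$$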
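37 The estimated parameters
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}; \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n} \bar{x}_i \bar{y}_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} \bar{x}_i^2 - n \bar{x}^2}; \qquad \hat{\gamma}_1 = \frac{\sum_{i=1}^{n} \rho(x_i, y_i) \sigma_{x_i} \sigma_{y_i}}{\sum_{i=1}^{n} \sigma^2_{x_i}}$$
where $\rho(x_i, y_i)$ is the correlation between quantile functions, $\bar{x} = n^{-1} \sum_i \bar{x}_i$ and $\bar{y} = n^{-1} \sum_i \bar{y}_i$.
It is easy to see that $\hat{\beta}_0$ and $\hat{\beta}_1$ are unconstrained, while $\hat{\gamma}_1 \ge 0$, because the correlation between two quantile functions is non-negative.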
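A minimal sketch of these closed-form estimates for one predictor. It assumes that all histograms are described by their quantile values at the same m + 1 cumulative levels (rows of `Qx` and `Qy`) with level weights `p`; the function name is hypothetical.

```python
import numpy as np

def simple_wasserstein_ls(Qx, Qy, p):
    """Estimate (beta0, beta1, gamma) for one histogram predictor."""
    Qx, Qy, p = (np.asarray(a, dtype=float) for a in (Qx, Qy, p))
    cx, rx = (Qx[:, 1:] + Qx[:, :-1]) / 2, (Qx[:, 1:] - Qx[:, :-1]) / 2
    cy, ry = (Qy[:, 1:] + Qy[:, :-1]) / 2, (Qy[:, 1:] - Qy[:, :-1]) / 2
    xm, ym = cx @ p, cy @ p          # means of each x_i(t) and y_i(t)
    n = len(xm)
    # Classical least squares on the means.
    beta1 = (np.sum(xm * ym) - n * xm.mean() * ym.mean()) / (np.sum(xm ** 2) - n * xm.mean() ** 2)
    beta0 = ym.mean() - beta1 * xm.mean()
    # rho * sigma_x * sigma_y equals the integral of the product of centered quantile functions.
    cov_xy = (cx * cy + rx * ry / 3) @ p - xm * ym
    var_x = (cx ** 2 + rx ** 2 / 3) @ p - xm ** 2
    gamma = cov_xy.sum() / var_x.sum()
    return beta0, beta1, gamma

# Synthetic check: y_i(t) = 2 + 3 * mean(x_i) + 0.5 * centered x_i(t).
rng = np.random.default_rng(1)
Qx = np.sort(rng.uniform(0, 50, size=(20, 6)), axis=1)
p = np.full(5, 0.2)
xm = ((Qx[:, 1:] + Qx[:, :-1]) / 2) @ p
Qy = 2 + 3 * xm[:, None] + 0.5 * (Qx - xm[:, None])
print(simple_wasserstein_ls(Qx, Qy, p))
```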
38 Multiple regression estimated parameters
According to the properties of the Wasserstein distance between quantile functions, the elements of the product matrix $X^c(t)' X^c(t)$ are computed according to our definition of an inner product operator between two quantile functions $x_{ij}$ and $x_{i'k}$:
$$\int_0^1 x^c_{ij}(t) \, x^c_{i'k}(t) \, dt = \rho(x_{ij}, x_{i'k}) \, \sigma_{x_{ij}} \sigma_{x_{i'k}}$$
Then, the parameters are estimated under the constraint $\gamma_j \ge 0$ (j = 1, ..., p) using NNLS (Lawson, Hanson, 1974):
$$\hat{\beta} = (\bar{X}' \bar{X})^{-1} \bar{X}' \bar{Y}$$
$$\hat{\gamma} = \left[ \int_0^1 X^c(t)' X^c(t) \, dt \right]^{-1} \int_0^1 X^c(t)' Y^c(t) \, dt \quad \text{if } \hat{\gamma}_j \ge 0 \text{ for all } j$$
Otherwise $\hat{\gamma}$ is obtained from the same expression with $X^c$ replaced by a matrix transformed by NNLS.
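One way to sketch this two-part estimation, not necessarily the authors' procedure: the $\beta$'s are fitted by ordinary least squares on the means, and the $\gamma$'s by `scipy.optimize.nnls` on a design matrix whose squared Euclidean norm equals the integral of the squared centered residual. The array layout (quantile values at common cumulative levels) and the function name are assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def fit_wasserstein_ls(Qy, Qx, p):
    """Qy: (n, m+1) quantile values, Qx: (n, J, m+1), p: (m,) level weights."""
    Qy, Qx, p = (np.asarray(a, dtype=float) for a in (Qy, Qx, p))
    cy, ry = (Qy[:, 1:] + Qy[:, :-1]) / 2, (Qy[:, 1:] - Qy[:, :-1]) / 2
    cx, rx = (Qx[..., 1:] + Qx[..., :-1]) / 2, (Qx[..., 1:] - Qx[..., :-1]) / 2
    ybar, xbar = cy @ p, cx @ p                      # means, shapes (n,) and (n, J)
    # beta: unconstrained least squares on the means (intercept first).
    design_means = np.column_stack([np.ones(len(ybar)), xbar])
    beta, *_ = np.linalg.lstsq(design_means, ybar, rcond=None)
    # gamma: for a piecewise-linear function, the integral of its square is
    # sum_l p_l (c_l^2 + r_l^2 / 3), so scaled centers and radii act as plain columns.
    sp = np.sqrt(p)
    Xc = np.concatenate([(cx - xbar[..., None]) * sp, rx * sp / np.sqrt(3)], axis=2)
    yc = np.concatenate([(cy - ybar[:, None]) * sp, ry * sp / np.sqrt(3)], axis=1)
    J = Qx.shape[1]
    gamma, _ = nnls(Xc.transpose(0, 2, 1).reshape(-1, J), yc.reshape(-1))
    return beta, gamma
```

Since every $\gamma_j$ is non-negative, the fitted $\hat{y}_i(t)$ is a non-decreasing function of t, that is, a valid quantile function.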
39 Interpretation of the parameters
Regression parameters for the distribution mean locations: $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p$.
Shrinking factors for the variability: $\hat{\gamma}_1, \dots, \hat{\gamma}_p$. If $\hat{\gamma}_j > 1$ (< 1), the $\hat{y}_i$ histogram has a greater (smaller) variability than the $x_{ij}$ histogram.
40 Tools for the interpretation
The sum of squares of Y is
$$SS(Y) = \sum_{i=1}^{n} d_W^2(y_i(t), \bar{y}(t)) = \sum_{i=1}^{n} \int_0^1 (y_i(t) - \bar{y}(t))^2 dt$$
In the classical regression model, SS(Y) is constituted by $SS_{Error} + SS_{Regression}$.
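41 Decomposition of SS(Y)
Being
$$\hat{y}_i(t) = \beta_0 + \sum_{j=1}^{p} \beta_j \bar{x}_{ij} + \sum_{j=1}^{p} \gamma_j x^c_{ij}(t)$$
we obtain
$$SS(Y) = \sum_{i=1}^{n} d_W^2(y_i(t), \bar{y}(t)) = \sum_{i=1}^{n} \int_0^1 (y_i(t) - \bar{y}(t))^2 dt = \underbrace{\sum_{i=1}^{n} \int_0^1 (y_i(t) - \hat{y}_i(t))^2 dt}_{SS_{Error}} + SS_{Regression} \underbrace{- \, 2n \int_0^1 \bar{y}(t) \, \bar{e}(t) \, dt}_{Bias}$$
where $\bar{e}(t) = n^{-1} \sum_{i=1}^{n} (y_i(t) - \hat{y}_i(t))$, $\forall t \in [0, 1]$, is the average error function.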
42 The bias
The bias is due to the different shapes of the distributions:
$$bias = -2n \left[ \sigma^2_{\bar{y}} - \sum_{j=1}^{p} \hat{\gamma}_j \, \rho(\bar{x}_j(t), \bar{y}(t)) \, \sigma_{\bar{x}_j} \sigma_{\bar{y}} \right]$$
where $\rho(\bar{x}_j(t), \bar{y}(t))$ is the correlation between the average quantile functions. The bias is 0 when all the histogram data have the same shape. It represents the inability of the linear transformation to fit distributions that are very different in shape.
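43 A measure of fitting: Pseudo $R^2$
Considering that
$$SS(Y) - SS_{Error} = \underbrace{\sum_{i=1}^{n} (\bar{\hat{y}}_i - \bar{y})^2 + \sum_{i=1}^{n} \int_0^1 (\hat{y}^c_i(t) - \bar{y}^c(t))^2 dt}_{SS_{Regression}} - 2n \int_0^1 \bar{y}(t) \, \bar{e}(t) \, dt$$
we propose the following pseudo $R^2$:
$$PseudoR^2 = \min \left[ \max \left[ 0; 1 - \frac{SS_{Error}}{SS(Y)} \right]; 1 \right]$$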
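The pseudo $R^2$ can be sketched as follows, assuming observed and predicted histograms are given by their quantile values at the same cumulative levels (rows of `Qy` and `Qy_hat`) with level weights `p`. The function names are hypothetical.

```python
import numpy as np

def d2(Q1, Q2, p):
    # Squared Wasserstein distance from quantile values at common levels.
    C1, R1 = (Q1[..., 1:] + Q1[..., :-1]) / 2, (Q1[..., 1:] - Q1[..., :-1]) / 2
    C2, R2 = (Q2[..., 1:] + Q2[..., :-1]) / 2, (Q2[..., 1:] - Q2[..., :-1]) / 2
    return np.sum(p * ((C1 - C2) ** 2 + (R1 - R2) ** 2 / 3), axis=-1)

def pseudo_r2(Qy, Qy_hat, p):
    """Pseudo R^2 = min(max(0, 1 - SS_Error / SS(Y)), 1)."""
    Qy, Qy_hat, p = (np.asarray(a, dtype=float) for a in (Qy, Qy_hat, p))
    ss_error = d2(Qy, Qy_hat, p).sum()
    # SS(Y): squared distances from the average quantile function (barycenter).
    ss_y = d2(Qy, Qy.mean(axis=0), p).sum()
    return min(max(0.0, 1.0 - ss_error / ss_y), 1.0)

Qy = np.array([[0.0, 10.0], [60.0, 70.0], [20.0, 40.0]])
print(pseudo_r2(Qy, Qy, [1.0]))  # a perfect fit gives 1.0
```

Clipping at 0 covers the case where the fitted distributions are further from the data than the barycenter is.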
44 An application on the ozone concentration in 78 USA sites
Hourly monitoring of ground-level ozone concentration and other meteorological variables.
45 Forecasting OZONE levels
Ozone is a gas that can cause respiratory diseases. In the literature there exist studies that relate the OZONE level to the temperature, the wind speed and the solar radiation. Given the distribution of TEMPERATURE (°C), the distribution of WIND SPEED (meters per second) and the distribution of SOLAR RADIATION (watts per square meter), the main objective is to predict the distribution of OZONE CONCENTRATION (parts per billion) using a linear model. We have chosen as the period of observation the summer season of 2010 and the central hours of the day (10 a.m. to 5 p.m.). The model is:
$$OZONE(t) = \beta_0 + \beta_1 \overline{TEMP} + \beta_2 \overline{WSPEED} + \beta_3 \overline{SORAD} + \gamma_1 TEMP^c(t) + \gamma_2 WSPEED^c(t) + \gamma_3 SORAD^c(t) + \varepsilon(t)$$
46 The estimated model for OZONE(t)
Wasserstein-based LS (WASS-LS, the current proposal), with the 95% bootstrap C.I. in parentheses. The model terms are the means of TEMP, WSPEED and SORAD, followed by $TEMP^c(t)$, $WSPEED^c(t)$ and $SORAD^c(t)$. Interval bounds as transcribed: -1.4; ; ,8;,34 0,049;0,09 (mean terms) and 0,49;1,36 1,08;3,0 0,0118;0,04 (centered terms).
Billard (2006) regression (SREG), as transcribed: OZONE 6, , 358TEMP 1, 555 WSPEED 0, 0086 SORAD, with intervals [19.17;34.74] [ 0.03;0.51] [0.78;3.5] [0.005;0.0117].
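47 Diagnostics
Diagnostics | WASS-LS | SREG
Rsquare | NA | 0.08
The remaining rows are Pseudo Rsquare, $\Omega$ (Dias-Brito 2011), RMSE_W and RMSE_L2; the entries that survive in the transcription are "0.74 NA" after $\Omega$ and ",93*" after RMSE_W.
$$RMSE_W = \sqrt{ n^{-1} \sum_{i=1}^{n} d_W^2(y_i(t), \hat{y}_i(t)) }$$
$$RMSE_{L_2} = \sqrt{ n^{-1} \sum_{i=1}^{n} d_{L_2}^2(F_i(y), \hat{F}_i(y)) }, \qquad d_{L_2}^2(F_i(y), \hat{F}_i(y)) = \int (F_i(y) - \hat{F}_i(y))^2 dy$$
$$\Omega = \frac{\sum_{i=1}^{n} d_W^2(\hat{y}_i(t), \bar{Y})}{\sum_{i=1}^{n} d_W^2(y_i(t), \bar{Y})}, \quad \text{where } \bar{Y} = n^{-1} \sum_{i=1}^{n} \bar{y}_i$$
* The Billard model does not allow a distribution to be computed directly. Given a set of input distributions, a Monte Carlo experiment is needed to compute an output distribution.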
48 Main references
- BILLARD, L. and DIDAY, E. (2006): Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley Series in Computational Statistics. John Wiley & Sons.
- BOCK, H.H. and DIDAY, E. (2000): Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Studies in Classification, Data Analysis and Knowledge Organisation, Springer-Verlag.
- CUESTA-ALBERTOS, J.A., MATRAN, C. and TUERO-DIAZ, A. (1997): Optimal transportation plans and convergence in distribution. Journ. of Multiv. An., 60.
- GIBBS, A.L. and SU, F.E. (2002): On choosing and bounding probability metrics. Intl. Stat. Rev. 70 (3).
- IRPINO, A., LECHEVALLIER, Y. and VERDE, R. (2006): Dynamic clustering of histograms using Wasserstein metric. In: Rizzi, A., Vichi, M. (eds.) COMPSTAT 2006. Physica-Verlag, Berlin.
- VERDE, R. and IRPINO, A. (2008): Comparing histogram data using a Mahalanobis-Wasserstein distance. In: Brito, P. (ed.) COMPSTAT 2008. Physica-Verlag, Springer, Berlin.
- VERDE, R. and IRPINO, A. (2010): Ordinary Least Squares for histogram data based on Wasserstein distance. COMPSTAT 2010.
- DIAS, S. and BRITO, P. (2011): A new linear regression model for histogram-valued variables. ISI 2011.
49 Thank you Rosanna Verde Antonio Irpino
More informationDescriptive Statistics for Symbolic Data
Outline Descriptive Statistics for Symbolic Data Paula Brito Fac. Economia & LIAAD-INESC TEC, Universidade do Porto ECI 2015 - Buenos Aires T3: Symbolic Data Analysis: Taking Variability in Data into Account
More informationSimple Linear Regression for the Climate Data
Prediction Prediction Interval Temperature 0.2 0.0 0.2 0.4 0.6 0.8 320 340 360 380 CO 2 Simple Linear Regression for the Climate Data What do we do with the data? y i = Temperature of i th Year x i =CO
More informationGaussian Mixture Model
Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,
More informationTrends in the Relative Distribution of Wages by Gender and Cohorts in Brazil ( )
Trends in the Relative Distribution of Wages by Gender and Cohorts in Brazil (1981-2005) Ana Maria Hermeto Camilo de Oliveira Affiliation: CEDEPLAR/UFMG Address: Av. Antônio Carlos, 6627 FACE/UFMG Belo
More informationNew models for symbolic data analysis
New models for symbolic data analysis B. Beranger, H. Lin and S. A. Sisson arxiv:1809.03659v1 [stat.co] 11 Sep 2018 Abstract Symbolic data analysis (SDA) is an emerging area of statistics based on aggregating
More informationMore on Unsupervised Learning
More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data
More informationSensitivity Analysis When Model Outputs Are Functions
LAUR-04-1991 Sensitivity Analysis When Model Outputs Are Functions Michael D. McKay and Katherine Campbell formerly of the Statistical Sciences Group Los Alamos National Laboratory Presented at the Fourth
More informationApplied Machine Learning Annalisa Marsico
Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature
More informationTable of Contents. Multivariate methods. Introduction II. Introduction I
Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation
More informationMathematics for Economics MA course
Mathematics for Economics MA course Simple Linear Regression Dr. Seetha Bandara Simple Regression Simple linear regression is a statistical method that allows us to summarize and study relationships between
More informationthe logic of parametric tests
the logic of parametric tests define the test statistic (e.g. mean) compare the observed test statistic to a distribution calculated for random samples that are drawn from a single (normal) distribution.
More informationModule Master Recherche Apprentissage et Fouille
Module Master Recherche Apprentissage et Fouille Michele Sebag Balazs Kegl Antoine Cornuéjols http://tao.lri.fr 19 novembre 2008 Unsupervised Learning Clustering Data Streaming Application: Clustering
More informationstatistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:
Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility
More informationIntensity Analysis of Spatial Point Patterns Geog 210C Introduction to Spatial Data Analysis
Intensity Analysis of Spatial Point Patterns Geog 210C Introduction to Spatial Data Analysis Chris Funk Lecture 5 Topic Overview 1) Introduction/Unvariate Statistics 2) Bootstrapping/Monte Carlo Simulation/Kernel
More informationOptimal Transport and Wasserstein Distance
Optimal Transport and Wasserstein Distance The Wasserstein distance which arises from the idea of optimal transport is being used more and more in Statistics and Machine Learning. In these notes we review
More informationSmall distortion and volume preserving embedding for Planar and Euclidian metrics Satish Rao
Small distortion and volume preserving embedding for Planar and Euclidian metrics Satish Rao presented by Fjóla Rún Björnsdóttir CSE 254 - Metric Embeddings Winter 2007 CSE 254 - Metric Embeddings - Winter
More informationPart 6: Multivariate Normal and Linear Models
Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of
More informationRobust model selection criteria for robust S and LT S estimators
Hacettepe Journal of Mathematics and Statistics Volume 45 (1) (2016), 153 164 Robust model selection criteria for robust S and LT S estimators Meral Çetin Abstract Outliers and multi-collinearity often
More informationData Mining Techniques
Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!
More informationTopic 14: Inference in Multiple Regression
Topic 14: Inference in Multiple Regression Outline Review multiple linear regression Inference of regression coefficients Application to book example Inference of mean Application to book example Inference
More informationCOMPARISON OF GMM WITH SECOND-ORDER LEAST SQUARES ESTIMATION IN NONLINEAR MODELS. Abstract
Far East J. Theo. Stat. 0() (006), 179-196 COMPARISON OF GMM WITH SECOND-ORDER LEAST SQUARES ESTIMATION IN NONLINEAR MODELS Department of Statistics University of Manitoba Winnipeg, Manitoba, Canada R3T
More informationData Exploration and Unsupervised Learning with Clustering
Data Exploration and Unsupervised Learning with Clustering Paul F Rodriguez,PhD San Diego Supercomputer Center Predictive Analytic Center of Excellence Clustering Idea Given a set of data can we find a
More informationLearning Task Grouping and Overlap in Multi-Task Learning
Learning Task Grouping and Overlap in Multi-Task Learning Abhishek Kumar Hal Daumé III Department of Computer Science University of Mayland, College Park 20 May 2013 Proceedings of the 29 th International
More informationAsymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands
Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department
More informationBayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes
Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Andrew O. Finley 1 and Sudipto Banerjee 2 1 Department of Forestry & Department of Geography, Michigan
More informationLecture 10 Multiple Linear Regression
Lecture 10 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 10-1 Topic Overview Multiple Linear Regression Model 10-2 Data for Multiple Regression Y i is the response variable
More informationDimension Reduction Techniques. Presented by Jie (Jerry) Yu
Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage
More informationEE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015
EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,
More informationMatrix Approach to Simple Linear Regression: An Overview
Matrix Approach to Simple Linear Regression: An Overview Aspects of matrices that you should know: Definition of a matrix Addition/subtraction/multiplication of matrices Symmetric/diagonal/identity matrix
More informationAn Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets
IEEE Big Data 2015 Big Data in Geosciences Workshop An Optimized Interestingness Hotspot Discovery Framework for Large Gridded Spatio-temporal Datasets Fatih Akdag and Christoph F. Eick Department of Computer
More informationBayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes
Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Alan Gelfand 1 and Andrew O. Finley 2 1 Department of Statistical Science, Duke University, Durham, North
More informationPost-exam 2 practice questions 18.05, Spring 2014
Post-exam 2 practice questions 18.05, Spring 2014 Note: This is a set of practice problems for the material that came after exam 2. In preparing for the final you should use the previous review materials,
More informationIntensity Analysis of Spatial Point Patterns Geog 210C Introduction to Spatial Data Analysis
Intensity Analysis of Spatial Point Patterns Geog 210C Introduction to Spatial Data Analysis Chris Funk Lecture 4 Spatial Point Patterns Definition Set of point locations with recorded events" within study
More informationHierarchical models for the rainfall forecast DATA MINING APPROACH
Hierarchical models for the rainfall forecast DATA MINING APPROACH Thanh-Nghi Do dtnghi@cit.ctu.edu.vn June - 2014 Introduction Problem large scale GCM small scale models Aim Statistical downscaling local
More informationProbabilistic forecasting of solar radiation
Probabilistic forecasting of solar radiation Dr Adrian Grantham School of Information Technology and Mathematical Sciences School of Engineering 7 September 2017 Acknowledgements Funding: Collaborators:
More informationInvestigation of Monthly Pan Evaporation in Turkey with Geostatistical Technique
Investigation of Monthly Pan Evaporation in Turkey with Geostatistical Technique Hatice Çitakoğlu 1, Murat Çobaner 1, Tefaruk Haktanir 1, 1 Department of Civil Engineering, Erciyes University, Kayseri,
More informationBayesian spatial quantile regression
Brian J. Reich and Montserrat Fuentes North Carolina State University and David B. Dunson Duke University E-mail:reich@stat.ncsu.edu Tropospheric ozone Tropospheric ozone has been linked with several adverse
More informationLeast Squares. Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Winter UCSD
Least Squares Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 75A Winter 0 - UCSD (Unweighted) Least Squares Assume linearity in the unnown, deterministic model parameters Scalar, additive noise model: y f (
More informationBayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes
Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota,
More informationTime Series Modeling of Histogram-valued Data The Daily Histogram Time Series of SP500 Intradaily Returns
Time Series Modeling of Histogram-valued Data The Daily Histogram Time Series of SP5 Intradaily Returns Gloria González-Rivera University of California, Riverside Department of Economics Riverside, CA
More informationChapter 3. Riemannian Manifolds - I. The subject of this thesis is to extend the combinatorial curve reconstruction approach to curves
Chapter 3 Riemannian Manifolds - I The subject of this thesis is to extend the combinatorial curve reconstruction approach to curves embedded in Riemannian manifolds. A Riemannian manifold is an abstraction
More informationCorrelation and regression
1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,
More informationMSc / PhD Course Advanced Biostatistics. dr. P. Nazarov
MSc / PhD Course Advanced Biostatistics dr. P. Nazarov petr.nazarov@crp-sante.lu 04-1-013 L4. Linear models edu.sablab.net/abs013 1 Outline ANOVA (L3.4) 1-factor ANOVA Multifactor ANOVA Experimental design
More informationCoDa-dendrogram: A new exploratory tool. 2 Dept. Informàtica i Matemàtica Aplicada, Universitat de Girona, Spain;
CoDa-dendrogram: A new exploratory tool J.J. Egozcue 1, and V. Pawlowsky-Glahn 2 1 Dept. Matemàtica Aplicada III, Universitat Politècnica de Catalunya, Barcelona, Spain; juan.jose.egozcue@upc.edu 2 Dept.
More information