Histogram data analysis based on Wasserstein distance
|
|
- Christopher Chase
- 6 years ago
- Views:
Transcription
1 Histogram data analysis based on Wasserstein distance Rosanna Verde Antonio Irpino Department of European and Mediterranean Studies Second University of Naples Caserta - ITALY
2 Aims Introduce: New distances to compare histogram (distribution) data in the framework of the Symbolic Data Analysis especially the Wasserstein Mallows l distance Show: Some properties of Wasserstein Mallows l distance Present: Methodological approaches based on l distance Basic Statistics for histogram data (including intervals and distributions as particular cases) Clustering methods:dca and hierarchical (Ward criterion) Linear regression model: l as metric for Sum Square Errors in OLS estimation method
3 Main Sources of histogram data Results of summary/clustering procedures From surveys From large databases From sensors 0,6 0,5 0,5 Temperatures Pollutant concentration Network activity 0,4 0,3 0, 0,1 0, 0,1 0, Data streams 0 Description of time window data sequences Image analysis Color bandwidths 0,6 0,5 0,4 0,5 0,4 Confidentiality data Summary data non punctual 0,3 0, 0,1 0 0, 0,1 0, 0, 0,1 0,1
4 Histogram data as a particular case of modal symbolic descriptions [Bock and Diday (000) ] Histogram data is a model for representing the empirical distribution of a continuous variable Y partitioned into a set of contiguous I h intervals (bins) with associated h weights. A histogram is represented by a set of H ordered pairs (I h, h )} such that: Ihi yh; y h ; yh yh ; yh, yh h1,..., H ; max I min y y h h' I I hi h h h1,..., H h1,..., H h h' 0 h h1,..., H h 1 Y(i)=[([0-10],0.);([10-0],0.5);([0-30],0.1); ([30-40],0.)] 0,6 0,5 0,4 0,3 0, 0,1 0 0,5 0, 0, 0,
5 Comparison of two histogram data How do compare two units described by two histograms? A possibility is to use metrics developed for comparing probability distributions 0,5 0,4 0,4 0,45 0,3 0,3 0,3 0, 0,1 0, 0,1 0,15 0,
6 Metrics used for the evaluation of the convergence of two probability measures (Gibbs and Su, 00) Given: - a domain on which is possible to define a Borel -algebra, - two measure, - the density functions f and g - the corresponding distribution functions F and G and - a subdominant measure, like: =(+)/, Gibbs and Su (00) present a review of the most used dissimilarities:
7 A suitable measure to compute the distance between histograms: Wasserstein-Kantorovich metric we propose to use the Wasserstein-Kantorovich metric: W 1 1 1, ( ) ( ) d F t G t dt 0 in particular the derived l Mallow s distance between two quantile functions W, ( ) ( ) d F t G t dt 0 The main difficulties to compute this distance is the analytical definition of the quantile function But in our case we treat especially with histogram data), indeed
8 Histograms are locally uniform Having represented an histogram Y(i) as {(I 1, 1 ),,(I h, h ),, (I m, H )} (where H is the number of intervals of the support), we may define: the empirical density function f(y) as: h f( y) if y [ yh; yh) f y h h yh 0 otherwise h the empirical distribution function w h h k k 1 0 if h 0 if h1,..., m hence the empirical quantile function (The inverse of the distribution function) y h y wh wh 1 F( y) wh y y iff y y y h h y y F( y) h t h h h 1 h1 F t y y h h y w h h t wh 1 wh wh 1 t w ( ) where 0 1 F 1 () t
9 Geometric interpretation The comparison of two quantile functions associated with two histograms requires the partition of the support of the two qf s into m intervals with associated uniform density 1,00 0,95 0,90 0,85 0,9 t=0.90 0,9 0,80 0,75 0,70 0,7 t=0.70 0,65 0,60 0,6 0,55 t=0.60 0,50 0,45 3 0,40 0,35 t=0.30 0,30 0,3 0,5 0,0 0,15 0,10 t=0.15 0,15 0,05 0, I 11 I 1 10 I 31 I0 41 I I I 1 70I I 3 80I 4 I 5 90 I
10 l l Ili yli, y li I y, y lj lj lj d W(I il ;I jl ) l d W(I il ;I jl ) [0; 5] [60,70] [ 5;10] [70;73.3] [10; 16.6] [73.3;80] [16.6;0] [80;83.3] [0;30] [83.3;90] [30;40] [90;100] d Y Y d I I c c r r m m (, ) (, ) ( ) ( ) W i j l li lj l li lj li lj l1 l1 3 With y y y y Ili yli, y li ; cli ; rli li li li li center and radius of I li Verde, Irpino, COMPSTAT006
11 Multivariate distance Assuming independence among p variables we propose the an extension of d W distance to the multivariate case p=
12 Average histogram based on distance d W Wasserstein The barycenter (average) histogram Y(b) of a set of histogram data Y(i) (i=1,, n) can be computed minimizing the Sum of Square n Distances f ( Y( b*)) dw ( Y( i), Y( b)) i1 (like for a cloud of points in classical data analysis) n m 1 f ( Y( b*)) ( c c ) ( r r ) i1 l1 3 l li lb li lb That is minimized when the usual first order conditions are satisfied: n f c c 0 clb f r lb 3 l li lb n n i1 1 1 clb n li ; n c rlb n rli i1 i1 lrli rlb 0 i Y( b) c r ; c r, ;...; c r ; c r, ;...; c r ; c r, b b b b lb lb lb lb l mb mb mb mb m
13 Variance of the set of histograms Y(i) Determined Y(b) through the minimization problem n * arg min W ( ( ), ( )) i1 b d Y i Y b The variance of the set of histogram data is: n m l ( cli clb ) ( rli rlb ) i1 l1 3 1
14 Clustering methods for Histogram data based on Wasserstein distance Dynamic Clustering Algorithm Hierarchical method (Ward criterion)
15 Dynamic Clustering Algorithm general schema of the Nuées Dynamiques (Diday 197) The algorithm aims to obtain a partition P of a set E of symbolic data in k clusters and a set L of k prototypes {G 1,,G i,,g k } that best represent the clusters {C 1,,C i,,c k } of the partition P The algorithm optimizes a criterion D of fitting between prototypes and clusters: * * D ( P, L ) Min D( P, L) P P, L k L k where : P k is the set of all the partitions of E into k clusters and L k is the set of k prototypes. The algorithm executes alternatively: an allocation step and a representation step In this case E is a set of histogram data
16 Wasserstein distance for Clustering data according to Dynamic Clustering algorithm classical schema (a) Initialization k prototypes Y(b 1 ),...,Y(b K ) of L are randomly chosen (b) Allocation step For each histogram Y(i) of E the allocation index l to the clusters is computed and Y(i) is assigned to the cluster C l where: (c) Representation step For each cluster C k is identified the prototype Y(b k ) of L that minimizes D ( C, Y( b )) d ( Y( i), Y( b )) k k W k ic k l arg min d ( Y( i), Y( b )) k1,...,k W k (b) and (c) are repeated until the convergence Wasserstein l distance Histogram prototype/ barycenter of histograms belonging to the cluster C k
17 Property of Wasserstein distance: Inertia decomposition Being the prototype the barycenter and Euclidean distance, then d W is a squared is the inertia for grouped data Then, according to the Huygens theorem TI can be decomposed in Within and Between inertia
18 Some results Application on US monthly temperatures data set We have considered a dataset constituted by the Monthly Average Temperatures recorded in the 48 states of US from 1895 to 004 (Hawaii and Alaska are not present in the dataset). The analysis consists of the following three steps: 1. Representation of the distributions of temperatures of each State for each month by means of histograms;. Computing of the distance matrix using d W; 3. DCA procedure to find the best partition P 4. Calinski Harabaz index is computed to compute the optimal number k of clusters 5. Hierarchical clustering procedure based on the Ward criterion The original dataset is freely available at the National Climatic Data Center website
19 Histogram barycenter (prototype) Montana February Florida Barycenter (prototype)
20 Experimental results Dynamic Clastering algorithm results: The first 5 columns represent the Quality Partition index (min, max, means, median) on 100 iterations the last one the best value of Calinski-Harabaz index
21 Hierarchical clustering according to the Ward criterion - based on Wasserstein distance For example, we can apply the Ward criterion for a hierarchical clustering of the set E of histogram data: the procedure joins those two classes which minimize
22 Hierarchical tree
23 Final results and coloured map
24 Regression model for histogram variables based on Wasserstein distance
25 A Regression model for histogram data Data = Model Fit + Residual Linear regression is a general method for estimating/describing association between a continuous outcome variable (dependent) and one or multiple predictors in one equation. Easy conceptual task with classic data But what does it means when dealing with histogram data? Billard, Diday, IFCS 006 Verde, Irpino, COMPSTAT 010; CLADAG 011 Dias, Brito, ISI 011
26 Regression with histograms variables: a proposal in SDA framework A solution was given by Billard and Diday (006) The model fit a linear regression line throught the mixture of the n bivariate distributions Given a punctual value of X it is possible to predict the punctual value of Y
27 Linear Regression Model for histogram data: our approach involves SDA as well as Functional Data Analysis (Verde, Irpino, 010) Given a histogram variable X, we search for a linear trasformation of X which allows us to predict the histogram variable Y For example: given the histogram of the temperatures observed in a region during a month, is it possible to predict the distribution of the temperature of another month using a linear transformation of the histogram variable? A histogram by a histogram
28 Multiple regression model for quantile functions Our concurrent multiple regression model is: in matrix notation: p y ( t) b b x ( t) e ( t) i 0 j ij i j1 Quantile functions associated to histogram/ distribution data Y( t) X ( t) b e( t) This formulation is analogous to the functional linear model (Ramsay, Silverman, 003) except for the constants b parameters and for the functions y i (t), x ij (t) which are quantile functions while each e i (t) is a residual function (distribution?) for all i=1,, n.
29 Parameters estimation - LS method using Wasserstein distance According to the nature of the variables, for the parameters estimation, we propose to extend the Least Squares principle to the functional case using a typical metric between quantile functions: W 0 i j i j d x,x F ( t ) F ( t ) dt Wasserstein l distance between two quantile functions
30 Interpretative decomposition of the distance in three components related to the location size and shape parameters W ( i, j ) : Fi ( ) Fj ( ) 0 d x x t t dt x x (1 ( x, x )) i j i j i j Location Size Shape i j In general case of distributions If the two distributions have the same shape: d ( x, x ) : x x W i j i j i j Location Size If they have the same size and shape: d ( x, x ) : x x W i j i j Location
31 Notations x t F t 1 i( ) i ( ) quantile function of x i Mean and variance of the quantile function: 1 xi xi() t dt 0 n n x( t) xi( t) t [0,1]; x xi( t) dt x( t) dt n n i1 i x ( ) ( ) i xi t dt xi xi t dt x x i i 0 0 Correlation between quantile functions (x i, x j ) 1 x ( t) x ( t) dt i j i j 1 0 ( xi, x j ) xi ( t) x j ( t) dt ( xi, x j ) xi x x j ix j x x 0 i j x x
32 If data are histograms the mean, variance and correlation QQ plot formulae are: Empirical quantile function Mean and variance of the quantile function: 1 x x ( t) dt x c i i i l il 0 l Correlation between quantile functions (x i, x j ) 1 1 x () i xi t dt xi x i l cil ril x i 3 0 l 1 x ( t) x ( t) dt x x 1 c c r r xx i j i j l il jl il jl i j 0 l 3 ( xi, xj ) ( xi, x j ) x x i j i j
33 Interpretation of the Linear Regression model The regression model is here proposed to find the best linear transformation of the X j (t) s in order to predict Y(t) y i (t) = b 0 + b 1 x 1i (t) + + b p x pi (t) + e i (t) y i (t), x 1i (t),, x pi (t) are quantile functions and the estimated response variable yˆ i () t is again a quantile function = linear combination of quantile functions - according to the aim of symbolic data analysis: Input Symbolic data Numerical/Symbolic method Output Symbolic data (same nature of the input data)
34 Fitting linear regression model Find a linear transformation of the quantile functions of x ij (for j=1,,p) in order to predict the quantile function of y i i.e.: yˆ ( t) b b x ( t) t [0, 1] i 0 j ij j1 p It is worth of noting the linear transformation is unique: the parameters b 0 and b j are estimated for all the x ij and y i distributions A first problem: Only if b j > 0 a quantile function yˆ i () t can be derived. In order to overcome this problem, we propose a solution based on the decomposition of the Wasserstein distance and on the NNLS algorithm.
35 Solution The quantile function can be decomposed as: x c ( t) x x ( t) where ij ij ij c x ( t) x ( t) x is the centered quantile function ij ij ij Then, we propose the following regression model: p p c i b0 b j ij g j ij i j1 j1 y ( t) x x ( t) ( t) 0 t 1 yˆ () t i Using the Wasserstein distance it is possible to set up a OLS method that returns the two sets of coefficients (b 0,b j ; g j ).
36 The error term: a property of the Wasserstein distance decomposition The squared error can be written according to the two components 1 ˆ ˆ i W i, y i i yi 0 e d ( y ) y ( t) ( t) dt ˆ c c y y d ( y, yˆ ) i i W i i
37 Least Squares parameters estimation n n arg min f ( b, ) ( ), ˆ j g j ei () t dw yi t yi ( t) bj, g j i1 i1 n 1 p p c c SSE f ( b j, g j ) yi yi ( t) b0 b jxij g ijxi ( t) dt i1 0 j1 j1 Matrix notation: 1 c c SSE Y Y ( t) X X ( t) dt 0
38 The estimated parameters x y ny x ( x, y ) ˆ b y ˆ b x; ˆ b ; ˆ g n Correlation between quantile functions x i i i i i xi y i1 i1 n 1 n xi nx xi i1 i1 y i n i It is easy to see that: ˆ b, ˆ b and ˆ g
39 Multiple regression estimated parameters According to the properties of the Wasserstein distance between 1 c c quantile functions, the elements of the product matrix X ( t)' X ( t) computed according to our definition of a inner product operator between two quantile functions x ij and x i k : Then, the parameters are estimated under the constraint j (j=1,, p) using the NNLS (Lawson, Hanson, 1974). 1 ˆ X ' X ' X Y ˆ X ( t)' X ( t) dt X ( t)' Y ( t) dt 1 0 x ( t) x ( t) dt ( x, x ) ij i' k ij i' k x x c c c c 0 0 X c otherwise ij X c i ' k 0 if g 0 j ˆ g 0 it is a transformed matrix by NNLS
40 Interpretation of the parameters Regression parameters for the distribution mean locations ˆ b ˆ,..., ˆ 0, b1 b p Shrinking factors for the variability ˆ 1,..., ˆp g g y > 1 (< 1) the ˆi histogram has a greater (smaller) variability than the x ij histogram.
41 Tools for the interpretation The sum of squares of Y is n n 1 ( ) W i ( ), ( ) i ( ) ( ) i1 i1 0 SS Y d y t y t y t y t dt In classical regression model, the SS(Y) is constituted by: SS Error +SS Regression
42 Decomposition of SS(Y) Being: we obtain p p c i b0 b j ij g j ij j1 j1 yˆ ( t) x x ( t) n n 1 ( ) W i ( ), ( ) i( ) i( ) i1 i1 0 ˆ SS Y d y t y t y t y t dt n 1 1 i i1 0 0 i1 Error y( t) yˆ ( t) dt n y( t) e ( t) dt SS Regression n 1 e ( t) ( y ( ) ˆ i t yi( t)) t [0,1] n SS Average error function Bias
43 The bias The bias is due to different shapes of distributions: p c bias n ˆ y g j( x j( t), y( t)) c x y j j1 Correlation between the average quantile functions bias=0 when all the histograms data have the same shape That represents the incapacity of the linear transformation of fitting distributions that are very different in shape
44 A measure of fitting Pseudo R Considering that n 1 1 n c c ˆ ˆ Regression i i i i1 i1 0 0 SS y y y ( t) y ( t) dt y( t) e( t) dt We propose the following pseudo R PseudoR SS min max 0;1 Error ;1. SS( Y)
45 An application on the Ozone concentration in 78 USA sites ( Hourly monitoring of Ozone ground-level concentration and other meteorological variables
46 Forecasting OZONE levels Ozone is a gas that can cause respiratory deseases. In the literature there exists studies that relates the OZONE level to the Temperature, the Wind speed and the Solar radiation. Given the distribution of TEMPERATURE (C ), the distribution of WIND SPEED (meters per second) and the distribution of SOLAR RADIATION (watt per square meter), the main objective is to predict the distribution of OZONE CONCENTRATION (particles per billion) using a linear model. We have chosen as period of observation the Summer seasons of 010 and the central hours of the days (10 a.m. 5 p.m.). The model is: OZONEt ( ) b b TEMP b WSPEED b SORAD g TEMP ( t) g WSPEED ( t) g SORAD ( t) e () t c c c 1 3
47 The estimated model OZONE() t Wasserstein based LS (WASS-LS) (the current proposal) In parentesis the 95% Bootstrap C.I TEMP WSPEED SORAD -1.4; ; ,8;,34 0,049;0,09 c c c TEMP ( t) WSPEED ( t) SORAD ( t) 0,49;1,36 1,08;3,0 0,0118;0,04 Billard (006) regression (SREG) OZONE 6, , 358TEMP 1, 555 WSPEED 0, 0086 SORAD [19.17;34.74] [ 0.03;0.51] [0.78;3.5] [0.005;0.0117]
48 Diagnostics Diagnostics WASS- LS SREG Rsquare NA 0.08 Pseudo Rsquare * (Dias-Brito 011) 0.74 NA RMSE_W ,93* RMSE_L * RMSE. n 1 ˆ L n d Fi i i1 d F ( y) Fˆ ( y) dy i i ( y), F ( y) n 1i n 1i W W d yˆ ( t), Y d y ( t), Y i i n 1 ˆ W W i t i i1 n 1 where Y n yi i1 RMSE n d y ( ), y ( t) * The Billard model does not allow to compute directly a distribution. Given a set of input distributions, a Montecarlo experiment is needed to compute an output distribution.
49 Main references BILLARD, L. and DIDAY, E. (006): Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley Series in Computational Statistics. John Wiley & Sons. BOCK, H.H. and DIDAY, E. (000): Analysis of Symbolic Data, Exploratory methods for extracting statistical information from complex data. Studies in Classification, Data Analysis and Knowledge Organisation, Springer-Verlag. CUESTA-ALBERTOS, J.A., MATRAN, C., TUERO-DIAZ, A. (1997): Optimal transportation plans and convergence in distribution. Journ. of Multiv. An., 60, GIBBS, A.L. and SU, F.E. (00): On choosing and bounding probability metrics. Intl. Stat. Rev. 7 (3), IRPINO, A., LECHEVALLIER, Y. and VERDE, R. (006): Dynamic clustering of histograms using Wasserstein metric. In: Rizzi, A., Vichi, M. (eds.) COMPSTAT 006. Physica-Verlag, Berlin, VERDE, R. and IRPINO, A.(008): Comparing Histogram data using a Mahalanobis Wasserstein distance. In: Brito, P. (eds.) COMPSTAT 008. Physica Verlag, Springer, Berlin, VERDE, R. and IRPINO, A.(010): Ordinary Least Squares for Histogram Data based on Wasserstein Distance COMPSTAT 010 DIAS, S. and BRITO P.,(011) A new linear regression model for histogram-valued variables, ISI 011
50 Thank you Rosanna Verde Antonio Irpino
51
52 ( X i, X ) j l c c 1 r r 3 s s xx l il jl il jl i j i j
Histogram data analysis based on Wasserstein distance
Histogram data analysis based on Wasserstein distance Rosanna Verde Antonio Irpino Department of European and Mediterranean Studies Second University of Naples Caserta - ITALY SYMPOSIUM ON LEARNING AND
More informationHow to cope with modelling and privacy concerns? A regression model and a visualization tool for aggregated data
for aggregated data Rosanna Verde (rosanna.verde@unina2.it) Antonio Irpino (antonio.irpino@unina2.it) Dominique Desbois (desbois@agroparistech.fr) Second University of Naples Dept. of Political Sciences
More informationOrder statistics for histogram data and a box plot visualization tool
Order statistics for histogram data and a box plot visualization tool Rosanna Verde, Antonio Balzanella, Antonio Irpino Second University of Naples, Caserta, Italy rosanna.verde@unina.it, antonio.balzanella@unina.it,
More informationA new linear regression model for histogram-valued variables
Int. Statistical Inst.: Proc. 58th World Statistical Congress, 011, Dublin (Session CPS077) p.5853 A new linear regression model for histogram-valued variables Dias, Sónia Instituto Politécnico Viana do
More informationDistributions are the numbers of today From histogram data to distributional data. Javier Arroyo Gallardo Universidad Complutense de Madrid
Distributions are the numbers of today From histogram data to distributional data Javier Arroyo Gallardo Universidad Complutense de Madrid Introduction 2 Symbolic data Symbolic data was introduced by Edwin
More informationMallows L 2 Distance in Some Multivariate Methods and its Application to Histogram-Type Data
Metodološki zvezki, Vol. 9, No. 2, 212, 17-118 Mallows L 2 Distance in Some Multivariate Methods and its Application to Histogram-Type Data Katarina Košmelj 1 and Lynne Billard 2 Abstract Mallows L 2 distance
More informationProximity Measures for Data Described By Histograms Misure di prossimità per dati descritti da istogrammi Antonio Irpino 1, Yves Lechevallier 2
Proximity Measures for Data Described By Histograms Misure di prossimità per dati descritti da istogrammi Antonio Irpino 1, Yves Lechevallier 2 1 Dipartimento di Studi Europei e Mediterranei, Seconda Università
More informationModelling and Analysing Interval Data
Modelling and Analysing Interval Data Paula Brito Faculdade de Economia/NIAAD-LIACC, Universidade do Porto Rua Dr. Roberto Frias, 4200-464 Porto, Portugal mpbrito@fep.up.pt Abstract. In this paper we discuss
More informationNew Clustering methods for interval data
New Clustering methods for interval data Marie Chavent, Francisco de A.T. De Carvahlo, Yves Lechevallier, Rosanna Verde To cite this version: Marie Chavent, Francisco de A.T. De Carvahlo, Yves Lechevallier,
More informationConfidence Intervals, Testing and ANOVA Summary
Confidence Intervals, Testing and ANOVA Summary 1 One Sample Tests 1.1 One Sample z test: Mean (σ known) Let X 1,, X n a r.s. from N(µ, σ) or n > 30. Let The test statistic is H 0 : µ = µ 0. z = x µ 0
More informationLinear Regression Model with Histogram-Valued Variables
Linear Regression Model with Histogram-Valued Variables Sónia Dias 1 and Paula Brito 1 INESC TEC - INESC Technology and Science and ESTG/IPVC - School of Technology and Management, Polytechnic Institute
More informationDiscriminant Analysis for Interval Data
Outline Discriminant Analysis for Interval Data Paula Brito Fac. Economia & LIAAD-INESC TEC, Universidade do Porto ECI 2015 - Buenos Aires T3: Symbolic Data Analysis: Taking Variability in Data into Account
More informationA Resampling Approach for Interval-Valued Data Regression
A Resampling Approach for Interval-Valued Data Regression Jeongyoun Ahn, Muliang Peng, Cheolwoo Park Department of Statistics, University of Georgia, Athens, GA, 30602, USA Yongho Jeon Department of Applied
More informationLinear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,
Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,
More informationForecasting Complex Time Series: Beanplot Time Series
COMPSTAT 2010 19 International Conference on Computational Statistics Paris-France, August 22-27 Forecasting Complex Time Series: Beanplot Time Series Carlo Drago and Germana Scepi Dipartimento di Matematica
More informationOn central tendency and dispersion measures for intervals and hypercubes
On central tendency and dispersion measures for intervals and hypercubes Marie Chavent, Jérôme Saracco To cite this version: Marie Chavent, Jérôme Saracco. On central tendency and dispersion measures for
More informationCo-clustering algorithms for histogram data
Co-clustering algorithms for histogram data Algoritmi di Co-clustering per dati ad istogramma Francisco de A.T. De Carvalho and Antonio Balzanella and Antonio Irpino and Rosanna Verde Abstract One of the
More informationGeneralized Ward and Related Clustering Problems
Generalized Ward and Related Clustering Problems Vladimir BATAGELJ Department of mathematics, Edvard Kardelj University, Jadranska 9, 6 000 Ljubljana, Yugoslavia Abstract In the paper an attempt to legalize
More informationClustering and Model Integration under the Wasserstein Metric. Jia Li Department of Statistics Penn State University
Clustering and Model Integration under the Wasserstein Metric Jia Li Department of Statistics Penn State University Clustering Data represented by vectors or pairwise distances. Methods Top- down approaches
More informationLecture 2 Simple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: Chapter 1
Lecture Simple Linear Regression STAT 51 Spring 011 Background Reading KNNL: Chapter 1-1 Topic Overview This topic we will cover: Regression Terminology Simple Linear Regression with a single predictor
More informationPrinciples of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata
Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision
More informationLasso-based linear regression for interval-valued data
Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session CPS065) p.5576 Lasso-based linear regression for interval-valued data Giordani, Paolo Sapienza University of Rome, Department
More informationarxiv: v1 [stat.ml] 23 Jul 2015
CLUSTERING OF MODAL VALUED SYMBOLIC DATA VLADIMIR BATAGELJ, NATAŠA KEJŽAR, AND SIMONA KORENJAK-ČERNE UNIVERSITY OF LJUBLJANA, SLOVENIA arxiv:157.6683v1 [stat.ml] 23 Jul 215 Abstract. Symbolic Data Analysis
More informationA Nonparametric Kernel Approach to Interval-Valued Data Analysis
A Nonparametric Kernel Approach to Interval-Valued Data Analysis Yongho Jeon Department of Applied Statistics, Yonsei University, Seoul, 120-749, Korea Jeongyoun Ahn, Cheolwoo Park Department of Statistics,
More informationRegression for Symbolic Data
Symbolic Data Analysis: Taking Variability in Data into Account Regression for Symbolic Data Sónia Dias I.P. Viana do Castelo & LIAAD-INESC TEC, Univ. Porto Paula Brito Fac. Economia & LIAAD-INESC TEC,
More informationSTAT 540: Data Analysis and Regression
STAT 540: Data Analysis and Regression Wen Zhou http://www.stat.colostate.edu/~riczw/ Email: riczw@stat.colostate.edu Department of Statistics Colorado State University Fall 205 W. Zhou (Colorado State
More informationPart 8: GLMs and Hierarchical LMs and GLMs
Part 8: GLMs and Hierarchical LMs and GLMs 1 Example: Song sparrow reproductive success Arcese et al., (1992) provide data on a sample from a population of 52 female song sparrows studied over the course
More informationMultiple Linear Regression for the Supervisor Data
for the Supervisor Data Rating 40 50 60 70 80 90 40 50 60 70 50 60 70 80 90 40 60 80 40 60 80 Complaints Privileges 30 50 70 40 60 Learn Raises 50 70 50 70 90 Critical 40 50 60 70 80 30 40 50 60 70 80
More informationGeneral Linear Model Introduction, Classes of Linear models and Estimation
Stat 740 General Linear Model Introduction, Classes of Linear models and Estimation An aim of scientific enquiry: To describe or to discover relationshis among events (variables) in the controlled (laboratory)
More informationSimple Linear Regression for the Advertising Data
Revenue 0 10 20 30 40 50 5 10 15 20 25 Pages of Advertising Simple Linear Regression for the Advertising Data What do we do with the data? y i = Revenue of i th Issue x i = Pages of Advertisement in i
More informationLecture 3: Inference in SLR
Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals
More informationTrends in the Relative Distribution of Wages by Gender and Cohorts in Brazil ( )
Trends in the Relative Distribution of Wages by Gender and Cohorts in Brazil (1981-2005) Ana Maria Hermeto Camilo de Oliveira Affiliation: CEDEPLAR/UFMG Address: Av. Antônio Carlos, 6627 FACE/UFMG Belo
More informationFinding Multiple Outliers from Multidimensional Data using Multiple Regression
Finding Multiple Outliers from Multidimensional Data using Multiple Regression L. Sunitha 1*, Dr M. Bal Raju 2 1* CSE Research Scholar, J N T U Hyderabad, Telangana (India) 2 Professor and principal, K
More informationProbability Models for Bayesian Recognition
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIAG / osig Second Semester 06/07 Lesson 9 0 arch 07 Probability odels for Bayesian Recognition Notation... Supervised Learning for Bayesian
More informationFormal Statement of Simple Linear Regression Model
Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor
More informationPart 6: Multivariate Normal and Linear Models
Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of
More informationRemedial Measures for Multiple Linear Regression Models
Remedial Measures for Multiple Linear Regression Models Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Remedial Measures for Multiple Linear Regression Models 1 / 25 Outline
More informationRobust model selection criteria for robust S and LT S estimators
Hacettepe Journal of Mathematics and Statistics Volume 45 (1) (2016), 153 164 Robust model selection criteria for robust S and LT S estimators Meral Çetin Abstract Outliers and multi-collinearity often
More informationUnit 10: Simple Linear Regression and Correlation
Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for
More informationAutocorrelation function of the daily histogram time series of SP500 intradaily returns
Autocorrelation function of the daily histogram time series of SP5 intradaily returns Gloria González-Rivera University of California, Riverside Department of Economics Riverside, CA 9252 Javier Arroyo
More informationSTAT Chapter 11: Regression
STAT 515 -- Chapter 11: Regression Mostly we have studied the behavior of a single random variable. Often, however, we gather data on two random variables. We wish to determine: Is there a relationship
More informationLecture 9 SLR in Matrix Form
Lecture 9 SLR in Matrix Form STAT 51 Spring 011 Background Reading KNNL: Chapter 5 9-1 Topic Overview Matrix Equations for SLR Don t focus so much on the matrix arithmetic as on the form of the equations.
More informationOPTIMAL TRANSPORTATION PLANS AND CONVERGENCE IN DISTRIBUTION
OPTIMAL TRANSPORTATION PLANS AND CONVERGENCE IN DISTRIBUTION J.A. Cuesta-Albertos 1, C. Matrán 2 and A. Tuero-Díaz 1 1 Departamento de Matemáticas, Estadística y Computación. Universidad de Cantabria.
More informationLINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises
LINEAR REGRESSION ANALYSIS MODULE XVI Lecture - 44 Exercises Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Exercise 1 The following data has been obtained on
More informationAccounting for measurement uncertainties in industrial data analysis
Accounting for measurement uncertainties in industrial data analysis Marco S. Reis * ; Pedro M. Saraiva GEPSI-PSE Group, Department of Chemical Engineering, University of Coimbra Pólo II Pinhal de Marrocos,
More informationVerification of Continuous Forecasts
Verification of Continuous Forecasts Presented by Barbara Brown Including contributions by Tressa Fowler, Barbara Casati, Laurence Wilson, and others Exploratory methods Scatter plots Discrimination plots
More informationSimple Linear Regression for the Climate Data
Prediction Prediction Interval Temperature 0.2 0.0 0.2 0.4 0.6 0.8 320 340 360 380 CO 2 Simple Linear Regression for the Climate Data What do we do with the data? y i = Temperature of i th Year x i =CO
More informationImportance of Input Data and Uncertainty Associated with Tuning Satellite to Ground Solar Irradiation
Importance of Input Data and Uncertainty Associated with Tuning Satellite to Ground Solar Irradiation James Alfi 1, Alex Kubiniec 2, Ganesh Mani 1, James Christopherson 1, Yiping He 1, Juan Bosch 3 1 EDF
More informationModule Master Recherche Apprentissage et Fouille
Module Master Recherche Apprentissage et Fouille Michele Sebag Balazs Kegl Antoine Cornuéjols http://tao.lri.fr 19 novembre 2008 Unsupervised Learning Clustering Data Streaming Application: Clustering
More informationForecasting Wind Ramps
Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators
More informationMath 423/533: The Main Theoretical Topics
Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)
More informationPrincipal Component Analysis for Interval Data
Outline Paula Brito Fac. Economia & LIAAD-INESC TEC, Universidade do Porto ECI 2015 - Buenos Aires T3: Symbolic Data Analysis: Taking Variability in Data into Account Outline Outline 1 Introduction to
More informationLecture 10 Multiple Linear Regression
Lecture 10 Multiple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: 6.1-6.5 10-1 Topic Overview Multiple Linear Regression Model 10-2 Data for Multiple Regression Y i is the response variable
More informationGov 2002: 5. Matching
Gov 2002: 5. Matching Matthew Blackwell October 1, 2015 Where are we? Where are we going? Discussed randomized experiments, started talking about observational data. Last week: no unmeasured confounders
More informationReview of Statistics 101
Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods
More informationSupport Vector Method for Multivariate Density Estimation
Support Vector Method for Multivariate Density Estimation Vladimir N. Vapnik Royal Halloway College and AT &T Labs, 100 Schultz Dr. Red Bank, NJ 07701 vlad@research.att.com Sayan Mukherjee CBCL, MIT E25-201
More informationSTAT 4385 Topic 03: Simple Linear Regression
STAT 4385 Topic 03: Simple Linear Regression Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2017 Outline The Set-Up Exploratory Data Analysis
More informationMatrix Approach to Simple Linear Regression: An Overview
Matrix Approach to Simple Linear Regression: An Overview Aspects of matrices that you should know: Definition of a matrix Addition/subtraction/multiplication of matrices Symmetric/diagonal/identity matrix
More informationNew models for symbolic data analysis
New models for symbolic data analysis B. Beranger, H. Lin and S. A. Sisson arxiv:1809.03659v1 [stat.co] 11 Sep 2018 Abstract Symbolic data analysis (SDA) is an emerging area of statistics based on aggregating
More informationGaussian Mixture Model
Case Study : Document Retrieval MAP EM, Latent Dirichlet Allocation, Gibbs Sampling Machine Learning/Statistics for Big Data CSE599C/STAT59, University of Washington Emily Fox 0 Emily Fox February 5 th,
More informationInvestigation of Monthly Pan Evaporation in Turkey with Geostatistical Technique
Investigation of Monthly Pan Evaporation in Turkey with Geostatistical Technique Hatice Çitakoğlu 1, Murat Çobaner 1, Tefaruk Haktanir 1, 1 Department of Civil Engineering, Erciyes University, Kayseri,
More informationBayesian spatial quantile regression
Brian J. Reich and Montserrat Fuentes North Carolina State University and David B. Dunson Duke University E-mail:reich@stat.ncsu.edu Tropospheric ozone Tropospheric ozone has been linked with several adverse
More informationLeast Squares. Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 175A Winter UCSD
Least Squares Ken Kreutz-Delgado (Nuno Vasconcelos) ECE 75A Winter 0 - UCSD (Unweighted) Least Squares Assume linearity in the unnown, deterministic model parameters Scalar, additive noise model: y f (
More informationProbabilistic forecasting of solar radiation
Probabilistic forecasting of solar radiation Dr Adrian Grantham School of Information Technology and Mathematical Sciences School of Engineering 7 September 2017 Acknowledgements Funding: Collaborators:
More informationApplied Machine Learning Annalisa Marsico
Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature
More informationChapter 3. Riemannian Manifolds - I. The subject of this thesis is to extend the combinatorial curve reconstruction approach to curves
Chapter 3 Riemannian Manifolds - I The subject of this thesis is to extend the combinatorial curve reconstruction approach to curves embedded in Riemannian manifolds. A Riemannian manifold is an abstraction
More informationCorrelation and regression
1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,
More informationMathematics for Economics MA course
Mathematics for Economics MA course Simple Linear Regression Dr. Seetha Bandara Simple Regression Simple linear regression is a statistical method that allows us to summarize and study relationships between
More informationSTAT 350: Geometry of Least Squares
The Geometry of Least Squares Mathematical Basics Inner / dot product: a and b column vectors a b = a T b = a i b i a b a T b = 0 Matrix Product: A is r s B is s t (AB) rt = s A rs B st Partitioned Matrices
More informationTable of Contents. Multivariate methods. Introduction II. Introduction I
Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation
More informationRegression Models - Introduction
Regression Models - Introduction In regression models there are two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent
More informationData Analytics for Solar Energy Management
Data Analytics for Solar Energy Management Lipyeow Lim1, Duane Stevens2, Sen Chiao3, Christopher Foo1, Anthony Chang2, Todd Taomae1, Carlos Andrade1, Neha Gupta1, Gabriella Santillan2, Michael Gonzalves2,
More informationREVIEW (MULTIVARIATE LINEAR REGRESSION) Explain/Obtain the LS estimator () of the vector of coe cients (b)
REVIEW (MULTIVARIATE LINEAR REGRESSION) Explain/Obtain the LS estimator () of the vector of coe cients (b) Explain/Obtain the variance-covariance matrix of Both in the bivariate case (two regressors) and
More informationof the 7 stations. In case the number of daily ozone maxima in a month is less than 15, the corresponding monthly mean was not computed, being treated
Spatial Trends and Spatial Extremes in South Korean Ozone Seokhoon Yun University of Suwon, Department of Applied Statistics Suwon, Kyonggi-do 445-74 South Korea syun@mail.suwon.ac.kr Richard L. Smith
More informationAnalytics for an Online Retailer: Demand Forecasting and Price Optimization
Analytics for an Online Retailer: Demand Forecasting and Price Optimization Kris Johnson Ferreira Technology and Operations Management Unit, Harvard Business School, kferreira@hbs.edu Bin Hong Alex Lee
More informationstatistical sense, from the distributions of the xs. The model may now be generalized to the case of k regressors:
Wooldridge, Introductory Econometrics, d ed. Chapter 3: Multiple regression analysis: Estimation In multiple regression analysis, we extend the simple (two-variable) regression model to consider the possibility
More information12 Discriminant Analysis
12 Discriminant Analysis Discriminant analysis is used in situations where the clusters are known a priori. The aim of discriminant analysis is to classify an observation, or several observations, into
More informationECE 661: Homework 10 Fall 2014
ECE 661: Homework 10 Fall 2014 This homework consists of the following two parts: (1) Face recognition with PCA and LDA for dimensionality reduction and the nearest-neighborhood rule for classification;
More information12 - Nonparametric Density Estimation
ST 697 Fall 2017 1/49 12 - Nonparametric Density Estimation ST 697 Fall 2017 University of Alabama Density Review ST 697 Fall 2017 2/49 Continuous Random Variables ST 697 Fall 2017 3/49 1.0 0.8 F(x) 0.6
More informationPattern Recognition of Multivariate Time Series using Wavelet
Int. Statistical Inst.: Proc. 58th World Statistical Congress, 2011, Dublin (Session CPS070) p.6862 Pattern Recognition of Multivariate Time Series using Wavelet Features Maharaj, Elizabeth Ann Monash
More informationStat 602 Exam 1 Spring 2017 (corrected version)
Stat 602 Exam Spring 207 (corrected version) I have neither given nor received unauthorized assistance on this exam. Name Signed Date Name Printed This is a very long Exam. You surely won't be able to
More informationHomework 2: Simple Linear Regression
STAT 4385 Applied Regression Analysis Homework : Simple Linear Regression (Simple Linear Regression) Thirty (n = 30) College graduates who have recently entered the job market. For each student, the CGPA
More informationAsymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands
Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department
More informationthe logic of parametric tests
the logic of parametric tests define the test statistic (e.g. mean) compare the observed test statistic to a distribution calculated for random samples that are drawn from a single (normal) distribution.
More informationChapter 11 Building the Regression Model II:
Chapter 11 Building the Regression Model II: Remedial Measures( 補救措施 ) 許湘伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 11 1 / 48 11.1 WLS remedial measures may
More informationInference for Regression Simple Linear Regression
Inference for Regression Simple Linear Regression IPS Chapter 10.1 2009 W.H. Freeman and Company Objectives (IPS Chapter 10.1) Simple linear regression p Statistical model for linear regression p Estimating
More informationChapter 11 Estimation of Absolute Performance
Chapter 11 Estimation of Absolute Performance Purpose Objective: Estimate system performance via simulation If q is the system performance, the precision of the estimator qˆ can be measured by: The standard
More informationBayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes
Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Andrew O. Finley 1 and Sudipto Banerjee 2 1 Department of Forestry & Department of Geography, Michigan
More informationDescriptive Statistics for Symbolic Data
Outline Descriptive Statistics for Symbolic Data Paula Brito Fac. Economia & LIAAD-INESC TEC, Universidade do Porto ECI 2015 - Buenos Aires T3: Symbolic Data Analysis: Taking Variability in Data into Account
More informationNATCOR Regression Modelling for Time Series
Universität Hamburg Institut für Wirtschaftsinformatik Prof. Dr. D.B. Preßmar Professor Robert Fildes NATCOR Regression Modelling for Time Series The material presented has been developed with the substantial
More informationInference for the Regression Coefficient
Inference for the Regression Coefficient Recall, b 0 and b 1 are the estimates of the slope β 1 and intercept β 0 of population regression line. We can shows that b 0 and b 1 are the unbiased estimates
More informationGeneralization of the Principal Components Analysis to Histogram Data
Generalization of the Principal Components Analysis to Histogram Data Oldemar Rodríguez 1, Edwin Diday 1, and Suzanne Winsberg 2 1 University Paris 9 Dauphine, Ceremade Pl Du Ml de L de Tassigny 75016
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationBayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes
Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Alan Gelfand 1 and Andrew O. Finley 2 1 Department of Statistical Science, Duke University, Durham, North
More informationMore on Unsupervised Learning
More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data
More informationHierarchical Modelling for Multivariate Spatial Data
Hierarchical Modelling for Multivariate Spatial Data Geography 890, Hierarchical Bayesian Models for Environmental Spatial Data Analysis February 15, 2011 1 Point-referenced spatial data often come as
More informationTime Series Modeling of Histogram-valued Data The Daily Histogram Time Series of SP500 Intradaily Returns
Time Series Modeling of Histogram-valued Data The Daily Histogram Time Series of SP5 Intradaily Returns Gloria González-Rivera University of California, Riverside Department of Economics Riverside, CA
More informationSpatial Statistics with Image Analysis. Lecture L11. Home assignment 3. Lecture 11. Johan Lindström. December 5, 2016.
HA3 MRF:s Simulation Estimation Spatial Statistics with Image Analysis Lecture 11 Johan Lindström December 5, 2016 Lecture L11 Johan Lindström - johanl@maths.lth.se FMSN20/MASM25 L11 1/22 HA3 MRF:s Simulation
More informationTIME SERIES DATA PREDICTION OF NATURAL GAS CONSUMPTION USING ARIMA MODEL
International Journal of Information Technology & Management Information System (IJITMIS) Volume 7, Issue 3, Sep-Dec-2016, pp. 01 07, Article ID: IJITMIS_07_03_001 Available online at http://www.iaeme.com/ijitmis/issues.asp?jtype=ijitmis&vtype=7&itype=3
More informationBayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes
Bayesian dynamic modeling for large space-time weather datasets using Gaussian predictive processes Sudipto Banerjee 1 and Andrew O. Finley 2 1 Biostatistics, School of Public Health, University of Minnesota,
More information