
Co-clustering algorithms for histogram data

Francisco de A.T. De Carvalho, Antonio Balzanella, Antonio Irpino and Rosanna Verde

Abstract. One of the requirements of the current big-data age is the need to represent groups of data by summaries that lose as little information as possible. Recently, histograms have been used to summarize numerical variables while keeping more information about the data generation process than characteristic values such as the mean, the standard deviation, or quantiles. We propose two co-clustering algorithms for histogram data based on the double k-means algorithm. The first proposed algorithm, named distributional double k-means (DDK), extends double k-means (DK), originally proposed for standard quantitative data, to histogram data. The second algorithm, named adaptive distributional double k-means (ADDK), extends DDK with automated variable weighting, allowing co-clustering and feature selection to be performed simultaneously.

Francisco de A.T. De Carvalho
Centro de Informatica, Universidade Federal de Pernambuco, Av. Jornalista Anibal Fernandes s/n - Cidade Universitaria, CEP 50740-560, Recife-PE, Brazil, e-mail: fatc@cin.ufpe.br

Antonio Balzanella
Università della Campania L. Vanvitelli, e-mail: antonio.balzanella@unicampania.it

Antonio Irpino
Università della Campania L. Vanvitelli, e-mail: antonio.irpino@unicampania.it

Rosanna Verde
Università della Campania L. Vanvitelli, e-mail: rosanna.verde@unicampania.it

Key words: Co-clustering, Histogram data

1 Introduction

Histogram data are becoming very common in several application fields. For example, in order to preserve individuals' privacy, data about the transactions of groups of customers are released only after being aggregated; in wireless sensor networks, where energy limitations constrain the communication of data, suitable syntheses of the sensed data are necessary; official statistical institutes collect data about territorial units or administrations and release them as histograms.

Among the exploratory tools for the analysis of histogram data, this paper focuses on co-clustering, also known as bi-clustering or block clustering. The aim is to cluster simultaneously the objects and the variables of a data set [1, 2, 5, 6]. By performing permutations of rows and columns, co-clustering algorithms reorganize the initial data matrix into homogeneous blocks. These blocks, also called co-clusters, can therefore be defined as subsets of the data matrix, each characterized by a set of observations and a set of features whose elements are similar. They summarize the initial data matrix into a much smaller matrix of homogeneous blocks, or co-clusters, of similar objects and variables. Ref. [4] reviews other types of co-clustering approaches.

This paper first proposes the DDK (distributional double k-means) algorithm, whose aim is to cluster, simultaneously, the objects and the variables of a distributional-valued data set. It then introduces the ADDK (adaptive distributional double k-means) algorithm, which takes into account the relevance of the variables in the co-clustering optimization criterion. Conventional co-clustering methods do not take into account the relevance of the variables, i.e., they consider all variables equally important to the co-clustering task; however, in most applications some variables may be irrelevant and, among the relevant ones, some may be more or less relevant than others. DDK and ADDK use, respectively, suitable non-adaptive and adaptive Wasserstein distances to compare distributional-valued data during the co-clustering task.

2 Co-clustering algorithms for distributional-valued data

Let $E = \{e_1,\dots,e_N\}$ be a set of $N$ objects described by a set of $P$ distributional-valued variables denoted by $Y_j$ ($1 \le j \le P$).
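In practice, each histogram-valued cell of such a data set can be handled through its quantile function, which is the representation underlying the Wasserstein distance used below. As a minimal illustrative sketch, not taken from the paper, the following Python code discretizes a histogram into its quantile function evaluated on a fixed probability grid; the function name, the grid, and the within-bin uniformity assumption are ours.

```python
import numpy as np

def histogram_to_quantiles(edges, counts, grid):
    """Turn a histogram (bin edges and counts) into its quantile
    function evaluated on a probability grid, assuming values are
    uniformly distributed within each bin."""
    w = np.asarray(counts, dtype=float)
    w /= w.sum()                                  # bin probabilities
    cdf = np.concatenate([[0.0], np.cumsum(w)])   # cdf at the bin edges
    return np.interp(grid, cdf, edges)            # inverse cdf by interpolation

# a hypothetical histogram-valued cell y_ij, e.g. aggregated transactions
grid = np.linspace(0.005, 0.995, 100)             # uniform grid on (0, 1)
y_ij = histogram_to_quantiles(edges=[0, 10, 20, 50], counts=[120, 45, 15], grid=grid)
```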

Let $\mathcal{Y} = \{Y_1,\dots,Y_P\}$ be the set of the $P$ distributional-valued variables and let $\mathbf{Y} = (y_{ij})_{1 \le i \le N,\, 1 \le j \le P}$ be a distributional-valued data matrix of size $N \times P$, where $y_{ij}$ denotes the distributional datum observed on variable $Y_j$ for the $i$-th object. Our aim is to obtain a co-clustering of $\mathbf{Y}$, i.e., to obtain simultaneously a partition $P = \{P_1,\dots,P_C\}$ of the set of $N$ objects into $C$ clusters and a partition $Q = \{Q_1,\dots,Q_H\}$ of the set of $P$ distributional-valued variables into $H$ clusters.

The co-clustering can be formulated as the search for a good matrix approximation of the original distributional-valued data matrix $\mathbf{Y}$ by a $C \times H$ matrix $G = (g_{kh})_{1 \le k \le C,\, 1 \le h \le H}$, which can be viewed as a summary of $\mathbf{Y}$ (see, for example, Refs. [2, 5, 6]). Each element $g_{kh}$ of $G$ is also called the prototype of the co-cluster $\mathbf{Y}_{kh} = (y_{ij})_{e_i \in P_k,\, Y_j \in Q_h}$. Moreover, each $g_{kh}$ is a distributional datum, with distribution function $G_{kh}$ and quantile function $Q_{g_{kh}}$.

In order to obtain a co-clustering that faithfully represents the distributional-valued data set $\mathbf{Y}$, the matrix $G$ of prototypes, the partition $P$ of the objects and the partition $Q$ of the distributional-valued variables are obtained iteratively by minimizing an error function $J_{DDK}$, computed as follows:

$$J_{DDK}(G,P,Q) = \sum_{k=1}^{C} \sum_{h=1}^{H} \sum_{e_i \in P_k} \sum_{Y_j \in Q_h} d_W^2(y_{ij}, g_{kh}) \qquad (1)$$

where $d_W^2$ is the non-adaptive squared $L_2$ Wasserstein distance computed between the element $y_{ij}$ of the distributional data matrix $\mathbf{Y}$ and the prototype $g_{kh}$ of the co-cluster $\mathbf{Y}_{kh}$.

In order to obtain an adaptive version of the DDK algorithm that evaluates the relevance of each variable in the co-clustering process, we also propose the local minimization of the following criterion function:

$$J_{ADDK}(G,\Lambda,P,Q) = \sum_{k=1}^{C} \sum_{h=1}^{H} \sum_{e_i \in P_k} \sum_{Y_j \in Q_h} \lambda_j\, d_W^2(y_{ij}, g_{kh}) \qquad (2)$$

where $\Lambda = (\lambda_j)_{j=1,\dots,P}$ (with $\lambda_j > 0$ and $\prod_{j=1}^{P} \lambda_j = 1$) is a vector of positive weights measuring the importance of each distributional-valued variable.

The minimization of the $J_{DDK}$ criterion is performed iteratively in three steps: representation, object assignment and distributional-valued variable assignment.
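For univariate distributions, the squared $L_2$ Wasserstein distance can be written as the integral over $t \in (0,1)$ of $(Q_1(t) - Q_2(t))^2$, where $Q_1$ and $Q_2$ are the two quantile functions [3]. A minimal numerical sketch of $d_W^2$ and of criterion (1), assuming the discretized quantile representation introduced above; the array shapes and names are ours, not the paper's:

```python
import numpy as np

def wasserstein2_sq(q1, q2):
    """Squared L2 Wasserstein distance between two univariate
    distributions given by their quantile functions sampled on a
    uniform probability grid: the mean of the squared differences
    approximates the integral over (0, 1)."""
    return float(np.mean((q1 - q2) ** 2))

def J_DDK(Q, G, P_lab, Q_lab):
    """Criterion (1). Q: (N, P, T) array of quantile functions;
    G: (C, H, T) array of prototypes; P_lab, Q_lab: integer cluster
    labels of the objects and of the variables."""
    N, Pv, _ = Q.shape
    return sum(wasserstein2_sq(Q[i, j], G[P_lab[i], Q_lab[j]])
               for i in range(N) for j in range(Pv))
```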

The representation step gives the optimal solution for the computation of the representatives (prototypes) of the co-clusters. The object assignment step provides the optimal solution for the partition of the objects. Finally, the variable assignment step provides the optimal solution for the partition of the variables. The three steps are iterated until convergence to a stable optimal solution. The minimization of the $J_{ADDK}$ criterion requires a further weighting step, which provides the optimal solution for the computation of the relevance weights of the distributional-valued variables.

Representation step in DDK and ADDK

In the representation step, the DDK algorithm aims to find, for $k=1,\dots,C$ and $h=1,\dots,H$, the prototype $g_{kh}$ such that $\sum_{e_i \in P_k} \sum_{Y_j \in Q_h} d_W^2(y_{ij}, g_{kh})$ is minimal. According to [3], the quantile function associated with the corresponding probability density function (pdf) $g_{kh}$ ($1 \le k \le C$; $1 \le h \le H$) is:

$$Q_{g_{kh}} = \frac{1}{n_k\, n_h} \sum_{e_i \in P_k} \sum_{Y_j \in Q_h} Q_{ij} \qquad (3)$$

where $n_k$ is the cardinality of $P_k$ and $n_h$ is the cardinality of $Q_h$.

The ADDK algorithm aims to find, for $k=1,\dots,C$ and $h=1,\dots,H$, the prototype $g_{kh}$ such that $\sum_{e_i \in P_k} \sum_{Y_j \in Q_h} \lambda_j\, d_W^2(y_{ij}, g_{kh})$ is minimal. Under the constraints $\prod_{j=1}^{P} \lambda_j = 1$, $\lambda_j > 0$, the quantile function associated with the corresponding pdf $g_{kh}$ ($1 \le k \le C$; $1 \le h \le H$) is computed as follows:

$$Q_{g_{kh}} = \frac{\sum_{e_i \in P_k} \sum_{Y_j \in Q_h} \lambda_j\, Q_{ij}}{n_k \sum_{Y_j \in Q_h} \lambda_j} \qquad (4)$$
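Under the discretized representation used in the sketches above, Eqs. (3) and (4) are pointwise (weighted) averages of the member quantile functions, so the representation step reduces to an array average. A sketch under those assumptions (the naming is ours; empty co-clusters are not handled):

```python
import numpy as np

def update_prototypes(Q, P_lab, Q_lab, C, H, lam=None):
    """Representation step: Eq. (3) when lam is None, Eq. (4) otherwise.
    Q: (N, P, T) quantile functions; P_lab, Q_lab: integer label arrays.
    Returns the (C, H, T) prototypes as (weighted) averages of the
    member quantile functions."""
    N, Pv, T = Q.shape
    lam = np.ones(Pv) if lam is None else np.asarray(lam, dtype=float)
    G = np.zeros((C, H, T))
    for k in range(C):
        rows = np.where(P_lab == k)[0]          # objects e_i in P_k
        for h in range(H):
            cols = np.where(Q_lab == h)[0]      # variables Y_j in Q_h
            block = Q[np.ix_(rows, cols)]       # (n_k, n_h, T) co-cluster block
            num = (lam[cols][None, :, None] * block).sum(axis=(0, 1))
            G[k, h] = num / (len(rows) * lam[cols].sum())
    return G
```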

Object assignment step in DDK and ADDK

During the object assignment step of DDK, the matrix of co-cluster prototypes $G$ and the partition $Q$ of the distributional-valued variables are kept fixed. The error function $J_{DDK}$ is minimized with respect to the partition $P$ of the objects, and each object $e_i \in E$ is assigned to its nearest co-cluster prototype.

Proposition 1. The error function $J_{DDK}$ (Eq. 1) is minimized with respect to the partition $P$ of the objects when the clusters $P_k$ ($k=1,\dots,C$) are updated according to the following assignment function:

$$P_k = \left\{ e_i \in E : \sum_{h=1}^{H} \sum_{Y_j \in Q_h} d_W^2(y_{ij}, g_{kh}) = \min_{z=1,\dots,C} \sum_{h=1}^{H} \sum_{Y_j \in Q_h} d_W^2(y_{ij}, g_{zh}) \right\}$$

The error function $J_{ADDK}$ is likewise minimized with respect to the partition $P$, and each object $e_i \in E$ is assigned to its nearest co-cluster prototype.

Proposition 2. The error function $J_{ADDK}$ (Eq. 2) is minimized with respect to the partition $P$ of the objects when the clusters $P_k$ ($k=1,\dots,C$) are updated according to the following assignment function:

$$P_k = \left\{ e_i \in E : \sum_{h=1}^{H} \sum_{Y_j \in Q_h} \lambda_j\, d_W^2(y_{ij}, g_{kh}) = \min_{z=1,\dots,C} \sum_{h=1}^{H} \sum_{Y_j \in Q_h} \lambda_j\, d_W^2(y_{ij}, g_{zh}) \right\}$$

Variable assignment step in DDK and ADDK

During the variable assignment step of DDK, the matrix of prototypes $G$ and the partition $P$ of the objects are kept fixed. The error function $J_{DDK}$ is minimized with respect to the partition $Q$ of the distributional-valued variables, and each variable $Y_j \in \mathcal{Y}$ is assigned to its nearest co-cluster prototype.

Proposition 3. The error function $J_{DDK}$ (Eq. 1) is minimized with respect to the partition $Q$ of the distributional-valued variables when the clusters $Q_h$ ($h=1,\dots,H$) are updated according to the following assignment function:

$$Q_h = \left\{ Y_j \in \mathcal{Y} : \sum_{k=1}^{C} \sum_{e_i \in P_k} d_W^2(y_{ij}, g_{kh}) = \min_{z=1,\dots,H} \sum_{k=1}^{C} \sum_{e_i \in P_k} d_W^2(y_{ij}, g_{kz}) \right\}$$

where $d_W^2(y_{ij}, g_{kz})$ is the squared $L_2$ Wasserstein distance.

Proposition 4. The error function $J_{ADDK}$ (Eq. 2) is minimized with respect to the partition $Q$ of the distributional-valued variables when the clusters $Q_h$ ($h=1,\dots,H$) are updated according to the following assignment function:

$$Q_h = \left\{ Y_j \in \mathcal{Y} : \sum_{k=1}^{C} \sum_{e_i \in P_k} \lambda_j\, d_W^2(y_{ij}, g_{kh}) = \min_{z=1,\dots,H} \sum_{k=1}^{C} \sum_{e_i \in P_k} \lambda_j\, d_W^2(y_{ij}, g_{kz}) \right\}$$

Weighting step for ADDK

The weighting step provides the optimal solution for the computation of the relevance weight of each distributional-valued variable. During this step of ADDK, the matrix of prototypes $G$, the partition $P$ of the objects and the partition $Q$ of the distributional-valued variables are kept fixed, and the error function $J_{ADDK}$ is minimized with respect to the weights $\lambda_j$.

Proposition 5. Assuming $\prod_{j=1}^{P} \lambda_j = 1$ and $\lambda_j > 0$, the $P$ relevance weights entering the adaptive squared $L_2$ Wasserstein distance are computed as follows:

$$\lambda_j = \frac{\left( \prod_{r=1}^{P}\, \sum_{k=1}^{C} \sum_{h=1}^{H} \sum_{\substack{e_i \in P_k \\ Y_r \in Q_h}} d_W^2(y_{ir}, g_{kh}) \right)^{1/P}}{\sum_{k=1}^{C} \sum_{h=1}^{H} \sum_{\substack{e_i \in P_k \\ Y_j \in Q_h}} d_W^2(y_{ij}, g_{kh})} \qquad (5)$$
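A sketch of the three remaining updates under the same assumptions, reusing wasserstein2_sq from the earlier sketch. Note that the numerator of Eq. (5) is the geometric mean of the per-variable within-co-cluster dispersions, which is how the weights are computed here:

```python
import numpy as np

def assign_objects(Q, G, Q_lab, lam):
    """Propositions 1-2: each object joins the row cluster whose
    prototypes minimize the (weighted) sum of distances over its cells."""
    N, Pv, _ = Q.shape
    C = G.shape[0]
    cost = np.array([[sum(lam[j] * wasserstein2_sq(Q[i, j], G[k, Q_lab[j]])
                          for j in range(Pv))
                      for k in range(C)] for i in range(N)])
    return cost.argmin(axis=1)

def assign_variables(Q, G, P_lab, lam):
    """Propositions 3-4: symmetric update of the column clusters
    (the factor lam[j] does not change the argmin but mirrors Eq. 2)."""
    N, Pv, _ = Q.shape
    H = G.shape[1]
    cost = np.array([[sum(lam[j] * wasserstein2_sq(Q[i, j], G[P_lab[i], h])
                          for i in range(N))
                      for h in range(H)] for j in range(Pv)])
    return cost.argmin(axis=1)

def update_weights(Q, G, P_lab, Q_lab):
    """Proposition 5 / Eq. (5): within-co-cluster dispersion of each
    variable, turned into a weight whose product over variables is 1."""
    N, Pv, _ = Q.shape
    delta = np.array([sum(wasserstein2_sq(Q[i, j], G[P_lab[i], Q_lab[j]])
                          for i in range(N)) for j in range(Pv)])
    # (prod_r delta_r)^(1/P) / delta_j, computed in log space for stability
    return np.exp(np.log(delta).mean()) / delta
```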

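Putting the steps together, one possible driver loop is sketched below. It is our own illustration, not the authors' reference implementation: initialization is random, the update order is one natural choice, and empty clusters are not handled.

```python
import numpy as np

def addk(Q, C, H, n_iter=100, seed=0):
    """A possible ADDK driver combining the sketches above.
    Q: (N, P, T) array of discretized quantile functions.
    Returns object labels, variable labels, weights and prototypes."""
    rng = np.random.default_rng(seed)
    N, Pv, _ = Q.shape
    P_lab = rng.integers(0, C, size=N)      # random initial partitions
    Q_lab = rng.integers(0, H, size=Pv)
    lam = np.ones(Pv)                       # product of the weights is 1
    for _ in range(n_iter):
        G = update_prototypes(Q, P_lab, Q_lab, C, H, lam)
        new_P = assign_objects(Q, G, Q_lab, lam)
        new_Q = assign_variables(Q, G, new_P, lam)
        lam = update_weights(Q, G, new_P, new_Q)
        if np.array_equal(new_P, P_lab) and np.array_equal(new_Q, Q_lab):
            break                           # stable partitions: converged
        P_lab, Q_lab = new_P, new_Q
    return P_lab, Q_lab, lam, G
```

Setting lam to a fixed vector of ones and skipping update_weights yields the corresponding DDK iteration.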
3 Conclusions

In this paper we have introduced two algorithms, based on double k-means, for performing the co-clustering of a distributional-valued data matrix. The main difference between the two algorithms is that ADDK integrates into the optimization criterion the search for a set of weights measuring the relevance of each variable. In order to evaluate the effectiveness of the proposal, we have carried out preliminary tests on real and simulated data, with encouraging results.

References

1. Govaert, G.: Simultaneous clustering of rows and columns. Control and Cybernetics, 24, pp. 437-458 (1995)
2. Govaert, G., Nadif, M.: Co-Clustering: Models, Algorithms and Applications. Wiley, New York (2015)
3. Irpino, A., Verde, R.: Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, 9(2), pp. 143-175. Springer, Berlin Heidelberg (2015)
4. Pontes, R., Giraldez, R., Aguilar-Ruiz, J.S.: Biclustering on expression data: A review. Journal of Biomedical Informatics, 57, pp. 163-180 (2015)
5. Rocci, R., Vichi, M.: Two-mode multi-partitioning. Computational Statistics & Data Analysis, 52, pp. 1984-2003 (2008)
6. Vichi, M.: Double k-means Clustering for Simultaneous Classification of Objects and Variables. In: Advances in Classification and Data Analysis. Studies in Classification, Data Analysis, and Knowledge Organization, pp. 43-52. Springer (2001)