Distributions are the numbers of today: From histogram data to distributional data. Javier Arroyo Gallardo, Universidad Complutense de Madrid

2 Introduction

3 Symbolic data. Symbolic data was introduced by Edwin Diday in 1987 to represent variability. Symbolic variables make it possible to describe groups of individuals and concepts. They include list-of-values variables (with or without weights), interval variables, and histogram variables. Symbolic representations can also include internal structure (hierarchies) and logical dependencies (rules).

4 Interval data. Interval data is by far the most popular symbolic representation. It arises naturally in several contexts, for example meteorology and finance. It is easier to propose methods for interval variables than for other symbolic variables. Intervals are represented by two values: minimum and maximum, or center and radius. The truly symbolic methods are those that deal with the interval representation itself, not those that treat an interval as two classical variables.
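The two equivalent encodings of an interval mentioned above can be sketched as follows. This is only an illustration: the `Interval` class and its method names are hypothetical and are not taken from any toolbox cited in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical interval-valued observation stored as (lower, upper)."""
    lower: float
    upper: float

    def __post_init__(self):
        if self.lower > self.upper:
            raise ValueError("lower bound must not exceed upper bound")

    @property
    def center(self):
        # Midpoint of the interval.
        return (self.lower + self.upper) / 2.0

    @property
    def radius(self):
        # Half of the interval length.
        return (self.upper - self.lower) / 2.0

    @classmethod
    def from_center_radius(cls, center, radius):
        # Inverse mapping: (center, radius) back to (lower, upper).
        return cls(center - radius, center + radius)


# Example: a daily temperature range (made-up values).
daily_temp = Interval(12.0, 21.0)
same = Interval.from_center_radius(daily_temp.center, daily_temp.radius)
```

Both encodings carry the same information, so a method can use whichever is more convenient as long as it treats the pair as one object.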

5 Histogram data. Histogram data is the younger brother of interval data. The histogram is a statistical tool: histogram data does not arise naturally, and building a histogram requires choosing parameters. Fewer methods have been developed for histogram data, because the histogram representation is considerably more sophisticated than the interval one. However, the situation has been changing in recent years, and histogram (and distributional) data is developing its potential. Distributions are no longer "the numbers of the future" (Schweizer, 1984): they are the numbers of the present!

6 From histogram data to distributional data

7 From histogram data to distributional data. Distributional variables make it possible to describe each unit of interest by means of its observed data distribution. A distribution does not summarize the data by means of statistics such as the mean, variance, minimum, or maximum: IT IS THE DATA! The distribution has to be represented in some way, and the practitioner can focus on the representation that best fits the problem. There are two types of representation: binned density estimators, such as histograms, and smooth, continuous density estimators.

8 Binned density estimators. Binned density estimators offer the analyst a great variety of choices. Histograms with a fixed bin width (the classical histogram): if an accurate representation is needed, there are rules to determine the optimal bin width (Wand, 1997). Equifrequency histograms (such as boxplots). Histograms defined on a given partition of the range of the variable of interest: Fink et al. (2009) propose a method to accurately estimate histograms with variable-width bars. Histograms defined as a sequence of quantiles of specific interest. The frequency polygon: Delicado and del Río (2003) propose a generalization of frequency polygons to accurately estimate distributions.
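As a rough illustration of the first two choices, the sketch below builds a fixed-width histogram and an equifrequency histogram from the same sample. The function names and the toy data are assumptions made for this example, and the bin edges of the equifrequency version come from simple linear interpolation between order statistics, which is only one of several quantile conventions.

```python
def fixed_width_histogram(data, n_bins):
    """Classical histogram: equal-width bins, relative frequencies."""
    lo, hi = min(data), max(data)
    if hi == lo:
        raise ValueError("data must contain at least two distinct values")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in data:
        # The maximum is assigned to the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    freqs = [c / len(data) for c in counts]
    return edges, freqs


def equifrequency_histogram(data, n_bins):
    """Bins whose edges are sample quantiles, so each bin has nominal mass 1/n_bins."""
    s = sorted(data)
    n = len(s)
    edges = []
    for i in range(n_bins + 1):
        # Linear interpolation between order statistics (an assumed convention).
        pos = i * (n - 1) / n_bins
        k = int(pos)
        nxt = s[min(k + 1, n - 1)]
        edges.append(s[k] + (pos - k) * (nxt - s[k]))
    freqs = [1.0 / n_bins] * n_bins
    return edges, freqs


# Toy sample (made-up values).
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
fw_edges, fw_freqs = fixed_width_histogram(data, 4)
eq_edges, eq_freqs = equifrequency_histogram(data, 2)
```

The fixed-width version lets the frequencies vary and fixes the edges; the equifrequency version does the opposite, which is why it adapts better to skewed samples such as this one.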

9 Binned density estimators. Binned density estimators have some problems too: they are not smooth, and they depend on the end points and on the width of the bins.

10 Smooth estimators. If the analyst wants a smooth representation and the underlying distribution is known, parametric methods can be used. Kernel density estimation methods (Simonoff, 1998) require a kernel (uniform, triangular, Epanechnikov, normal, etc.) and the selection of the kernel bandwidth. The bandwidth should be as small as the data allow; it controls the trade-off between the bias of the estimator and its variance. The value usually chosen is the one that minimizes the AMISE (asymptotic mean integrated squared error). Jones et al. (1996) survey methods to estimate the bandwidth.
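A kernel density estimate with a normal kernel can be sketched in a few lines. The bandwidth here comes from the normal-reference rule of thumb, h = (4 / (3n))^(1/5) * sigma, which minimizes the AMISE only under the assumption that the data are themselves normal; it stands in for the data-driven selectors surveyed by Jones et al. (1996) and is not one of them. All function names are hypothetical.

```python
import math
import random
import statistics


def normal_reference_bandwidth(data):
    """Rule-of-thumb bandwidth; AMISE-optimal only if the data are normal (assumption)."""
    n = len(data)
    sigma = statistics.stdev(data)
    return (4.0 / (3.0 * n)) ** 0.2 * sigma


def gaussian_kde(data, h):
    """Return the kernel density estimate with a normal kernel and bandwidth h."""
    n = len(data)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

    return density


# Simulated sample from N(0, 1), for illustration only.
rng = random.Random(42)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
h = normal_reference_bandwidth(sample)
density = gaussian_kde(sample, h)

# Crude Riemann sum to check that the estimate integrates to about 1.
step = 0.05
area = sum(density(-6.0 + i * step) for i in range(241)) * step
```

A smaller `h` gives a wigglier estimate (less bias, more variance) and a larger `h` a flatter one, which is the trade-off the slide refers to.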

11 The quantile function

12 The quantile function. The quantile function (QF) is the inverse of the cumulative distribution function. [Figure: density function, cumulative distribution function, and quantile function.]
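For a histogram, the inverse relationship between the cumulative distribution function and the QF can be made concrete. The sketch below assumes that data are uniformly distributed inside each bin, so both functions are piecewise linear; the function names and the example histogram are made up for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def histogram_cdf(edges, freqs, x):
    """CDF of a histogram, assuming a uniform density inside each bin."""
    if x <= edges[0]:
        return 0.0
    if x >= edges[-1]:
        return 1.0
    cum = [0.0] + list(accumulate(freqs))
    i = bisect_right(edges, x) - 1  # index of the bin that contains x
    return cum[i] + freqs[i] * (x - edges[i]) / (edges[i + 1] - edges[i])


def histogram_quantile(edges, freqs, t):
    """Quantile function: inverse of histogram_cdf on [0, 1]."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    cum = [0.0] + list(accumulate(freqs))
    for i, f in enumerate(freqs):
        # Skip empty bins; stop at the first bin whose cumulative mass reaches t.
        if f > 0 and t <= cum[i + 1]:
            return edges[i] + (t - cum[i]) / f * (edges[i + 1] - edges[i])
    return edges[-1]


# Example histogram with three bins (made-up values).
edges = [0.0, 1.0, 2.0, 4.0]
freqs = [0.25, 0.5, 0.25]
median = histogram_quantile(edges, freqs, 0.5)
```

Evaluating `histogram_cdf` at a value returned by `histogram_quantile` gives back the original level, which is what "inverse" means here.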

13 The quantile function. The QF provides a common framework to represent data described by histograms, intervals, nominal multi-value types, etc. (Ichino, 2011). The QF is the conceptual underpinning of many of the methods proposed for histogram data. The QF has some interesting features. Its X-axis has a fixed range, [0,1]. The Wasserstein metrics for QFs make it possible to compute distribution-valued central tendency moments of distributional data and their associated dispersion measures. These distribution-valued moments are the basis of many methods for distributional data, e.g. methods based on the concept of average.
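For reference, the l2 Wasserstein distance and the associated mean can be written in terms of quantile functions. The notation below (F and G for distribution functions, F^{-1} for the QF, \bar{Q} for the mean QF of n distributions) is chosen for this note and does not appear on the slides.

```latex
d_W(F, G) = \left( \int_0^1 \bigl( F^{-1}(t) - G^{-1}(t) \bigr)^2 \, dt \right)^{1/2},
\qquad
\bar{Q}(t) = \frac{1}{n} \sum_{i=1}^{n} F_i^{-1}(t), \quad t \in [0, 1].
```

The mean QF \bar{Q} minimizes the sum of squared l2 Wasserstein distances to the n distributions, which is why averaging in this framework reduces to averaging quantile functions pointwise. For two normal distributions the squared distance has the closed form (\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2.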

14 The catalogue of methods

15 Methods proposed for distributional data. Brito (2012) provides a very nice survey of the topic. The catalogue of methods for distributional data includes descriptive statistics, regression, clustering, dimensionality reduction, time series forecasting, and visualization methods.

16 Methods proposed for distributional data. Descriptive statistics: Bertrand and Goupil (2000) propose real-valued univariate and bivariate statistics, extended by Billard and Diday (2006); Irpino and Verde (2014) propose a new set of real-valued statistics based on the l2 Wasserstein distance; Rivoli et al. (2012) propose central tendency moments in distribution form based on the family of Wasserstein metrics. Regression: Billard and Diday (2006) proposed the first model, based on real-valued univariate and bivariate statistics; Irpino and Verde (2012) and Dias and Brito (2013) propose regression models based on the QF and the l2 Wasserstein distance.

17 Methods proposed for distributional data. Clustering: there are many approaches to clustering this kind of data. Simple hierarchical models just need an appropriate distance. Irpino and Verde (2006) propose a dynamic clustering method based on the l2 Wasserstein distance, which averaged histograms for the first time. Brito and Ichino (2011) propose hierarchical clustering methods based on quantile representations of the different types of data. Brito and Chavent (2012) propose a divisive algorithm that works with interval and/or histogram-valued variables using appropriate distances.
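One way to average histograms under the l2 Wasserstein distance is to average their quantile functions pointwise on a common grid of probability levels. The sketch below illustrates that idea only; it is not claimed to reproduce the algorithm of Irpino and Verde (2006), it assumes a uniform density inside each bin, and the function names are hypothetical.

```python
def histogram_quantile(edges, freqs, t):
    """Quantile of a histogram, assuming data are uniform inside each bin."""
    cum = 0.0
    for i, f in enumerate(freqs):
        if f > 0 and t <= cum + f:
            return edges[i] + (t - cum) / f * (edges[i + 1] - edges[i])
        cum += f
    return edges[-1]


def mean_quantile_function(histograms, levels):
    """Pointwise average of the quantile functions at the given levels."""
    return [
        sum(histogram_quantile(e, f, t) for e, f in histograms) / len(histograms)
        for t in levels
    ]


# Two one-bin histograms (made-up values): uniform on [0, 2] and on [4, 8].
h1 = ([0.0, 2.0], [1.0])
h2 = ([4.0, 8.0], [1.0])
levels = [0.0, 0.25, 0.5, 0.75, 1.0]

# Edges of the averaged histogram; each of its four bins has mass 0.25.
mean_edges = mean_quantile_function([h1, h2], levels)
```

In this example the average is the uniform distribution on [2, 5]: both location and spread are averaged, whereas averaging the two densities pointwise would give a bimodal mixture instead.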

18 Methods proposed for distributional data. Dimensionality reduction: Rodriguez et al. (2000) propose a principal component analysis (PCA) for histograms, where histograms are treated as a succession of nested intervals; Nagabhushan and Pradeep Kumar (2007) propose a histogram arithmetic to extend PCA to a simplified version of histogram data (closer to compositional data); Delicado (2011) extends PCA to density functions using functional data methods and tools from compositional data (Egozcue et al., 2006); Ichino (2011) adapts PCA to symbolic data using the quantile representation. Time series forecasting: Arroyo and Maté (2009) adapt the k-NN method and Arroyo et al. (2011) adapt exponential smoothing methods. All these methods are based on Wasserstein metrics.

19 Methods proposed for distributional data. Visualization methods: Sopan et al. (2013) propose methods to visualize distributional data sets intuitively (distributions of distributions).

20 How do methods work with both representations? Methods based on quantile functions and/or distances between quantile functions need no adaptation to work with both histograms and smooth distributions. Methods based on bin representations are meant to deal with histogram representations, but they can be adapted to work with smooth representations: first, the smooth representation is estimated; second, it is transformed into a histogram with a sufficiently large number of bins.

21 How do methods work with both representations? The smooth representation can be transformed into a histogram with a sufficiently large number of bins.
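The transformation from a smooth representation to a histogram can be sketched by differencing the smooth CDF on a fine grid. The example assumes a normal distribution truncated to [-6, 6] and 200 equal-width bins; the function names, the range, and the bin count are arbitrary choices for illustration.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution, used here as the smooth representation."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def smooth_to_histogram(cdf, lo, hi, n_bins):
    """Discretize a smooth CDF into an equal-width histogram on [lo, hi]."""
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    mass = [cdf(b) - cdf(a) for a, b in zip(edges, edges[1:])]
    total = sum(mass)  # renormalize the mass that falls inside [lo, hi]
    freqs = [m / total for m in mass]
    return edges, freqs


edges, freqs = smooth_to_histogram(normal_cdf, -6.0, 6.0, 200)

# Moments of the histogram computed from bin midpoints, to check the approximation.
mids = [(a + b) / 2.0 for a, b in zip(edges, edges[1:])]
hist_mean = sum(f * m for f, m in zip(freqs, mids))
hist_var = sum(f * (m - hist_mean) ** 2 for f, m in zip(freqs, mids))
```

With enough bins, the histogram's mean and variance are close to those of the smooth distribution, so bin-based methods can be applied to it with little loss.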

22 How about the performance of this trick? Setting: l2 Wasserstein distance for histogram data, computed in Matlab 7 R2010 on a laptop PC (1.8 GHz, 2 cores, 8 GB RAM). Simulate two sets of 10^6 data points from N(1,1) and N(3,1), estimate the two quantile histograms, and measure the l2 Wasserstein distance between them (repeated 1000 times). Table columns: # bins, time (sec), distance; reference row: Smooth, -, 2. The Matlab code could still be optimized.
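A scaled-down version of this experiment can be sketched in Python. It is not the author's Matlab code and makes no timing claims: the sample size is reduced to 20,000 per distribution, the random seed is arbitrary, and the function names are hypothetical. Because both quantile histograms share the same probability levels, the squared distance can be integrated exactly bin by bin.

```python
import math
import random


def quantile_histogram(data, n_bins):
    """Edges of an equiprobable histogram (sample quantiles at levels i / n_bins)."""
    s = sorted(data)
    n = len(s)
    edges = []
    for i in range(n_bins + 1):
        pos = i * (n - 1) / n_bins
        k = int(pos)
        nxt = s[min(k + 1, n - 1)]
        edges.append(s[k] + (pos - k) * (nxt - s[k]))
    return edges


def l2_wasserstein(edges_p, edges_q):
    """l2 Wasserstein distance between two equiprobable histograms with the same
    number of bins, assuming piecewise-linear quantile functions."""
    n_bins = len(edges_p) - 1
    w = 1.0 / n_bins
    total = 0.0
    for i in range(n_bins):
        a = edges_p[i] - edges_q[i]          # QF difference at the left level
        b = edges_p[i + 1] - edges_q[i + 1]  # QF difference at the right level
        # Exact integral of a squared linear function over a bin of mass w.
        total += w * (a * a + a * b + b * b) / 3.0
    return math.sqrt(total)


rng = random.Random(0)
n = 20_000  # reduced from the 10^6 used on the slide (assumption for speed)
x = [rng.gauss(1.0, 1.0) for _ in range(n)]
y = [rng.gauss(3.0, 1.0) for _ in range(n)]

distances = {
    b: l2_wasserstein(quantile_histogram(x, b), quantile_histogram(y, b))
    for b in (10, 100, 1000)
}
```

Since the two distributions differ only by a shift of 2, the estimated distance should stay close to the smooth reference value of 2 as the number of bins grows.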

23 Some contributions from other fields

24 Distributions as seen from other fields. Compositional data (Aitchison, 1986) is a related field: non-negative data with a constant sum that provide a quantitative description of the parts of some whole. Egozcue et al. (2006) say that density functions are infinite-dimensional compositional data. Distributional data can also be considered a particular case of functional data (Ramsay and Silverman, 1997). However, extending functional methods to distributional data is not straightforward: pointwise operations for functional data, such as linear combination, do not work, so alternative ways of operating with distributions are needed, such as Aitchison geometry (Aitchison, 1986), which provides operations and distances, and convolution operators.
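The basic operations of Aitchison geometry for finite compositions can be sketched as follows: closure (rescaling to unit sum), perturbation (the analogue of addition), powering (the analogue of scalar multiplication), and the Aitchison distance computed through the centred log-ratio transform. The function names and example compositions are chosen for this illustration, and all parts are assumed strictly positive.

```python
import math


def closure(x):
    """Rescale a vector of strictly positive parts so that it sums to 1."""
    if any(v <= 0 for v in x):
        raise ValueError("all parts must be strictly positive")
    s = sum(x)
    return [v / s for v in x]


def perturb(x, y):
    """Perturbation: componentwise product followed by closure."""
    return closure([a * b for a, b in zip(x, y)])


def power(x, alpha):
    """Powering: componentwise power followed by closure."""
    return closure([a ** alpha for a in x])


def clr(x):
    """Centred log-ratio transform."""
    logs = [math.log(v) for v in x]
    g = sum(logs) / len(logs)
    return [v - g for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))


# Made-up three-part compositions.
p = closure([1.0, 2.0, 7.0])
q = closure([3.0, 3.0, 4.0])
shift = closure([5.0, 1.0, 1.0])

d_before = aitchison_distance(p, q)
d_after = aitchison_distance(perturb(p, shift), perturb(q, shift))
```

The distance is unchanged when both compositions are perturbed by the same `shift`, mirroring how Euclidean distance is invariant under translation; this is the sense in which perturbation replaces ordinary addition.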

25 Applications everywhere!

26 Applications everywhere! Image analysis: images are represented by histograms, with applications in fields such as precision agriculture or medicine (Sharma et al., 2013). Sensor and radar data: spatial and temporal aggregation, for example of river level data and cloud sensor data. Wind power production: aggregation of the generators in a wind farm. Finance and economics: contemporaneous and temporal aggregation of financial returns, with applications in portfolio management and Value at Risk (Arroyo et al., 2011).

27 Applications everywhere! Official statistics (distributions of variables in the population): population pyramids (Delicado, 2011) and income distributions (Kneip and Utikal, 2001; Delicado, 2007). Rating distributions: recommendation systems and rating websites (Sopan et al., 2013), and trust and reputation systems. Opportunities in big data: SDA 2014 and COMPSTAT 2014 sessions! And in many other fields, e.g. social network data and computer logs (web, search engines, etc.). If a mean or a sample is used as an aggregation tool to make the data manageable, then a distribution can be used instead.

28 Conclusions. Distributional data is now a mature field in terms of theory: it has a homogeneous conceptual underpinning, a diverse catalogue of methods, and high applicability to real-life problems. However, it is barely known and barely used outside the symbolic family. So we have a mission: spread the word, propose interesting applications, and explore possible synergies with other fields, such as functional data or compositional data.

29 How can I analyze my distributional data? There is a plan to develop a toolbox for distributional data. It will target Matlab, Octave and R, be open source, and support binned and smooth representations. It is to include distribution estimation and visualization, descriptive statistics, clustering, regression, time series forecasting and, eventually, all the methods proposed. People involved: Antonio Irpino, Antonio Balzanella, Sonia Dias and Javier Arroyo.

30 References

31 References
Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman and Hall
Arroyo, J., Maté, C. (2009). Forecasting histogram time series with k-nearest neighbours methods. International Journal of Forecasting 25 (1),
Arroyo, J., Maté, C., Muñoz San Roque, A. (2011). Smoothing Methods for Histogram-valued Time Series. An application to Value-at-Risk. Statistical Analysis and Data Mining 4,
Bertrand, P., Goupil, F. (2000). Descriptive statistics for symbolic data. In: Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer, pp
Billard, L., Diday, E. (2006). Symbolic data analysis: conceptual statistics and data mining. Wiley
Brito, P. (2012). Beyond summaries of individual data: Analyzing distributions. In Symposium on Learning and Data Science,

32 References
Brito, P. and Chavent, M. (2012). Divisive Monothetic Clustering for Interval and Histogram-Valued Data. In: Proc. ICPRAM st International Conference on Pattern Recognition Applications and Methods
Brito, P. and Ichino, M. (2011). Conceptual Clustering of Symbolic Data Using a Quantile Representation: Discrete and Continuous Approaches. In: Proc. Workshop on Theory and Application of High-dimensional Complex and Symbolic Data Analysis in Economics and Management Science
Delicado, P. (2007). Functional k-sample problem when data are density functions. Computational Statistics 22 (3),
Delicado, P. (2011). Dimensionality reduction when data are density functions. Computational Statistics and Data Analysis 55 (1),
Delicado, P. and del Río, M. (2003). A Generalization of Histogram Type Estimators. Journal of Nonparametric Statistics 15 (1),

33 References
Dias, S., Brito, P. (2013). Distribution and Symmetric Distribution Regression Model for Histogram-Valued Variables. arxiv:
Egozcue, J.J., Díaz-Barrero, J.L., Pawlowsky-Glahn, V. (2006). Hilbert space of probability density functions based on Aitchison geometry. Acta Mathematica Sinica 22,
Fink, E., Sarin, A., Carbonell, J. (2009). Analysis of uncertain data: Smoothing of histograms. Proc. of the IEEE International Conference on Systems, Man and Cybernetics, pp
Ichino, M. (2011). The quantile method for symbolic principal component analysis. Statistical Analysis and Data Mining 4,
Irpino, A., Verde, R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In Data Science and Classification, pp

34 References
Irpino, A., Verde, R. (2012). Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance. arxiv v3
Irpino, A., Verde, R. (2014). Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification
Jones, M.C., Marron, J.S., Sheather, S.J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91 (433),
Kneip, A., Utikal, K. (2001). Inference for density families using functional principal component analysis. Journal of the American Statistical Association 96,
Nagabhushan, P., Pradeep Kumar, R. (2007). Histogram PCA. In Proc. of the 4th International Symposium on Neural Networks, pp
Ramsay, J.O., Silverman, B.W. (1997). Functional Data Analysis. Springer
Rivoli, L., Irpino, A., Verde, R. (2012). The median of a set of histogram data. In XLVI meeting of the Italian Statistical Society

35 References
Rodriguez, O., Diday, E., Winsberg, S. (2000). Generalization of the Principal Components Analysis to Histogram Data. In: Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Data Bases
Schweizer, B. (1984). Distributions are the numbers of the future. In Proceedings of the mathematics of fuzzy systems meeting (Naples, Italy), pp
Sharma, A. et al. (2013). Spatiotemporal modeling of discrete-time distribution-valued data applied to DTI tract evolution in infant neurodevelopment. In Proc. IEEE International Symposium Biomedical Imaging. pp
Simonoff, J.S. (1998). Smoothing Methods in Statistics. Springer
Sopan, A., Freire, M., Taieb-Maimon, M., Plaisant, C., Golbeck, J., Shneiderman, B. (2013). Exploring data distributions: Visual design and evaluation. International Journal of Human-Computer Interaction 29 (2),
Wand, M.P. (1997). Data-based choice of histogram bin width. The American Statistician 51 (1),


Autocorrelation function of the daily histogram time series of SP500 intradaily returns

Autocorrelation function of the daily histogram time series of SP500 intradaily returns Autocorrelation function of the daily histogram time series of SP5 intradaily returns Gloria González-Rivera University of California, Riverside Department of Economics Riverside, CA 9252 Javier Arroyo

More information

Adaptive Nonparametric Density Estimators

Adaptive Nonparametric Density Estimators Adaptive Nonparametric Density Estimators by Alan J. Izenman Introduction Theoretical results and practical application of histograms as density estimators usually assume a fixed-partition approach, where

More information

Descriptive Univariate Statistics and Bivariate Correlation

Descriptive Univariate Statistics and Bivariate Correlation ESC 100 Exploring Engineering Descriptive Univariate Statistics and Bivariate Correlation Instructor: Sudhir Khetan, Ph.D. Wednesday/Friday, October 17/19, 2012 The Central Dogma of Statistics used to

More information
