Distributions are the numbers of today: From histogram data to distributional data
Javier Arroyo Gallardo, Universidad Complutense de Madrid
2 Introduction
3 Symbolic data
- Symbolic data was introduced by Edwin Diday in 1987 to represent variability
- Symbolic variables make it possible to describe groups of individuals and concepts
- Symbolic variables include:
  - List-of-values variables (with or without weights)
  - Interval variables
  - Histogram variables
- Symbolic representations can include internal structure (hierarchies) and logical dependencies (rules)
4 Interval data
- Interval data is by far the most popular symbolic representation
- Interval data arises naturally in several contexts, for example meteorology and finance
- It is easier to propose methods for interval variables than for other symbolic variables
  - Intervals are represented by two values: minimum and maximum, or center and radius
- The truly symbolic methods are those that deal with the interval representation itself, not those that treat it as two classical variables
5 Histogram data
- Histogram data is the younger brother of interval data
  - The histogram is a statistical tool
  - Histogram data does not arise naturally
  - Histograms require the definition of parameters
- Fewer methods have been developed for histogram data, because the histogram representation is considerably more sophisticated than the interval one
- However, the situation has been changing in recent years, and histogram (and distributional) data is developing its potential
- Distributions are no longer "the numbers of the future" (Schweizer, 1984): they are the numbers of the present!
6 From histogram data to distributional data
7 From histogram data to distributional data
- Distributional variables make it possible to describe each unit of interest by means of its observed data distribution
- A distribution does not summarize the data by means of statistics such as the mean, variance, minimum and maximum. IT IS THE DATA!
- The distribution has to be represented in some way
  - The practitioner can focus on the representation that best fits the problem
- Two types of representations:
  - Binned density estimators, such as histograms
  - Smooth and continuous density estimators
8 Binned density estimators
- Binned density estimators offer the analyst a great variety of choices
- Histograms with a fixed bin width (classical histogram)
  - If an accurate representation is needed, there are rules to determine the optimal bin width (Wand, 1997)
- Equifrequency histograms (such as boxplots)
- Histograms defined on a given partition of the range of the variable of interest
  - Fink et al. (2009) propose a method to accurately estimate histograms with variable-width bars
- Histograms defined as a sequence of quantiles of specific interest
- The frequency polygon
  - Delicado and del Río (2003) propose a generalization of frequency polygons to accurately estimate distributions
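As a rough illustration of the first two choices above, the sketch below builds a fixed-width histogram and an equifrequency histogram from the same sample, using only the Python standard library. The function names and the simulated sample are assumptions made for this example and do not come from the slides; the bin count is picked by hand rather than by an optimal-width rule.

```python
import random

def fixed_width_histogram(data, n_bins):
    """Return (edges, relative frequencies) for n_bins equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    edges[-1] = hi  # guard against floating-point drift at the right end
    counts = [0] * n_bins
    for x in data:
        # The maximum sits on the last edge, so clamp it into the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(data)
    return edges, [c / n for c in counts]

def equifrequency_histogram(data, n_bins):
    """Return (edges, relative frequencies) with edges at sample quantiles."""
    xs = sorted(data)
    n = len(xs)
    edges = []
    for i in range(n_bins + 1):
        # Linear interpolation between neighbouring order statistics.
        pos = i * (n - 1) / n_bins
        k = int(pos)
        frac = pos - k
        nxt = xs[min(k + 1, n - 1)]
        edges.append(xs[k] + frac * (nxt - xs[k]))
    return edges, [1.0 / n_bins] * n_bins

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]
fw_edges, fw_freqs = fixed_width_histogram(sample, 10)
eq_edges, eq_freqs = equifrequency_histogram(sample, 4)
```

With four bins, the equifrequency edges are the minimum, the three quartiles and the maximum, which is the information a boxplot displays.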
9 Binned density estimators
- Binned density estimators have some problems too:
  - They are not smooth
  - They depend on the end points of the bins
  - They depend on the width of the bins
10 Smooth estimators
- If the analyst wants a smooth representation and the underlying distribution is known, parametric methods can be used
- Kernel density estimation methods (Simonoff, 1998) require:
  - A kernel: uniform, triangular, Epanechnikov, normal, etc.
  - The selection of the bandwidth of the kernel
    - As small as the data allow
    - Trade-off between the bias of the estimator and its variance
    - The value that minimizes the AMISE (Asymptotic Mean Integrated Squared Error) is usually chosen
    - Jones et al. (1996) survey methods to estimate the bandwidth
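The two ingredients listed above (a kernel and a bandwidth) can be sketched as follows. This is a minimal example, assuming a normal kernel and the normal-reference bandwidth h = (4 / (3n))^(1/5) * sigma, which minimizes the AMISE only when the data themselves are Gaussian; it is one simple choice among the selectors surveyed in the literature, not a recommendation from the slides. The function names and the sample are hypothetical.

```python
import math
import random
import statistics

def normal_reference_bandwidth(data):
    """AMISE-optimal bandwidth under the assumption of Gaussian kernel and Gaussian data."""
    sigma = statistics.stdev(data)
    return (4.0 / (3.0 * len(data))) ** 0.2 * sigma

def gaussian_kde(data, bandwidth):
    """Return a function x -> kernel density estimate at x (normal kernel)."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

random.seed(1)
sample = [random.gauss(0.0, 1.0) for _ in range(500)]
h = normal_reference_bandwidth(sample)
f_hat = gaussian_kde(sample, h)
```

A smaller bandwidth lowers the bias but raises the variance of the estimate, which is the trade-off the slide refers to.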
11 The quantile function
12 The quantile function
- The quantile function (QF) is the inverse of the cumulative distribution function
[Figure: three panels showing the density function, the cumulative distribution function and the quantile function]
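For a histogram, this inverse can be written down directly. The sketch below assumes that the mass is spread uniformly inside each bin, so the cumulative distribution function is piecewise linear and the quantile function is its piecewise-linear inverse. The function name and the toy histogram are illustrative assumptions.

```python
from bisect import bisect_left
from itertools import accumulate

def histogram_quantile_function(edges, freqs):
    """Return Q(t), t in [0, 1], for a histogram given by bin edges and relative frequencies."""
    cum = list(accumulate(freqs))  # CDF value at the right edge of each bin

    def quantile(t):
        if not 0.0 <= t <= 1.0:
            raise ValueError("t must lie in [0, 1]")
        # First bin whose cumulative frequency reaches t (clamped for rounding).
        i = min(bisect_left(cum, t), len(freqs) - 1)
        if freqs[i] == 0.0:
            return edges[i]
        prev = cum[i - 1] if i > 0 else 0.0
        # Invert the linear piece of the CDF inside bin i.
        return edges[i] + (t - prev) / freqs[i] * (edges[i + 1] - edges[i])

    return quantile

# Toy histogram: bins [0, 1), [1, 2), [2, 4] with masses 0.25, 0.5, 0.25.
Q = histogram_quantile_function([0.0, 1.0, 2.0, 4.0], [0.25, 0.5, 0.25])
median = Q(0.5)
```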
13 The quantile function
- The QF provides a common framework to represent data described by histograms, intervals, nominal multi-value types, etc. (Ichino, 2011)
- The QF is the conceptual underpinning of many of the methods proposed for histogram data
- The QF has some interesting features:
  - Its X-axis has a fixed range, [0,1]
  - The Wasserstein metrics for QFs make it possible to compute distribution-valued central tendency moments of distributional data, together with their associated dispersion measures
  - These distribution-valued moments are the basis of many methods for distributional data, e.g. methods based on the concept of average
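To make the last two features concrete, the sketch below evaluates quantile functions on a common grid of [0,1], computes the l2 Wasserstein distance as the root mean squared difference between them, and obtains an average distribution by averaging the quantile functions pointwise. The grid size, the helper names and the two normal distributions are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def quantile_vector(inv_cdf, m=200):
    """Evaluate a quantile function at the midpoints of m equal cells of [0, 1]."""
    return [inv_cdf((k + 0.5) / m) for k in range(m)]

def wasserstein_l2(qa, qb):
    """Approximate l2 Wasserstein distance between two distributions from their quantile vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(qa, qb)) / len(qa))

def mean_distribution(quantile_vectors):
    """Average distribution in the l2 Wasserstein sense: pointwise mean of quantile functions."""
    n = len(quantile_vectors)
    return [sum(column) / n for column in zip(*quantile_vectors)]

qa = quantile_vector(NormalDist(1.0, 1.0).inv_cdf)
qb = quantile_vector(NormalDist(3.0, 1.0).inv_cdf)
dist = wasserstein_l2(qa, qb)
q_mean = mean_distribution([qa, qb])
```

Because the two quantile functions differ by the constant 2, the distance is 2 and the average has the quantile function of N(2,1): the average of two distributions keeps their shape, which a pointwise average of densities would not do.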
14 The catalogue of methods
15 Methods proposed for distributional data
- Brito (2012) provides a very nice survey on the topic
- The catalogue of methods for distributional data includes:
  - Descriptive statistics
  - Regression
  - Clustering
  - Dimensionality reduction
  - Time series forecasting
  - Visualization methods
16 Methods proposed for distributional data
- Descriptive statistics
  - Bertrand and Goupil (2000) propose real-valued univariate and bivariate statistics, extended by Billard and Diday (2006)
  - Irpino and Verde (2014) propose a new set of real-valued statistics based on the l2 Wasserstein distance
  - Rivoli et al. (2012) propose central tendency moments in distribution form based on the family of Wasserstein metrics
- Regression
  - Billard and Diday (2006) proposed the first model, based on real-valued univariate and bivariate statistics
  - Irpino and Verde (2012) and Dias and Brito (2013) propose regression models based on the QF and the l2 Wasserstein distance
17 Methods proposed for distributional data
- Clustering
  - There are many approaches to cluster this kind of data
  - The simple hierarchical models just need an appropriate distance
  - Irpino and Verde (2006) propose a dynamic clustering method based on the l2 Wasserstein distance, which averaged histograms for the first time
  - Brito and Ichino (2011) propose hierarchical clustering methods based on quantile representations of the different types of data
  - Brito and Chavent (2012) propose a divisive algorithm that works with interval and/or histogram-valued variables using appropriate distances
18 Methods proposed for distributional data
- Dimensionality reduction
  - Rodriguez et al. (2000) propose a Principal Component Analysis for histograms, where histograms are a succession of nested intervals
  - Nagabhushan and Pradeep Kumar (2007) propose a histogram arithmetic to extend PCA to a simplified version of histogram data (closer to compositional data)
  - Delicado (2011) extends PCA to density functions using functional data methods and tools from compositional data (Egozcue et al., 2006)
  - Ichino (2011) adapts Principal Component Analysis to work with symbolic data using the quantile representation
- Time series forecasting
  - Arroyo and Maté (2009) adapt the k-nearest neighbours method
  - Arroyo et al. (2011) adapt exponential smoothing methods
  - All these methods are based on Wasserstein metrics
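One way to picture a Wasserstein-based smoothing forecast is to apply simple exponential smoothing pointwise to quantile functions, since a weighted average of quantile functions is the weighted average of the distributions under the l2 Wasserstein metric. The sketch below is a hypothetical illustration of that idea under this assumption; it is not claimed to reproduce the method of the cited papers, and the function name, the smoothing parameter and the toy series are invented for the example.

```python
def exponential_smoothing_forecast(quantile_series, alpha):
    """One-step-ahead forecast for a time series of quantile vectors.

    Each element of quantile_series is a distribution observed at one time
    step, represented by its quantile function on a common grid of [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    forecast = list(quantile_series[0])  # initialise with the first observation
    for observed in quantile_series[1:]:
        # Convex combination of two non-decreasing vectors stays non-decreasing,
        # so the forecast is again a valid quantile function.
        forecast = [alpha * x + (1.0 - alpha) * f for x, f in zip(observed, forecast)]
    return forecast

series = [
    [0.0, 1.0, 2.0],  # quantiles at t = 0, 0.5, 1 for the first period
    [2.0, 3.0, 4.0],  # same grid, second period
]
next_period = exponential_smoothing_forecast(series, alpha=0.5)
```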
19 Methods proposed for distributional data
- Visualization methods
  - Sopan et al. (2013) propose methods to intuitively visualize distributional data sets
  - Distributions of distributions
20 How do methods work with both representations?
- Methods based on quantile functions and/or distances for quantile functions
  - They need no adaptation to work with both histograms and smooth distributions
- Methods based on bin representations
  - They are meant to deal with histogram representations
  - They can be adapted to work with smooth representations:
    - First, the smooth representation has to be estimated
    - Second, the smooth representation is transformed into a histogram representation with a sufficiently large number of bins
21 How do methods work with both representations?
- The smooth representation can be transformed into a histogram by using a sufficiently large number of bins
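The transformation described above can be sketched as follows, assuming the smooth representation is available through its cumulative distribution function and that truncating the support to a finite range is acceptable. The range [-5, 5], the number of bins and the function name are assumptions made for the example.

```python
from statistics import NormalDist

def smooth_to_histogram(cdf, lo, hi, n_bins):
    """Discretise a smooth distribution on [lo, hi] into n_bins equal-width bins."""
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    masses = [cdf(edges[i + 1]) - cdf(edges[i]) for i in range(n_bins)]
    total = sum(masses)  # slightly below 1 because the tails are cut off
    return edges, [m / total for m in masses]

smooth = NormalDist(0.0, 1.0)
edges, freqs = smooth_to_histogram(smooth.cdf, -5.0, 5.0, 100)
```

Any method written for histograms can then take `edges` and `freqs` as input; increasing `n_bins` reduces the discretisation error at the cost of more computation.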
22 How about the performance of this trick?
- l2 Wasserstein distance for histogram data
- Matlab 7 R2010 on a laptop PC (1.8 GHz, 2 cores, 8 GB RAM)
- Simulate two sets of 10^6 data points from N(1,1) and N(3,1)
- Estimate the two quantile histograms and measure the l2 Wasserstein distance between them (1000 times)

# bins   time (sec)   distance
Smooth   -            2

- The Matlab code could still be optimized
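A scaled-down version of this experiment can be sketched in Python. This is not the original Matlab code: the sample size is reduced from 10^6 to 20,000 points, each configuration is run once rather than 1000 times, and the bin counts are chosen arbitrarily, so the timings are not comparable with those measured for the slide. All names are assumptions.

```python
import math
import random
import time

def quantile_histogram(sorted_data, n_bins):
    """Bin edges at the empirical quantiles 0, 1/n_bins, ..., 1 (equal-mass bins)."""
    n = len(sorted_data)
    return [sorted_data[round(i * (n - 1) / n_bins)] for i in range(n_bins + 1)]

def wasserstein_l2(edges_a, edges_b):
    """l2 Wasserstein distance between two quantile histograms with equal bin counts.

    Both quantile functions are piecewise linear on the same grid of [0, 1],
    so the squared difference integrates in closed form on each cell.
    """
    n_bins = len(edges_a) - 1
    total = 0.0
    for i in range(n_bins):
        d0 = edges_a[i] - edges_b[i]
        d1 = edges_a[i + 1] - edges_b[i + 1]
        total += (d0 * d0 + d0 * d1 + d1 * d1) / (3.0 * n_bins)
    return math.sqrt(total)

random.seed(42)
N = 20_000  # reduced from the 10^6 points on the slide to keep the run short
sample_a = sorted(random.gauss(1.0, 1.0) for _ in range(N))
sample_b = sorted(random.gauss(3.0, 1.0) for _ in range(N))

results = []
for n_bins in (10, 100, 1000):
    start = time.perf_counter()
    qa = quantile_histogram(sample_a, n_bins)
    qb = quantile_histogram(sample_b, n_bins)
    distance = wasserstein_l2(qa, qb)
    results.append((n_bins, time.perf_counter() - start, distance))

for n_bins, seconds, distance in results:
    print(f"{n_bins:>6} bins  {seconds:.6f} s  distance = {distance:.4f}")
```

The quantile functions of N(1,1) and N(3,1) differ by the constant 2, so the theoretical distance is 2, in line with the "Smooth" row of the table; the estimated distances should approach that value up to sampling noise.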
23 Some contributions from other fields
24 Distributions as seen from other fields
- Compositional data (Aitchison, 1986) is a related field
  - Non-negative data with a constant sum that provide a quantitative description of the parts of some whole
  - Egozcue et al. (2006) say that density functions are infinite-dimensional compositional data
- Distributional data can be considered a particular case of functional data (Ramsay and Silverman, 1997)
- Extending functional methods to distributional data is not straightforward
  - Pointwise operations for functional data, such as linear combination, do not work
  - Some alternative ways to operate with distributions are needed:
    - Aitchison geometry (Aitchison, 1986): operations and distances
    - Convolution operators
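For a density discretised into a finite number of strictly positive parts, the basic operations and the distance of Aitchison geometry can be sketched as below: perturbation plays the role of addition, powering the role of scalar multiplication, and the distance is the Euclidean distance between centred log-ratio (clr) transforms. The function names and the three-part compositions are assumptions made for this example.

```python
import math

def closure(x):
    """Rescale positive parts so that they sum to 1."""
    s = sum(x)
    return [v / s for v in x]

def perturb(x, y):
    """Aitchison 'addition': component-wise product followed by closure."""
    return closure([a * b for a, b in zip(x, y)])

def power(alpha, x):
    """Aitchison 'scalar multiplication': component-wise power followed by closure."""
    return closure([v ** alpha for v in x])

def clr(x):
    """Centred log-ratio transform."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

x = [0.1, 0.2, 0.7]
y = [0.3, 0.3, 0.4]
z = [0.5, 0.25, 0.25]
neutral = perturb(x, power(-1.0, x))  # x "minus" x
d_xy = aitchison_distance(x, y)
d_shifted = aitchison_distance(perturb(x, z), perturb(y, z))
```

Unlike a pointwise linear combination, which can produce negative values or a total different from 1, these operations always return a valid composition. Here `neutral` is the uniform composition, and the distance is unchanged when both compositions are perturbed by the same `z`.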
25 Applications everywhere!
26 Applications everywhere!
- Image analysis
  - Images are represented by histograms
  - Applications in fields such as precision agriculture or medicine (Sharma et al., 2013)
- Sensor and radar data
  - Spatial and temporal aggregation
  - For example, river level data and cloud sensor data
  - Wind power production: aggregation of the generators in a wind farm
- Finance and economics
  - Contemporaneous and temporal aggregation of financial returns
  - Applications in portfolio management and Value at Risk (Arroyo et al., 2011)
27 Applications everywhere!
- Official statistics (distributions of variables in the population)
  - Population pyramids (Delicado, 2011)
  - Income distributions (Kneip and Utikal, 2001; Delicado, 2007)
- Rating distributions
  - Recommendation systems and rating websites (Sopan et al., 2013)
  - Trust and reputation systems
- Opportunities in big data
  - SDA 2014 & COMPSTAT 2014 sessions!
- And in many other fields
  - E.g. social network data and computer logs (web, search engines, etc.)
- If a mean or a sample is used as an aggregation tool to make the data manageable, then a distribution can be used instead
28 Conclusions
- Distributional data is now a mature field in terms of theory
  - Homogeneous conceptual underpinning
  - Diverse catalogue of methods
  - High applicability to real-life problems
- However, it is barely known and barely used outside the symbolic family
- So, we have a mission:
  - Spread the word
  - Propose interesting applications
- Possible synergies with other fields, such as functional data or compositional data
29 How can I analyze my distributional data?
- There is a plan to develop a toolbox for distributional data
  - Matlab, Octave and R
  - Open source
  - For binned and smooth representations
- To include:
  - Distribution estimation and visualization
  - Descriptive statistics
  - Clustering
  - Regression
  - Time series forecasting
  - Eventually, all the methods proposed
- People involved: Antonio Irpino, Antonio Balzanella, Sonia Dias and Javier Arroyo
30 References
- Aitchison, J. (1986). The Statistical Analysis of Compositional Data. Chapman and Hall
- Arroyo, J., Maté, C. (2009). Forecasting histogram time series with k-nearest neighbours methods. International Journal of Forecasting 25 (1),
- Arroyo, J., Maté, C., Muñoz San Roque, A. (2011). Smoothing Methods for Histogram-valued Time Series. An application to Value-at-Risk. Statistical Analysis and Data Mining 4,
- Bertrand, P., Goupil, F. (2000). Descriptive statistics for symbolic data. In: Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer, pp
- Billard, L., Diday, E. (2006). Symbolic data analysis: conceptual statistics and data mining. Wiley
- Brito, P. (2012). Beyond summaries of individual data: Analyzing distributions. In Symposium on Learning and Data Science,
- Brito, P., Chavent, M. (2012). Divisive Monothetic Clustering for Interval and Histogram-Valued Data. In: Proc. ICPRAM st International Conference on Pattern Recognition Applications and Methods
- Brito, P., Ichino, M. (2011). Conceptual Clustering of Symbolic Data Using a Quantile Representation: Discrete and Continuous Approaches. In: Proc. Workshop on Theory and Application of High-dimensional Complex and Symbolic Data Analysis in Economics and Management Science
- Delicado, P. (2007). Functional k-sample problem when data are density functions. Computational Statistics 22 (3),
- Delicado, P. (2011). Dimensionality reduction when data are density functions. Computational Statistics and Data Analysis 55 (1),
- Delicado, P., del Río, M. (2003). A Generalization of Histogram Type Estimators. Journal of Nonparametric Statistics 15 (1),
- Dias, S., Brito, P. (2013). Distribution and Symmetric Distribution Regression Model for Histogram-Valued Variables. arXiv:
- Egozcue, J.J., Díaz-Barrero, J.L., Pawlowsky-Glahn, V. (2006). Hilbert space of probability density functions based on Aitchison geometry. Acta Mathematica Sinica 22,
- Fink, E., Sarin, A., Carbonell, J. (2009). Analysis of uncertain data: Smoothing of histograms. Proc. of the IEEE International Conference on Systems, Man and Cybernetics, pp
- Ichino, M. (2011). The quantile method for symbolic principal component analysis. Statistical Analysis and Data Mining 4,
- Irpino, A., Verde, R. (2006). A new Wasserstein based distance for the hierarchical clustering of histogram symbolic data. In Data Science and Classification, pp
- Irpino, A., Verde, R. (2012). Linear regression for numeric symbolic variables: a least squares approach based on Wasserstein Distance. arXiv v3
- Irpino, A., Verde, R. (2014). Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification
- Jones, M.C., Marron, J.S., Sheather, S.J. (1996). A brief survey of bandwidth selection for density estimation. Journal of the American Statistical Association 91 (433),
- Kneip, A., Utikal, K. (2001). Inference for density families using functional principal component analysis. Journal of the American Statistical Association 96,
- Nagabhushan, P., Pradeep Kumar, R. (2007). Histogram PCA. In Proc. of the 4th International Symposium on Neural Networks, pp
- Ramsay, J.O., Silverman, B.W. (1997). Functional Data Analysis. Springer
- Rivoli, L., Irpino, A., Verde, R. (2012). The median of a set of histogram data. In XLVI meeting of the Italian Statistical Society
- Rodriguez, O., Diday, E., Winsberg, S. (2000). Generalization of the Principal Components Analysis to Histogram Data. In: Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Data Bases
- Schweizer, B. (1984). Distributions are the numbers of the future. In Proceedings of the mathematics of fuzzy systems meeting (Naples, Italy), pp
- Sharma, A. et al. (2013). Spatiotemporal modeling of discrete-time distribution-valued data applied to DTI tract evolution in infant neurodevelopment. In Proc. IEEE International Symposium Biomedical Imaging. pp
- Simonoff, J.S. (1998). Smoothing Methods in Statistics. Springer
- Sopan, A., Freire, M., Taieb-Maimon, M., Plaisant, C., Golbeck, J., Shneiderman, B. (2013). Exploring data distributions: Visual design and evaluation. International Journal of Human-Computer Interaction 29 (2),
- Wand, M.P. (1997). Data-based choice of histogram bin width. The American Statistician 51 (1),
More informationMallows L 2 Distance in Some Multivariate Methods and its Application to Histogram-Type Data
Metodološki zvezki, Vol. 9, No. 2, 212, 17-118 Mallows L 2 Distance in Some Multivariate Methods and its Application to Histogram-Type Data Katarina Košmelj 1 and Lynne Billard 2 Abstract Mallows L 2 distance
More informationSampling: A Brief Review. Workshop on Respondent-driven Sampling Analyst Software
Sampling: A Brief Review Workshop on Respondent-driven Sampling Analyst Software 201 1 Purpose To review some of the influences on estimates in design-based inference in classic survey sampling methods
More informationChoosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation
Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation COMPSTAT 2010 Revised version; August 13, 2010 Michael G.B. Blum 1 Laboratoire TIMC-IMAG, CNRS, UJF Grenoble
More informationChapter 9. Non-Parametric Density Function Estimation
9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least
More informationECLT 5810 Data Preprocessing. Prof. Wai Lam
ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate
More informationA PROBABILITY DENSITY FUNCTION ESTIMATION USING F-TRANSFORM
K Y BERNETIKA VOLUM E 46 ( 2010), NUMBER 3, P AGES 447 458 A PROBABILITY DENSITY FUNCTION ESTIMATION USING F-TRANSFORM Michal Holčapek and Tomaš Tichý The aim of this paper is to propose a new approach
More information3 Joint Distributions 71
2.2.3 The Normal Distribution 54 2.2.4 The Beta Density 58 2.3 Functions of a Random Variable 58 2.4 Concluding Remarks 64 2.5 Problems 64 3 Joint Distributions 71 3.1 Introduction 71 3.2 Discrete Random
More informationMethodological Concepts for Source Apportionment
Methodological Concepts for Source Apportionment Peter Filzmoser Institute of Statistics and Mathematical Methods in Economics Vienna University of Technology UBA Berlin, Germany November 18, 2016 in collaboration
More informationPrerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3
University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationIntensity Analysis of Spatial Point Patterns Geog 210C Introduction to Spatial Data Analysis
Intensity Analysis of Spatial Point Patterns Geog 210C Introduction to Spatial Data Analysis Chris Funk Lecture 4 Spatial Point Patterns Definition Set of point locations with recorded events" within study
More informationLQ-Moments for Statistical Analysis of Extreme Events
Journal of Modern Applied Statistical Methods Volume 6 Issue Article 5--007 LQ-Moments for Statistical Analysis of Extreme Events Ani Shabri Universiti Teknologi Malaysia Abdul Aziz Jemain Universiti Kebangsaan
More informationDIMENSION REDUCTION AND CLUSTER ANALYSIS
DIMENSION REDUCTION AND CLUSTER ANALYSIS EECS 833, 6 March 2006 Geoff Bohling Assistant Scientist Kansas Geological Survey geoff@kgs.ku.edu 864-2093 Overheads and resources available at http://people.ku.edu/~gbohling/eecs833
More informationKernel density estimation in R
Kernel density estimation in R Kernel density estimation can be done in R using the density() function in R. The default is a Guassian kernel, but others are possible also. It uses it s own algorithm to
More informationProbabilistic Energy Forecasting
Probabilistic Energy Forecasting Moritz Schmid Seminar Energieinformatik WS 2015/16 ^ KIT The Research University in the Helmholtz Association www.kit.edu Agenda Forecasting challenges Renewable energy
More informationInstitute of Actuaries of India
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2018 Examinations Subject CT3 Probability and Mathematical Statistics Core Technical Syllabus 1 June 2017 Aim The
More informationAkaike Information Criterion to Select the Parametric Detection Function for Kernel Estimator Using Line Transect Data
Journal of Modern Applied Statistical Methods Volume 12 Issue 2 Article 21 11-1-2013 Akaike Information Criterion to Select the Parametric Detection Function for Kernel Estimator Using Line Transect Data
More informationModified Kolmogorov-Smirnov Test of Goodness of Fit. Catalonia-BarcelonaTECH, Spain
152/304 CoDaWork 2017 Abbadia San Salvatore (IT) Modified Kolmogorov-Smirnov Test of Goodness of Fit G.S. Monti 1, G. Mateu-Figueras 2, M. I. Ortego 3, V. Pawlowsky-Glahn 2 and J. J. Egozcue 3 1 Department
More informationTime Series and Forecasting Lecture 4 NonLinear Time Series
Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations
More informationMachine Learning. Nonparametric Methods. Space of ML Problems. Todo. Histograms. Instance-Based Learning (aka non-parametric methods)
Machine Learning InstanceBased Learning (aka nonparametric methods) Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Non parametric CSE 446 Machine Learning Daniel Weld March
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationRegression Methods for Spatially Extending Traffic Data
Regression Methods for Spatially Extending Traffic Data Roberto Iannella, Mariano Gallo, Giuseppina de Luca, Fulvio Simonelli Università del Sannio ABSTRACT Traffic monitoring and network state estimation
More informationLocal Polynomial Modelling and Its Applications
Local Polynomial Modelling and Its Applications J. Fan Department of Statistics University of North Carolina Chapel Hill, USA and I. Gijbels Institute of Statistics Catholic University oflouvain Louvain-la-Neuve,
More informationProximity Measures for Data Described By Histograms Misure di prossimità per dati descritti da istogrammi Antonio Irpino 1, Yves Lechevallier 2
Proximity Measures for Data Described By Histograms Misure di prossimità per dati descritti da istogrammi Antonio Irpino 1, Yves Lechevallier 2 1 Dipartimento di Studi Europei e Mediterranei, Seconda Università
More informationData Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur
Data Mining Prof. Pabitra Mitra Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Lecture - 17 K - Nearest Neighbor I Welcome to our discussion on the classification
More information2/2/2015 GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY MEASURES OF CENTRAL TENDENCY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS
Spring 2015: Lembo GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS Descriptive statistics concise and easily understood summary of data set characteristics
More informationMS-E2112 Multivariate Statistical Analysis (5cr) Lecture 1: Introduction, Multivariate Location and Scatter
MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 1:, Multivariate Location Contents , pauliina.ilmonen(a)aalto.fi Lectures on Mondays 12.15-14.00 (2.1. - 6.2., 20.2. - 27.3.), U147 (U5) Exercises
More informationI L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Introduction Edps/Psych/Stat/ 584 Applied Multivariate Statistics Carolyn J Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN c Board of Trustees,
More informationNeural network time series classification of changes in nuclear power plant processes
2009 Quality and Productivity Research Conference Neural network time series classification of changes in nuclear power plant processes Karel Kupka TriloByte Statistical Research, Center for Quality and
More informationMSCBD 5002/IT5210: Knowledge Discovery and Data Minig
MSCBD 5002/IT5210: Knowledge Discovery and Data Minig Instructor: Lei Chen Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei and
More informationNew models for symbolic data analysis
New models for symbolic data analysis B. Beranger, H. Lin and S. A. Sisson arxiv:1809.03659v1 [stat.co] 11 Sep 2018 Abstract Symbolic data analysis (SDA) is an emerging area of statistics based on aggregating
More informationTable of Contents. Multivariate methods. Introduction II. Introduction I
Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation
More informationON INTERVAL ESTIMATING REGRESSION
ON INTERVAL ESTIMATING REGRESSION Marcin Michalak Institute of Informatics, Silesian University of Technology, Gliwice, Poland Marcin.Michalak@polsl.pl ABSTRACT This paper presents a new look on the well-known
More informationGeneralization of the Principal Components Analysis to Histogram Data
Generalization of the Principal Components Analysis to Histogram Data Oldemar Rodríguez 1, Edwin Diday 1, and Suzanne Winsberg 2 1 University Paris 9 Dauphine, Ceremade Pl Du Ml de L de Tassigny 75016
More information