How to cope with modelling and privacy concerns? A regression model and a visualization tool for aggregated data


1 Rosanna Verde, Antonio Irpino, Dominique Desbois. Second University of Naples, Dept. of Political Sciences "J. Monnet". 1 of 30

2 Motivations and aims of the talk
Motivation: Regulations governing official statistical institutes do not allow the release of microdata, for privacy reasons. In general, it is easier to obtain aggregated data on a set of individuals. Most modelling tools in statistics (e.g., regression) work on microdata and cannot easily be extended to macrodata.
Methods: In this talk, we show the use of a regression method developed for the case where both the explanatory and the response variables have quantile distributions as observations. A PCA method on quantile data is then used to visualize relationships between the predicted distributions.
Application: The analysis was performed on a dataset of economic indicators related to the specific cost of agricultural products in the French regions.

3 DATA
The data come from RICA, the French Farm Accounting Data Network (FADN), aggregated into the 22 metropolitan regions of France.
CODE  REGION                CODE  REGION
121   Île de France         162   Pays de la Loire
131   Champagne-Ardenne     163   Bretagne
132   Picardie              164   Poitou-Charentes
133   Haute-Normandie       182   Aquitaine
134   Centre                183   Midi-Pyrénées
135   Basse-Normandie       184   Limousin
136   Bourgogne             192   Rhône-Alpes
141   Nord-Pas-de-Calais    193   Auvergne
151   Lorraine              201   Languedoc-Roussillon
152   Alsace                203   Provence-Alpes-Côte d'Azur
153   Franche-Comté         204   Corse

4 Economic indicators available for each region
- Y_TSC: the Total Specific Cost (TSC) of farm holdings;
- X_WHEAT: the wheat output variable;
- X_PIG: the pig output variable;
- X_MILKC: the cow milk output variable.
The available data:
- For each variable, each region is described by the vector of the estimates of the 10 deciles of the distribution observed in that region.
Information not available (for privacy reasons):
- Raw data: for each farm, we do not know the values of the four variables.
- The association structure within each region.
- The number of farms observed in each region.
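As a rough illustration of how a vector of quantile estimates can stand in for a distribution, the sketch below interpolates linearly between the known quantiles, which amounts to assuming a uniform density inside each histogram bin. The function name `quantile_at`, the probability grid and the numeric values are hypothetical; they are not the RICA data, and the talk does not state which interpolation was used.

```python
def quantile_at(levels, values, t):
    """Piecewise-linear quantile function Q(t) built from known quantiles.

    levels: increasing probability levels, e.g. 0.0, 0.1, ..., 1.0 (assumed grid)
    values: the quantile estimates at those levels (non-decreasing)
    """
    if len(levels) != len(values) or len(levels) < 2:
        raise ValueError("levels and values must have the same length >= 2")
    if not levels[0] <= t <= levels[-1]:
        raise ValueError("t is outside the range of known levels")
    for k in range(len(levels) - 1):
        t0, t1 = levels[k], levels[k + 1]
        if t0 <= t <= t1:
            v0, v1 = values[k], values[k + 1]
            # Linear interpolation = uniform density within the bin [v0, v1].
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("levels must be increasing")


# Hypothetical region: minimum, nine deciles and maximum of one indicator.
levels = [k / 10 for k in range(11)]
values = [10.0 * k for k in range(11)]
q25 = quantile_at(levels, values, 0.25)  # value between the 2nd and 3rd decile
```

Only the quantile vector is needed here, which matches the privacy setting: no farm-level record is ever reconstructed.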

5 An example of a row of the data table (Bretagne): for each of Y_TSC, X_Wheat, X_Pig and X_Cmilk, the slide shows the CDF plot, the histogram and the smoothed histogram.

6 The data table: CDFs (cumulative distribution functions) and corresponding histograms

7 A first research question
Is it possible to predict Y_TSC from the other variables? Classic regression methods cannot be used with this kind of data.
Proposal: we may use histogram-valued data analysis, and in particular the regression for quantile functions (Verde-Irpino regression). Each quantile function is associated with a distribution. Verde-Irpino regression is a novel method for the regression analysis of distributional data.

8 A regression model for histogram variables based on Wasserstein distance

9 A regression model for histogram data
Data = Model Fit + Residual
Linear regression is a general method for estimating and describing the association between a continuous outcome (dependent) variable and one or more predictors in a single equation. This is a conceptually easy task with classic data. But what does it mean when dealing with histogram data?
(Figure: example histograms with their bin weights.)
Related work: Billard, Diday, IFCS 2006; Verde, Irpino, COMPSTAT 2010 and CLADAG 2011; Dias, Brito, ISI 2011.

10 Linear regression model for histogram data (Verde, Irpino, 2013)
Given a histogram variable X, we search for a linear transformation of X that allows us to predict the histogram variable Y.
For example: given the histogram of Y_TSC observed in a region, is it possible to predict the distribution of Y_TSC using a linear combination of the predictor histogram variables?

11 Multiple regression model for quantile functions
Our concurrent multiple regression model is:
$$y_i(t) = \beta_0 + \sum_{j=1}^{p} \beta_j\, x_{ij}(t) + \varepsilon_i(t)$$
In matrix notation:
$$Y(t) = X(t)\,\beta + \varepsilon(t)$$
Here $y_i(t)$ and $x_{ij}(t)$ are the quantile functions associated with histogram/distribution data.
This formulation is analogous to the functional linear model (Ramsay, Silverman, 2003), except that the β parameters are constant and the functions $y_i(t)$, $x_{ij}(t)$ are quantile functions, while each $\varepsilon_i(t)$ is a residual function (distribution?), for all $i = 1, \ldots, n$.

12 Parameter estimation: LS method using the Wasserstein distance
Given the nature of the variables, for parameter estimation we propose to extend the Least Squares principle to the functional case, using a typical metric between quantile functions.
Squared error based on the $\ell_2$ Wasserstein distance between two quantile functions:
$$\varepsilon_i^2\big(y_i(t), \hat{y}_i(t)\big) = \int_0^1 \Big[\, y_i(t) - \beta_0 - \sum_{j=1}^{p} \beta_j\, x_{ij}(t) \Big]^2 dt$$
(Squared) $\ell_2$ Wasserstein distance between two quantile functions:
$$d_W^2(x_i, x_j) = \int_0^1 \big[ F_i^{-1}(t) - F_j^{-1}(t) \big]^2 dt$$
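The integral in the distance above can be approximated numerically once both quantile functions are sampled at the same probability levels. The sketch below is a minimal illustration using the trapezoid rule on an equally spaced grid over [0, 1]; the function name `wasserstein2_squared` and the sample values are hypothetical, and the grid is an assumption rather than something stated in the talk.

```python
def wasserstein2_squared(q1, q2):
    """Approximate squared L2 Wasserstein distance between two distributions.

    q1, q2: quantile functions sampled on the SAME equally spaced grid of
    probability levels covering [0, 1] (assumption of this sketch).
    """
    if len(q1) != len(q2) or len(q1) < 2:
        raise ValueError("both quantile vectors need the same length >= 2")
    sq_diff = [(a - b) ** 2 for a, b in zip(q1, q2)]
    h = 1.0 / (len(q1) - 1)  # grid step on [0, 1]
    # Trapezoid rule: full weight inside, half weight at the two end points.
    return h * (sum(sq_diff) - 0.5 * (sq_diff[0] + sq_diff[-1]))


# Two hypothetical distributions that differ only by a shift of 1 unit:
# the squared distance then equals the squared shift.
d2 = wasserstein2_squared([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])
```

A pure location shift contributes only through the difference of the means, which is the property the two-component model later exploits.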

13 Fitting the linear regression model
Find a linear transformation of the quantile functions of $x_{ij}$ (for $j = 1, \ldots, p$) in order to predict the quantile function of $y_i$, i.e.:
$$\hat{y}_i(t) = \beta_0 + \sum_{j=1}^{p} \beta_j\, x_{ij}(t), \quad \forall t \in [0, 1]$$
The linear transformation is unique: the parameters $\beta_0$ and $\beta_j$ are estimated for all the $x_{ij}$ and $y_i$ distributions.
A first problem: only if $\beta_j > 0$ can a quantile function $\hat{y}_i(t)$ be derived.
In order to overcome this problem, we propose a solution based on the decomposition of the Wasserstein distance and on the NNLS (non-negative least squares) algorithm.
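The sign problem mentioned above is easy to see on a toy example: a quantile function must be non-decreasing in t, and a negative coefficient can reverse the order of the predicted quantiles. The following sketch uses hypothetical helper names and made-up quantile vectors; it only illustrates the constraint and is not the authors' estimation procedure.

```python
def is_quantile_function(values, tol=1e-12):
    """A sampled quantile function is valid if it never decreases."""
    return all(b >= a - tol for a, b in zip(values, values[1:]))


def linear_combination(beta0, betas, xs):
    """Pointwise beta0 + sum_j beta_j * x_j(t) on a common grid of t."""
    m = len(xs[0])
    return [beta0 + sum(b * x[k] for b, x in zip(betas, xs)) for k in range(m)]


# Two hypothetical predictor quantile functions on the same grid.
x1 = [0.0, 1.0, 2.0, 4.0]
x2 = [1.0, 1.5, 2.0, 2.5]

ok = linear_combination(1.0, [0.5, 2.0], [x1, x2])    # all slopes positive
bad = linear_combination(1.0, [0.5, -2.0], [x1, x2])  # one negative slope

valid_ok = is_quantile_function(ok)    # True: order of quantiles preserved
valid_bad = is_quantile_function(bad)  # False: predicted quantiles cross
```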

14 OLS estimate (Irpino and Verde, 2012)
The quantile function can be decomposed as:
$$x_{ij}(t) = \bar{x}_{ij} + x_{ij}^{c}(t)$$
where $x_{ij}^{c}(t) = x_{ij}(t) - \bar{x}_{ij}$ is the centered quantile function.
Then, we propose the following regression model:
$$y_i(t) = \underbrace{\beta_0 + \sum_{j=1}^{p} \beta_j\, \bar{x}_{ij} + \sum_{j=1}^{p} \gamma_j\, x_{ij}^{c}(t)}_{\hat{y}_i(t)} + \varepsilon_i(t), \quad 0 \le t \le 1$$
Using the Wasserstein distance, it is possible to set up an OLS method that returns the two sets of coefficients $(\beta_0, \beta_j;\ \gamma_j)$, under a positiveness constraint on $\gamma_j$.
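To make the two-component idea concrete, the sketch below fits the model for a single predictor (p = 1), a deliberate simplification. Because centered quantile functions integrate to zero, the squared error splits into a part on the means and a part on the centered functions, so β can be obtained by ordinary least squares on the means and γ by a one-dimensional least squares clipped at zero. With several predictors a proper NNLS solver would be needed for the γ part. All names and numbers are hypothetical, and the trapezoid rule on an equally spaced grid is an assumption.

```python
def trapz01(values):
    """Trapezoid-rule integral over [0, 1] on an equally spaced grid."""
    h = 1.0 / (len(values) - 1)
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


def split(q):
    """Decompose a sampled quantile function into (mean, centered function)."""
    mean = trapz01(q)
    return mean, [v - mean for v in q]


def fit_two_component(xs, ys):
    """Single-predictor fit of y_i(t) = b0 + b1*mean(x_i) + g*x_i^c(t)."""
    x_means, x_cent = zip(*(split(x) for x in xs))
    y_means, y_cent = zip(*(split(y) for y in ys))
    n = len(xs)
    mx, my = sum(x_means) / n, sum(y_means) / n
    # Component 1: ordinary least squares on the distribution means.
    sxx = sum((a - mx) ** 2 for a in x_means)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x_means, y_means))
    beta1 = sxy / sxx
    beta0 = my - beta1 * mx
    # Component 2: least squares on centered functions, constrained to g >= 0.
    num = sum(trapz01([a * b for a, b in zip(xc, yc)])
              for xc, yc in zip(x_cent, y_cent))
    den = sum(trapz01([a * a for a in xc]) for xc in x_cent)
    gamma = max(0.0, num / den)
    return beta0, beta1, gamma


# Hypothetical data generated exactly from b0 = 2, b1 = 3, g = 0.5.
xs = [[0.0, 1.0, 2.0], [2.0, 4.0, 6.0], [1.0, 2.0, 5.0]]
ys = []
for x in xs:
    m, c = split(x)
    ys.append([2.0 + 3.0 * m + 0.5 * v for v in c])

beta0, beta1, gamma = fit_two_component(xs, ys)
```

On this noise-free toy data the fit should recover the generating coefficients up to rounding error.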

15 Interpretation of the parameters
Regression parameters for the locations (means) of the distributions:
$$\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p \in \mathbb{R}$$
Shrinking factors for the variability:
$$\hat{\gamma}_1, \ldots, \hat{\gamma}_p \in \mathbb{R}^{+}$$
If $\hat{\gamma}_j > 1$ ($< 1$), the $\hat{y}_i$ histogram has a greater (smaller) variability than the $x_{ij}$ histogram.

16 Advantages of the regression on quantile functions
- The regression on quantile functions takes into account the whole distribution (described by the quantiles).
- It is more powerful than a classic regression on the means of the distributions because it also uses information about the sizes and shapes of the distributions.
- It differs from the well-known quantile regression, which requires all the microdata and estimates one quantile at a time (independently of the others). In that case the order among the estimated quantiles is not guaranteed.
- Our method works on aggregated data when microdata are not available, and estimates the quantiles using a single model.
- The method guarantees the natural order among the estimated quantiles.
- The method is less affected by outlying observations, and thus gives a more robust estimation of the tails of the distributions.

17 Regression results (only the first 19 regions are used; the last three have regressors equal to zero)
The estimated model (two-component form):
$$\widehat{Y\_TSC}_i(t) = \hat{\beta}_0 + \hat{\beta}_1\, \overline{X\_WHEAT}_i + \hat{\gamma}_1\, X\_WHEAT_i^{c}(t) + \hat{\beta}_2\, \overline{X\_PIG}_i + \hat{\gamma}_2\, X\_PIG_i^{c}(t) + \hat{\beta}_3\, \overline{X\_MILKC}_i + \hat{\gamma}_3\, X\_MILKC_i^{c}(t), \quad t \in [0, 1]$$
Goodness-of-fit indices:
- Root Mean Square Error (Verde & Irpino, 2013): RMSE = 7,238.2;
- Omega index (Dias & Brito, 2011): Ω = … (0 = worst fit, 1 = best fit);
- Pseudo R-squared (Verde & Irpino, 2013): PR² = … (0 = worst fit, 1 = best fit).

18 Plot of observed CDFs vs predicted CDFs (legend: Observed, Predicted)

19 Plot of observed vs predicted CDFs (zoom)

20 A visualization tool for distributions
Motivations: A distribution, being a function, is a high-dimensional object. The plots in the last two slides show that this kind of visualization is not very communicative: it is difficult to compare different distributions visually. We need a visualization tool that arranges the distributions graphically according to a similarity criterion.
A new visualization tool: Quantile PCA (Irpino and Verde, 2013). Once a fixed number of quantiles has been chosen, Quantile PCA performs a principal component analysis on a single distributional variable (a column of the data table).

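21 PCA of quantiles
The X matrix decomposed in Q-PCA: we fix a set of m quantiles. Each individual is represented by a sequence of m+1 ordered values (including the minimum value):
$$x_i = [\,\min(x_i),\ Q_{i,1},\ \ldots,\ Q_{i,j},\ \ldots,\ Q_{i,m-1},\ \max(x_i)\,]$$
$$X = \begin{bmatrix}
\min(x_1) & Q_{1,1} & \cdots & Q_{1,j} & \cdots & Q_{1,m-1} & \max(x_1) \\
\vdots & \vdots & & \vdots & & \vdots & \vdots \\
\min(x_i) & Q_{i,1} & \cdots & Q_{i,j} & \cdots & Q_{i,m-1} & \max(x_i) \\
\vdots & \vdots & & \vdots & & \vdots & \vdots \\
\min(x_N) & Q_{N,1} & \cdots & Q_{N,j} & \cdots & Q_{N,m-1} & \max(x_N)
\end{bmatrix}$$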

22 Average quantiles vector
The m+1 quantile column variables of the quantile matrix X are centered. The vector of average quantiles is
$$\bar{x} = [\,\overline{\min(x)},\ \bar{Q}_1,\ \ldots,\ \bar{Q}_j,\ \ldots,\ \bar{Q}_{m-1},\ \overline{\max(x)}\,]$$
and the centered matrix is
$$X_c = X - \mathbf{1}_N\, \bar{x}$$
with $\mathbf{1}_N$ the unit vector of N elements.
A PCA on the variance-covariance matrix of the quantiles is performed.
Note: the trace of the covariance matrix approximates a variance measure defined for a distributional variable.
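The centering and covariance steps above can be sketched in a few lines. The example below centers a small quantile matrix, builds its variance-covariance matrix, and extracts the first principal component by power iteration. The data are made up, the divisor N (rather than N-1) in the covariance is an assumption, and power iteration is used here only to keep the sketch dependency-free; it is not necessarily how the authors compute the decomposition.

```python
def center_columns(X):
    """Return (column means, centered matrix X_c = X - 1_N * mean vector)."""
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return means, [[v - m for v, m in zip(row, means)] for row in X]


def covariance(Xc):
    """Variance-covariance matrix of centered columns (divisor N, assumed)."""
    n, p = len(Xc), len(Xc[0])
    return [[sum(r[a] * r[b] for r in Xc) / n for b in range(p)]
            for a in range(p)]


def first_component(S, iters=500):
    """Leading eigenvalue and eigenvector of S by power iteration."""
    p = len(S)
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(S[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    eigval = sum(v[a] * sum(S[a][b] * v[b] for b in range(p))
                 for a in range(p))
    return eigval, v


# Hypothetical quantile matrix: 4 individuals, columns = (min, median, max).
X = [[0.0, 1.0, 2.0],
     [1.0, 2.0, 4.0],
     [2.0, 4.0, 8.0],
     [1.0, 3.0, 9.0]]

means, Xc = center_columns(X)
S = covariance(Xc)
trace = sum(S[k][k] for k in range(len(S)))  # total inertia of the quantiles
eigval, axis = first_component(S)
explained_share = eigval / trace             # inertia share of the first axis
```

In this toy matrix most of the inertia comes from the last column (the maximum), which mirrors the kind of right-tail effect a loading plot would reveal.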

23 Eigenvalues and explained inertia
Wasserstein-based variance of the variable Y_TSC = … ×10^8; trace of the quantile variance-covariance matrix = … ×10^8.
Table: eigenvalues E1 to E9 with inertia, % of explained inertia and cumulative % of explained inertia. Eigenvalue entries that survive, in decreasing order: 2.57E+08, 1.70E+07, 2.23E+06, 9.50E+05, 2.70E+05.
Figure: cumulative percentage of explained inertia by eigenvalue (E1 to E9).

24 The plot of variables: the Spanish-fan plot (labels: Median, Upper quantiles, Lower quantiles)
Comment: a large part of the variability is due to differences in the right tail (right-skewness).

25 Principal Component Analysis of the Y_TSC variable: first factorial plane

26 PCA of the Y_TSC variable: first factorial plane (distributions coloured according to their means; legend: Higher mean, Lower mean)

27 PCA of the Y_TSC variable: first factorial plane (distributions coloured according to their standard deviations; legend: Lower std, Higher std)

28 PCA of the Y_TSC variable: first factorial plane, a joint view
Comment: means and standard deviations seem slightly positively correlated; both are related to distributions with a heavy right tail.

29 Conclusions
In this talk, starting from aggregated data and without access to the microdata, we showed that it is possible to analyze, predict and display such summary structures using new tools from distributional-valued data analysis, defined in a space of univariate distributions equipped with the L2 Wasserstein metric.
A regression technique can work with this kind of data and provides accurate results that are interpretable, also by practitioners, when interpreting causal relationships.
A PCA on quantiles is a promising tool for a fast and easy visualization of the different features of the distributions.
An R package is going to be released in the next quarter.
As future work, a graphical analysis of predicted vs observed distribution-valued data can be introduced using more sophisticated factorial analysis (this is in progress).

30 Main references
1. Arroyo, J., Maté, C.: Forecasting histogram time series with k-nearest neighbors methods. International Journal of Forecasting 25 (1), (2009)
2. Arroyo, J., González-Rivera, G., Maté, C.: Forecasting with interval and histogram data. Some financial applications. Handbook of empirical economics and finance, (2010)
3. Dias, S., Brito, P.: A new linear regression model for histogram-valued variables. In: 58th ISI World Statistics Congress, Dublin, Ireland, URL: (2011)
4. Irpino, A., Verde, R.: Dimension reduction techniques for distributional symbolic data. In: SIS 2013 Statistical Conference Advances in Latent Variables - Methods, Models and Applications. URL: (2013)
5. Irpino, A., Verde, R.: Ordinary Least Squares for Histogram Data Based on Wasserstein Distance. In: COMPSTAT 2010, 19th Conference of IASC-ERS (Physica Verlag), pp (2010)
6. Irpino, A., Verde, R.: Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, doi: /s (in press, 2015)
7. Rüschendorf, L.: Wasserstein metric. In: Hazewinkel, M., Encyclopedia of Mathematics, Springer (2001)
8. Verde, R., Irpino, A.: Multiple Linear Regression for Histogram Data using Least Squares of Quantile Functions: a Two-components model. Revue des Nouvelles Technologies de L'Information, vol. RNTI-E-25, p (2013)

31 Thanks for listening. Antonio Irpino, Rosanna Verde, Dominique Desbois, November 25th


More information

Bayesian Semiparametric GARCH Models

Bayesian Semiparametric GARCH Models Bayesian Semiparametric GARCH Models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics xibin.zhang@monash.edu Quantitative Methods

More information

G E INTERACTION USING JMP: AN OVERVIEW

G E INTERACTION USING JMP: AN OVERVIEW G E INTERACTION USING JMP: AN OVERVIEW Sukanta Dash I.A.S.R.I., Library Avenue, New Delhi-110012 sukanta@iasri.res.in 1. Introduction Genotype Environment interaction (G E) is a common phenomenon in agricultural

More information

Bayesian Semiparametric GARCH Models

Bayesian Semiparametric GARCH Models Bayesian Semiparametric GARCH Models Xibin (Bill) Zhang and Maxwell L. King Department of Econometrics and Business Statistics Faculty of Business and Economics xibin.zhang@monash.edu Quantitative Methods

More information

CoDa-dendrogram: A new exploratory tool. 2 Dept. Informàtica i Matemàtica Aplicada, Universitat de Girona, Spain;

CoDa-dendrogram: A new exploratory tool. 2 Dept. Informàtica i Matemàtica Aplicada, Universitat de Girona, Spain; CoDa-dendrogram: A new exploratory tool J.J. Egozcue 1, and V. Pawlowsky-Glahn 2 1 Dept. Matemàtica Aplicada III, Universitat Politècnica de Catalunya, Barcelona, Spain; juan.jose.egozcue@upc.edu 2 Dept.

More information

Binary Choice Models Probit & Logit. = 0 with Pr = 0 = 1. decision-making purchase of durable consumer products unemployment

Binary Choice Models Probit & Logit. = 0 with Pr = 0 = 1. decision-making purchase of durable consumer products unemployment BINARY CHOICE MODELS Y ( Y ) ( Y ) 1 with Pr = 1 = P = 0 with Pr = 0 = 1 P Examples: decision-making purchase of durable consumer products unemployment Estimation with OLS? Yi = Xiβ + εi Problems: nonsense

More information

Diagnostics for Linear Models With Functional Responses

Diagnostics for Linear Models With Functional Responses Diagnostics for Linear Models With Functional Responses Qing Shen Edmunds.com Inc. 2401 Colorado Ave., Suite 250 Santa Monica, CA 90404 (shenqing26@hotmail.com) Hongquan Xu Department of Statistics University

More information

The General Linear Model (GLM)

The General Linear Model (GLM) he General Linear Model (GLM) Klaas Enno Stephan ranslational Neuromodeling Unit (NU) Institute for Biomedical Engineering University of Zurich & EH Zurich Wellcome rust Centre for Neuroimaging Institute

More information

Unit 10: Simple Linear Regression and Correlation

Unit 10: Simple Linear Regression and Correlation Unit 10: Simple Linear Regression and Correlation Statistics 571: Statistical Methods Ramón V. León 6/28/2004 Unit 10 - Stat 571 - Ramón V. León 1 Introductory Remarks Regression analysis is a method for

More information

Ridge Regression and Ill-Conditioning

Ridge Regression and Ill-Conditioning Journal of Modern Applied Statistical Methods Volume 3 Issue Article 8-04 Ridge Regression and Ill-Conditioning Ghadban Khalaf King Khalid University, Saudi Arabia, albadran50@yahoo.com Mohamed Iguernane

More information

Copula Regression RAHUL A. PARSA DRAKE UNIVERSITY & STUART A. KLUGMAN SOCIETY OF ACTUARIES CASUALTY ACTUARIAL SOCIETY MAY 18,2011

Copula Regression RAHUL A. PARSA DRAKE UNIVERSITY & STUART A. KLUGMAN SOCIETY OF ACTUARIES CASUALTY ACTUARIAL SOCIETY MAY 18,2011 Copula Regression RAHUL A. PARSA DRAKE UNIVERSITY & STUART A. KLUGMAN SOCIETY OF ACTUARIES CASUALTY ACTUARIAL SOCIETY MAY 18,2011 Outline Ordinary Least Squares (OLS) Regression Generalized Linear Models

More information

STK4900/ Lecture 5. Program

STK4900/ Lecture 5. Program STK4900/9900 - Lecture 5 Program 1. Checking model assumptions Linearity Equal variances Normality Influential observations Importance of model assumptions 2. Selection of predictors Forward and backward

More information

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises

LINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises LINEAR REGRESSION ANALYSIS MODULE XVI Lecture - 44 Exercises Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Exercise 1 The following data has been obtained on

More information

The geography of the French creative class: An exploratory spatial data analysis

The geography of the French creative class: An exploratory spatial data analysis The geography of the French creative class: An exploratory spatial data analysis Very First Draft, please do not quote Sébastien Chantelot 1 Stéphanie Pérès 2 Stéphane Virol 3 Abstract This paper analyses

More information

Comparative Efficiency of Lactation Curve Models Using Irish Experimental Dairy Farms Data

Comparative Efficiency of Lactation Curve Models Using Irish Experimental Dairy Farms Data Comparative Efficiency of Lactation Curve Models Using Irish Experimental Dairy Farms Data Fan Zhang¹, Michael D. Murphy¹ 1. Department of Process, Energy and Transport, Cork Institute of Technology, Ireland.

More information

Regression: Ordinary Least Squares

Regression: Ordinary Least Squares Regression: Ordinary Least Squares Mark Hendricks Autumn 2017 FINM Intro: Regression Outline Regression OLS Mathematics Linear Projection Hendricks, Autumn 2017 FINM Intro: Regression: Lecture 2/32 Regression

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Expression Data Exploration: Association, Patterns, Factors & Regression Modelling

Expression Data Exploration: Association, Patterns, Factors & Regression Modelling Expression Data Exploration: Association, Patterns, Factors & Regression Modelling Exploring gene expression data Scale factors, median chip correlation on gene subsets for crude data quality investigation

More information

Accounting for measurement uncertainties in industrial data analysis

Accounting for measurement uncertainties in industrial data analysis Accounting for measurement uncertainties in industrial data analysis Marco S. Reis * ; Pedro M. Saraiva GEPSI-PSE Group, Department of Chemical Engineering, University of Coimbra Pólo II Pinhal de Marrocos,

More information

Regression Models - Introduction

Regression Models - Introduction Regression Models - Introduction In regression models there are two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent

More information

4.1 Least Squares Prediction 4.2 Measuring Goodness-of-Fit. 4.3 Modeling Issues. 4.4 Log-Linear Models

4.1 Least Squares Prediction 4.2 Measuring Goodness-of-Fit. 4.3 Modeling Issues. 4.4 Log-Linear Models 4.1 Least Squares Prediction 4. Measuring Goodness-of-Fit 4.3 Modeling Issues 4.4 Log-Linear Models y = β + β x + e 0 1 0 0 ( ) E y where e 0 is a random error. We assume that and E( e 0 ) = 0 var ( e

More information

Using Ridge Least Median Squares to Estimate the Parameter by Solving Multicollinearity and Outliers Problems

Using Ridge Least Median Squares to Estimate the Parameter by Solving Multicollinearity and Outliers Problems Modern Applied Science; Vol. 9, No. ; 05 ISSN 9-844 E-ISSN 9-85 Published by Canadian Center of Science and Education Using Ridge Least Median Squares to Estimate the Parameter by Solving Multicollinearity

More information

Model Fitting. Jean Yves Le Boudec

Model Fitting. Jean Yves Le Boudec Model Fitting Jean Yves Le Boudec 0 Contents 1. What is model fitting? 2. Linear Regression 3. Linear regression with norm minimization 4. Choosing a distribution 5. Heavy Tail 1 Virus Infection Data We

More information

Lecture 5: A step back

Lecture 5: A step back Lecture 5: A step back Last time Last time we talked about a practical application of the shrinkage idea, introducing James-Stein estimation and its extension We saw our first connection between shrinkage

More information

Descriptive Statistics for Symbolic Data

Descriptive Statistics for Symbolic Data Outline Descriptive Statistics for Symbolic Data Paula Brito Fac. Economia & LIAAD-INESC TEC, Universidade do Porto ECI 2015 - Buenos Aires T3: Symbolic Data Analysis: Taking Variability in Data into Account

More information

Principal Component Analysis and Singular Value Decomposition. Volker Tresp, Clemens Otte Summer 2014

Principal Component Analysis and Singular Value Decomposition. Volker Tresp, Clemens Otte Summer 2014 Principal Component Analysis and Singular Value Decomposition Volker Tresp, Clemens Otte Summer 2014 1 Motivation So far we always argued for a high-dimensional feature space Still, in some cases it makes

More information

Nonparametric Methods

Nonparametric Methods Nonparametric Methods Michael R. Roberts Department of Finance The Wharton School University of Pennsylvania July 28, 2009 Michael R. Roberts Nonparametric Methods 1/42 Overview Great for data analysis

More information

Eric Shou Stat 598B / CSE 598D METHODS FOR MICRODATA PROTECTION

Eric Shou Stat 598B / CSE 598D METHODS FOR MICRODATA PROTECTION Eric Shou Stat 598B / CSE 598D METHODS FOR MICRODATA PROTECTION INTRODUCTION Statistical disclosure control part of preparations for disseminating microdata. Data perturbation techniques: Methods assuring

More information

Generalization of the Principal Components Analysis to Histogram Data

Generalization of the Principal Components Analysis to Histogram Data Generalization of the Principal Components Analysis to Histogram Data Oldemar Rodríguez 1, Edwin Diday 1, and Suzanne Winsberg 2 1 University Paris 9 Dauphine, Ceremade Pl Du Ml de L de Tassigny 75016

More information

Estimation of Mars surface physical properties from hyperspectral images using the SIR method

Estimation of Mars surface physical properties from hyperspectral images using the SIR method Estimation of Mars surface physical properties from hyperspectral images using the SIR method Caroline Bernard-Michel, Sylvain Douté, Laurent Gardes and Stéphane Girard Source: ESA Outline I. Context Hyperspectral

More information

Regional Technical Efficiency in Europe

Regional Technical Efficiency in Europe Regional Technical Efficiency in Europe Ron Moomaw Oklahoma State University Lee Adkins Oklahoma State University June 2000 Abstract Key Words: Technical Efficiency, Regional Efficiency, Production Frontier,

More information

A Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance

A Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance CESIS Electronic Working Paper Series Paper No. 223 A Bootstrap Test for Causality with Endogenous Lag Length Choice - theory and application in finance R. Scott Hacker and Abdulnasser Hatemi-J April 200

More information

ining Dissemination Analysis Coordination Coordination Production Production Annual Report lysis Analysis Dissemination Production Production

ining Dissemination Analysis Coordination Coordination Production Production Annual Report lysis Analysis Dissemination Production Production on raining Dissemination ination Analysis Coordination mination ion sis 2010 Analysis tion Production alysis Production ction Dissemination Production ining Annual Report ysis Coordination ination uction

More information

Lecture 2 Simple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: Chapter 1

Lecture 2 Simple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: Chapter 1 Lecture Simple Linear Regression STAT 51 Spring 011 Background Reading KNNL: Chapter 1-1 Topic Overview This topic we will cover: Regression Terminology Simple Linear Regression with a single predictor

More information

Applied Econometrics. Professor Bernard Fingleton

Applied Econometrics. Professor Bernard Fingleton Applied Econometrics Professor Bernard Fingleton 1 Causation & Prediction 2 Causation One of the main difficulties in the social sciences is estimating whether a variable has a true causal effect Data

More information

POLSCI 702 Non-Normality and Heteroskedasticity

POLSCI 702 Non-Normality and Heteroskedasticity Goals of this Lecture POLSCI 702 Non-Normality and Heteroskedasticity Dave Armstrong University of Wisconsin Milwaukee Department of Political Science e: armstrod@uwm.edu w: www.quantoid.net/uwm702.html

More information

CHAPTER 5. Outlier Detection in Multivariate Data

CHAPTER 5. Outlier Detection in Multivariate Data CHAPTER 5 Outlier Detection in Multivariate Data 5.1 Introduction Multivariate outlier detection is the important task of statistical analysis of multivariate data. Many methods have been proposed for

More information

Cross-Sectional Regression after Factor Analysis: Two Applications

Cross-Sectional Regression after Factor Analysis: Two Applications al Regression after Factor Analysis: Two Applications Joint work with Jingshu, Trevor, Art; Yang Song (GSB) May 7, 2016 Overview 1 2 3 4 1 / 27 Outline 1 2 3 4 2 / 27 Data matrix Y R n p Panel data. Transposable

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Bias and variance (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 49 Our plan today We saw in last lecture that model scoring methods seem to be trading off two different

More information

Regional economic growth and environmental efficiency in greenhouse emissions: A conditional directional distance function approach

Regional economic growth and environmental efficiency in greenhouse emissions: A conditional directional distance function approach MPRA Munich Personal RePEc Archive Regional economic growth and environmental efficiency in greenhouse emissions: A conditional directional distance function approach George Halos and Nicolaos Tzeremes

More information

ECE 661: Homework 10 Fall 2014

ECE 661: Homework 10 Fall 2014 ECE 661: Homework 10 Fall 2014 This homework consists of the following two parts: (1) Face recognition with PCA and LDA for dimensionality reduction and the nearest-neighborhood rule for classification;

More information

Multivariate Lineare Modelle

Multivariate Lineare Modelle 0-1 TALEB AHMAD CASE - Center for Applied Statistics and Economics Humboldt-Universität zu Berlin Motivation 1-1 Motivation Multivariate regression models can accommodate many explanatory which simultaneously

More information