How to cope with modelling and privacy concerns? A regression model and a visualization tool for aggregated data
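1 How to cope with modelling and privacy concerns? A regression model and a visualization tool for aggregated data. Rosanna Verde, Antonio Irpino, Dominique Desbois. Second University of Naples, Dept. of Political Sciences "J. Monnet". 1 of 30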
2 Motivations and aims of the talk
Motivation: Regulations governing official statistical institutes do not allow the dissemination of microdata, for privacy reasons. In general, it is easier to obtain aggregated data on a set of individuals. Most modelling tools in statistics (e.g., regression) work on microdata and cannot be easily extended to macrodata.
Methods: In this talk, we show the use of a regression method developed for the case where both the explanatory and the response variables have quantile distributions as observations. A PCA method on quantile data is then used to visualize the relationships between the predicted distributions.
Application: The analysis has been performed on a dataset of economic indicators related to the specific cost of agricultural products in French regions.
3 DATA
We observed data coming from RICA, the French Farm Accounting Data Network (FADN), aggregated in 22 metropolitan regions of France.
CODE  REGION                CODE  REGION
121   Île de France         162   Pays de la Loire
131   Champagne-Ardenne     163   Bretagne
132   Picardie              164   Poitou-Charentes
133   Haute-Normandie       182   Aquitaine
134   Centre                183   Midi-Pyrénées
135   Basse-Normandie       184   Limousin
136   Bourgogne             192   Rhône-Alpes
141   Nord-Pas-de-Calais    193   Auvergne
151   Lorraine              201   Languedoc-Roussillon
152   Alsace                203   Provence-Alpes-Côte d'Azur
153   Franche-Comté         204   Corse
4 Economic indicators available for each region
- Y_TSC: the Total Specific Cost (TSC) of farm holdings;
- X_WHEAT: the wheat output variable;
- X_PIG: the pig output variable;
- X_MILKC: the cow milk output variable.
The available data
- Each region is described, for each variable, by the vector of the estimates of the 10 deciles of the distribution observed in that region.
Not-available information (for privacy concerns)
- Raw data are not available: for each farm, we do not know the values of the four variables.
- We do not know the association structure within each region.
- We do not know the number of farms observed for each region.
5 An example of a row of the data table (Bretagne). [Figure: CDF plot, histogram and smoothed histogram of Y_TSC, X_Wheat, X_Pig and X_Cmilk for Bretagne.]
6 The data table: CDFs (cumulative distribution functions) and corresponding histograms
7 A first research question: is it possible to predict Y_TSC from the other variables? Classic regression methods cannot be used with this kind of data. Proposal: we may use histogram-valued data analysis, specifically the regression for quantile functions (Irpino-Verde regression). Each quantile function is associated with a distribution. Irpino-Verde regression is a novel method for the regression analysis of distributional data.
8 A regression model for histogram variables based on Wasserstein distance
9 A regression model for histogram data
Data = Model Fit + Residual
Linear regression is a general method for estimating or describing the association between a continuous outcome (dependent) variable and one or more predictors in one equation. This is an easy conceptual task with classic data, but what does it mean when dealing with histogram data?
[Figure: example histograms labelled with their bin frequencies.]
Billard, Diday, IFCS 2006; Verde, Irpino, COMPSTAT 2010, CLADAG 2011; Dias, Brito, ISI.
10 Linear Regression Model for histogram data (Verde, Irpino, 2013)
Given a histogram variable X, we search for a linear transformation of X which allows us to predict the histogram variable Y. For example: is it possible to predict the distribution of Y_TSC observed in a region using a linear combination of the predictor histogram variables?
11 Multiple regression model for quantile functions
Our concurrent multiple regression model is:
$$y_i(t) = \beta_0 + \sum_{j=1}^{p} \beta_j\, x_{ij}(t) + \varepsilon_i(t)$$
and, in matrix notation:
$$Y(t) = X(t)\,\beta + \varepsilon(t)$$
where $y_i(t)$ and $x_{ij}(t)$ are the quantile functions associated with the histogram/distribution data.
This formulation is analogous to the functional linear model (Ramsay, Silverman, 2003), except that the $\beta$ parameters are constant and the functions $y_i(t)$, $x_{ij}(t)$ are quantile functions, while each $\varepsilon_i(t)$ is a residual function (distribution?) for all $i = 1, \dots, n$.
12 Parameter estimation: LS method using the Wasserstein distance
Given the nature of the variables, we propose to estimate the parameters by extending the Least Squares principle to the functional case, using a typical metric between quantile functions. The squared error is based on the $\ell_2$ Wasserstein distance between the observed and the predicted quantile functions:
$$\varepsilon_i^2\big(y_i(t), \hat{y}_i(t)\big) = \int_0^1 \Big[\, y_i(t) - \beta_0 - \sum_{j=1}^{p} \beta_j\, x_{ij}(t) \Big]^2 dt$$
where the squared $\ell_2$ Wasserstein distance between two quantile functions is:
$$d_W^2(x_i, x_j) = \int_0^1 \big[ F_i^{-1}(t) - F_j^{-1}(t) \big]^2 dt$$
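As an illustration of the distance on this slide, the integral can be approximated when each distribution is known only through its quantiles at a common set of probability levels. The following is a minimal sketch, not the authors' implementation: the function name `wasserstein2_sq`, the trapezoidal approximation and the two uniform example distributions are assumptions made for illustration.

```python
import numpy as np

def wasserstein2_sq(q_a, q_b, t):
    """Approximate the squared L2 Wasserstein distance between two quantile
    functions sampled at the same probability levels t (trapezoidal rule)."""
    q_a, q_b, t = (np.asarray(v, dtype=float) for v in (q_a, q_b, t))
    diff_sq = (q_a - q_b) ** 2
    # Trapezoidal rule for the integral of diff_sq over t
    return float(np.sum(0.5 * (diff_sq[1:] + diff_sq[:-1]) * np.diff(t)))

# Hypothetical example: deciles (plus min and max) of two uniform distributions
t = np.linspace(0.0, 1.0, 11)
q_a = 10.0 * t          # quantile function of U(0, 10)
q_b = 10.0 * t + 3.0    # same shape, shifted by 3
d2 = wasserstein2_sq(q_a, q_b, t)  # a pure shift of 3 gives a squared distance of 9
```

Because only the quantiles enter the computation, no microdata are needed, which is the situation described in the talk.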
13 Fitting the linear regression model
Find a linear transformation of the quantile functions $x_{ij}$ (for $j = 1, \dots, p$) in order to predict the quantile function of $y_i$, i.e.:
$$\hat{y}_i(t) = \beta_0 + \sum_{j=1}^{p} \beta_j\, x_{ij}(t), \quad t \in [0, 1]$$
The linear transformation is unique: the parameters $\beta_0$ and $\beta_j$ are estimated for all the $x_{ij}$ and $y_i$ distributions.
A first problem: a quantile function $\hat{y}_i(t)$ can be derived only if $\beta_j > 0$.
To overcome this problem, we propose a solution based on the decomposition of the Wasserstein distance and on the NNLS algorithm.
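The problem raised on this slide can be seen on a small example: a linear combination of quantile functions with a negative coefficient need not be non-decreasing, and so need not be a quantile function. The sketch below uses hypothetical uniform distributions and a hypothetical helper `is_quantile_function`; it only checks monotonicity on a grid.

```python
import numpy as np

def is_quantile_function(q, tol=1e-12):
    """Return True if the sampled values are non-decreasing in t."""
    return bool(np.all(np.diff(np.asarray(q, dtype=float)) >= -tol))

t = np.linspace(0.0, 1.0, 11)
x1 = 10.0 * t   # quantile function of U(0, 10)
x2 = 4.0 * t    # quantile function of U(0, 4)

# Positive coefficients: 1 + 5t + 8t is increasing, so it is a valid quantile function
y_ok = 1.0 + 0.5 * x1 + 2.0 * x2
# One negative coefficient: 1 + 5t - 8t = 1 - 3t is decreasing, so it is not
y_bad = 1.0 + 0.5 * x1 - 2.0 * x2

valid_ok = is_quantile_function(y_ok)
valid_bad = is_quantile_function(y_bad)
```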
14 OLS estimate (Irpino and Verde, 2012)
The quantile function can be decomposed as:
$$x_{ij}(t) = \bar{x}_{ij} + x_{ij}^{c}(t)$$
where $x_{ij}^{c}(t) = x_{ij}(t) - \bar{x}_{ij}$ is the centered quantile function.
Then, we propose the following regression model:
$$y_i(t) = \underbrace{\beta_0 + \sum_{j=1}^{p} \beta_j\, \bar{x}_{ij} + \sum_{j=1}^{p} \gamma_j\, x_{ij}^{c}(t)}_{\hat{y}_i(t)} + \varepsilon_i(t), \quad 0 \le t \le 1$$
Using the Wasserstein distance, it is possible to set up an OLS method that returns the two sets of coefficients $(\beta_0, \beta_j;\ \gamma_j)$, under a positivity constraint on the $\gamma_j$.
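The two-component model on this slide can be sketched numerically. Because a centered quantile function integrates to zero, the squared Wasserstein error splits into a squared difference of means plus an integral of squared differences of centered quantile functions, so the $\beta$ and $\gamma$ coefficients can be estimated separately. The code below is a rough sketch under stated assumptions, not the authors' implementation: the function names, the trapezoidal discretization and the synthetic uniform data are hypothetical, and the non-negative least-squares step is done by enumerating coefficient supports (a simple stand-in for the NNLS algorithm that is only practical for a few predictors).

```python
import numpy as np
from itertools import combinations

def trapezoid_weights(t):
    """Trapezoidal quadrature weights for the probability grid t."""
    w = np.zeros_like(t)
    dt = np.diff(t)
    w[:-1] += dt / 2.0
    w[1:] += dt / 2.0
    return w

def nonneg_lstsq(A, b):
    """Least squares with non-negative coefficients, by trying every subset
    of columns and keeping the best feasible fit (small p only)."""
    p = A.shape[1]
    best, best_rss = np.zeros(p), float(b @ b)
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            cols = list(cols)
            coef, *_ = np.linalg.lstsq(A[:, cols], b, rcond=None)
            if np.all(coef >= 0):
                rss = float(np.sum((b - A[:, cols] @ coef) ** 2))
                if rss < best_rss:
                    best = np.zeros(p)
                    best[cols] = coef
                    best_rss = rss
    return best

def fit_two_component(Y, X, t):
    """Y: (n, m) response quantiles; X: (n, p, m) predictor quantiles."""
    w = trapezoid_weights(t)
    y_bar, x_bar = Y @ w, X @ w            # means = integrals of quantile functions
    Yc = Y - y_bar[:, None]                # centered quantile functions
    Xc = X - x_bar[:, :, None]
    # Mean component: ordinary least squares with an intercept
    D = np.column_stack([np.ones(len(y_bar)), x_bar])
    beta, *_ = np.linalg.lstsq(D, y_bar, rcond=None)
    # Variability component: weighted non-negative least squares
    sw = np.sqrt(w)
    A = (Xc * sw).transpose(0, 2, 1).reshape(-1, X.shape[1])
    b = (Yc * sw).reshape(-1)
    gamma = nonneg_lstsq(A, b)
    return beta, gamma

# Synthetic example: 6 units, 2 predictors, uniform distributions
t = np.linspace(0.0, 1.0, 11)
m = np.array([[1, 5], [2, 3], [4, 8], [6, 2], [7, 7], [9, 4]], dtype=float)
s = np.array([[1, 2], [2, 1], [3, 4], [1, 3], [4, 2], [2, 5]], dtype=float)
X = m[:, :, None] + s[:, :, None] * (t - 0.5)
Y = (2.0 + 1.5 * m[:, 0] - 0.5 * m[:, 1])[:, None] \
    + (0.8 * s[:, 0] + 0.3 * s[:, 1])[:, None] * (t - 0.5)
beta, gamma = fit_two_component(Y, X, t)
```

In this noise-free example the fit recovers the generating coefficients; the $\beta$ values may take either sign, while the $\gamma$ values are constrained to be non-negative so that the predicted function stays non-decreasing.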
15 Interpretation of the parameters
Regression parameters for the distribution mean locations: $\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_p \in \mathbb{R}$.
Shrinking factors for the variability: $\hat{\gamma}_1, \dots, \hat{\gamma}_p$, constrained to be positive.
$\hat{\gamma}_j > 1$ ($< 1$): the $\hat{y}_i$ histogram has a greater (smaller) variability than the $x_{ij}$ histogram.
16 Advantages of the regression on quantile functions
The regression on quantile functions takes into account the whole distribution (described by the quantiles). It is more powerful than a classic regression on the means of the distributions because it considers information about the sizes and shapes of the distributions.
It differs from the well-known quantile regression, which requires all the microdata and estimates one quantile at a time (independently of the others). In that case, the order among the estimated quantiles is not guaranteed.
Our method works on aggregated data when microdata are not available, and estimates the quantiles using a single model. The method guarantees the natural order among the estimated quantiles. The method is less affected by outlying observations, and thus gives a more robust estimation of the tails of the distributions.
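17 Regression results (only the first 19 regions are used; the last three have regressors equal to zero)
The estimated model has the form (numerical coefficient values missing in the transcription):
$$\widehat{\mathrm{Y\_TSC}}_i(t) = \hat{\beta}_0 + \hat{\beta}_1\, \overline{\mathrm{X\_WHEAT}}_i + \hat{\gamma}_1\, \mathrm{X\_WHEAT}_i^{c}(t) + \hat{\beta}_2\, \overline{\mathrm{X\_PIG}}_i + \hat{\gamma}_2\, \mathrm{X\_PIG}_i^{c}(t) + \hat{\beta}_3\, \overline{\mathrm{X\_MILKC}}_i + \hat{\gamma}_3\, \mathrm{X\_MILKC}_i^{c}(t), \quad t \in [0; 1]$$
Goodness-of-fit indices
- Root Mean Square Error (Verde & Irpino, 2013): RMSE = 7,238.2;
- Omega index (Dias & Brito, 2011): Ω = [value missing] (0 worst fitting, 1 best fitting);
- Pseudo R-squared (Verde & Irpino, 2013): PR² = [value missing] (0 worst fitting, 1 best fitting).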
18 Plot of observed CDFs vs predicted CDFs. [Figure legend: observed, predicted.]
19 Plot of observed vs predicted CDFs (zoom)
20 A visualization tool for distributions
Motivations: A distribution, being a function, is high-dimensional data. The plots in the last two slides show that this kind of visualization is not very communicative: it is difficult to compare different distributions visually. We need a visualization tool that arranges the distributions graphically according to a similarity criterion.
A new visualization tool: Quantile PCA (Irpino and Verde, 2013). Once a fixed number of quantiles has been chosen, Quantile PCA performs a principal component analysis on a single distributional variable (a column of the data table).
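21 PCA of quantiles: the X matrix decomposed in Q-PCA
We fix a set of m quantiles. Each individual is represented by a sequence of m+1 ordered values (including the minimum value):
$$x_i = \big[\min(x_i),\ Q_{i,1},\ \dots,\ Q_{i,j},\ \dots,\ Q_{i,m-1},\ \max(x_i)\big]$$
$$X = \begin{bmatrix}
\min(x_1) & Q_{1,1} & \cdots & Q_{1,j} & \cdots & Q_{1,m-1} & \max(x_1) \\
\vdots & \vdots & & \vdots & & \vdots & \vdots \\
\min(x_i) & Q_{i,1} & \cdots & Q_{i,j} & \cdots & Q_{i,m-1} & \max(x_i) \\
\vdots & \vdots & & \vdots & & \vdots & \vdots \\
\min(x_N) & Q_{N,1} & \cdots & Q_{N,j} & \cdots & Q_{N,m-1} & \max(x_N)
\end{bmatrix}$$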
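22 Average quantiles vector
The m+1 quantile column variables of X are centered:
$$X = \begin{bmatrix}
\min(x_1) & Q_{1,1} & \cdots & Q_{1,j} & \cdots & Q_{1,m-1} & \max(x_1) \\
\vdots & \vdots & & \vdots & & \vdots & \vdots \\
\min(x_i) & Q_{i,1} & \cdots & Q_{i,j} & \cdots & Q_{i,m-1} & \max(x_i) \\
\vdots & \vdots & & \vdots & & \vdots & \vdots \\
\min(x_N) & Q_{N,1} & \cdots & Q_{N,j} & \cdots & Q_{N,m-1} & \max(x_N)
\end{bmatrix}$$
The vector of average quantiles is:
$$\bar{x} = \big[\overline{\min(x)},\ \bar{Q}_1,\ \dots,\ \bar{Q}_j,\ \dots,\ \bar{Q}_{m-1},\ \overline{\max(x)}\big]$$
and the centered matrix is:
$$X_c = X - \mathbf{1}_N\, \bar{x}$$
with $\mathbf{1}_N$ the vector of N ones.
A PCA on the variance-covariance matrix of the quantiles is performed. Note: the trace of the covariance matrix is an approximation of a variance measure defined for a distributional variable.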
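The centering and eigen-decomposition steps described on this slide can be sketched as follows. This is a minimal illustration under assumptions: the function name `quantile_pca`, the division by N (rather than N - 1) in the covariance and the small example matrix are hypothetical, and the input is simply an N × (m+1) array of ordered quantile values, one row per unit.

```python
import numpy as np

def quantile_pca(Q):
    """PCA on the variance-covariance matrix of the quantile columns.
    Q has one row per unit and one column per quantile (min, ..., max)."""
    Q = np.asarray(Q, dtype=float)
    Qc = Q - Q.mean(axis=0)                  # subtract the average quantiles vector
    cov = Qc.T @ Qc / Q.shape[0]             # covariance (assumption: divide by N)
    eigval, eigvec = np.linalg.eigh(cov)     # symmetric eigen-decomposition
    order = np.argsort(eigval)[::-1]         # sort by decreasing eigenvalue
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = Qc @ eigvec                     # unit coordinates on the components
    explained = eigval / eigval.sum()        # share of inertia per component
    return eigval, eigvec, scores, explained

# Hypothetical data: 5 units described by 4 ordered quantile values each
Q = np.array([[1, 2, 3, 5],
              [2, 3, 5, 9],
              [1, 3, 4, 6],
              [3, 4, 6, 12],
              [2, 2, 4, 7]], dtype=float)
eigval, eigvec, scores, explained = quantile_pca(Q)
```

The first two columns of `scores` give the coordinates used for a first factorial plane, and the sum of the eigenvalues equals the trace of the covariance matrix mentioned in the note above.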
23 Eigenvalues and explained inertia
Wasserstein-based variance of the variable Y_TSC = [value missing] × 10^8; trace of the quantile variance-covariance matrix = [value missing] × 10^8.
[Table: inertia, % of explained inertia and cumulative % for each eigenvalue; values missing in the transcription. Surviving eigenvalue entries: 1,70E+07; 2,23E+06; 9,50E+05; 2,70E+05; 2,57E+08.]
[Figure: eigenvalues (E1 to E5) and cumulative percentage of explained inertia (E1 to E9).]
24 The plot of variables: the Spanish-fan plot. [Figure labels: median, upper quantiles, lower quantiles.] Comment: a large part of the variability is due to differences in the right tail (right-skewness).
25 Principal Component Analysis of the Y_TSC variable: first factorial plane
26 PCA of the Y_TSC variable: first factorial plane (distributions colored according to their means). [Figure legend: higher mean, lower mean.]
27 PCA of the Y_TSC variable: first factorial plane (distributions colored according to their standard deviations). [Figure legend: lower std, higher std.]
28 PCA of the Y_TSC variable: first factorial plane, a joint view. Comment: means and standard deviations seem slightly positively correlated; both are related to distributions with heavy right tails.
29 Conclusions
In this talk, starting from aggregated data and without access to the microdata, we showed that it is possible to analyze, predict and display such summary structures using new tools from distributional-valued data analysis, defined in a space of univariate distributions equipped with the L2 Wasserstein metric.
A regression technique is able to work with this kind of data, and it provides accurate and interpretable results (also for practitioners) for interpreting the causal relationships.
A PCA on quantiles is a promising tool for a fast and easy visualization of the different features of the distributions.
An R package is going to be released in the next quarter.
As future work, a graphical analysis of predicted vs observed distribution-valued data can be introduced using more sophisticated factorial analysis. (This is in progress.)
30 Main references
1. Arroyo, J., Maté, C.: Forecasting histogram time series with k-nearest neighbors methods. International Journal of Forecasting 25 (1), (2009)
2. Arroyo, J., González-Rivera, G., Maté, C.: Forecasting with interval and histogram data. Some financial applications. Handbook of Empirical Economics and Finance, (2010)
3. Dias, S., Brito, P.: A new linear regression model for histogram-valued variables. In: 58th ISI World Statistics Congress, Dublin, Ireland, URL: (2011)
4. Irpino, A., Verde, R.: Dimension reduction techniques for distributional symbolic data. In: SIS 2013 Statistical Conference, Advances in Latent Variables - Methods, Models and Applications. URL: (2013)
5. Irpino, A., Verde, R.: Ordinary Least Squares for Histogram Data Based on Wasserstein Distance. In: COMPSTAT 2010, 19th Conference of IASC-ERS (Physica Verlag), pp (2010)
6. Irpino, A., Verde, R.: Basic statistics for distributional symbolic variables: a new metric-based approach. Advances in Data Analysis and Classification, doi: /s (in press, 2015)
7. Rüschendorf, L.: Wasserstein metric. In: Hazewinkel, M., Encyclopedia of Mathematics, Springer (2001)
8. Verde, R., Irpino, A.: Multiple Linear Regression for Histogram Data using Least Squares of Quantile Functions: a Two-components model. Revue des Nouvelles Technologies de l'Information, vol. RNTI-E-25, p (2013)
31 How to cope with modelling and privacy concerns? A regression model and a visualization tool for aggregated data. Thanks for listening. Antonio Irpino, Rosanna Verde, Dominique Desbois, November 25th
More information401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.
401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis
More informationExpression Data Exploration: Association, Patterns, Factors & Regression Modelling
Expression Data Exploration: Association, Patterns, Factors & Regression Modelling Exploring gene expression data Scale factors, median chip correlation on gene subsets for crude data quality investigation
More informationAccounting for measurement uncertainties in industrial data analysis
Accounting for measurement uncertainties in industrial data analysis Marco S. Reis * ; Pedro M. Saraiva GEPSI-PSE Group, Department of Chemical Engineering, University of Coimbra Pólo II Pinhal de Marrocos,
More informationRegression Models - Introduction
Regression Models - Introduction In regression models there are two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent
More information4.1 Least Squares Prediction 4.2 Measuring Goodness-of-Fit. 4.3 Modeling Issues. 4.4 Log-Linear Models
4.1 Least Squares Prediction 4. Measuring Goodness-of-Fit 4.3 Modeling Issues 4.4 Log-Linear Models y = β + β x + e 0 1 0 0 ( ) E y where e 0 is a random error. We assume that and E( e 0 ) = 0 var ( e
More informationUsing Ridge Least Median Squares to Estimate the Parameter by Solving Multicollinearity and Outliers Problems
Modern Applied Science; Vol. 9, No. ; 05 ISSN 9-844 E-ISSN 9-85 Published by Canadian Center of Science and Education Using Ridge Least Median Squares to Estimate the Parameter by Solving Multicollinearity
More informationModel Fitting. Jean Yves Le Boudec
Model Fitting Jean Yves Le Boudec 0 Contents 1. What is model fitting? 2. Linear Regression 3. Linear regression with norm minimization 4. Choosing a distribution 5. Heavy Tail 1 Virus Infection Data We
More informationLecture 5: A step back
Lecture 5: A step back Last time Last time we talked about a practical application of the shrinkage idea, introducing James-Stein estimation and its extension We saw our first connection between shrinkage
More informationDescriptive Statistics for Symbolic Data
Outline Descriptive Statistics for Symbolic Data Paula Brito Fac. Economia & LIAAD-INESC TEC, Universidade do Porto ECI 2015 - Buenos Aires T3: Symbolic Data Analysis: Taking Variability in Data into Account
More informationPrincipal Component Analysis and Singular Value Decomposition. Volker Tresp, Clemens Otte Summer 2014
Principal Component Analysis and Singular Value Decomposition Volker Tresp, Clemens Otte Summer 2014 1 Motivation So far we always argued for a high-dimensional feature space Still, in some cases it makes
More informationNonparametric Methods
Nonparametric Methods Michael R. Roberts Department of Finance The Wharton School University of Pennsylvania July 28, 2009 Michael R. Roberts Nonparametric Methods 1/42 Overview Great for data analysis
More informationEric Shou Stat 598B / CSE 598D METHODS FOR MICRODATA PROTECTION
Eric Shou Stat 598B / CSE 598D METHODS FOR MICRODATA PROTECTION INTRODUCTION Statistical disclosure control part of preparations for disseminating microdata. Data perturbation techniques: Methods assuring
More informationGeneralization of the Principal Components Analysis to Histogram Data
Generalization of the Principal Components Analysis to Histogram Data Oldemar Rodríguez 1, Edwin Diday 1, and Suzanne Winsberg 2 1 University Paris 9 Dauphine, Ceremade Pl Du Ml de L de Tassigny 75016
More informationEstimation of Mars surface physical properties from hyperspectral images using the SIR method
Estimation of Mars surface physical properties from hyperspectral images using the SIR method Caroline Bernard-Michel, Sylvain Douté, Laurent Gardes and Stéphane Girard Source: ESA Outline I. Context Hyperspectral
More informationRegional Technical Efficiency in Europe
Regional Technical Efficiency in Europe Ron Moomaw Oklahoma State University Lee Adkins Oklahoma State University June 2000 Abstract Key Words: Technical Efficiency, Regional Efficiency, Production Frontier,
More informationA Bootstrap Test for Causality with Endogenous Lag Length Choice. - theory and application in finance
CESIS Electronic Working Paper Series Paper No. 223 A Bootstrap Test for Causality with Endogenous Lag Length Choice - theory and application in finance R. Scott Hacker and Abdulnasser Hatemi-J April 200
More informationining Dissemination Analysis Coordination Coordination Production Production Annual Report lysis Analysis Dissemination Production Production
on raining Dissemination ination Analysis Coordination mination ion sis 2010 Analysis tion Production alysis Production ction Dissemination Production ining Annual Report ysis Coordination ination uction
More informationLecture 2 Simple Linear Regression STAT 512 Spring 2011 Background Reading KNNL: Chapter 1
Lecture Simple Linear Regression STAT 51 Spring 011 Background Reading KNNL: Chapter 1-1 Topic Overview This topic we will cover: Regression Terminology Simple Linear Regression with a single predictor
More informationApplied Econometrics. Professor Bernard Fingleton
Applied Econometrics Professor Bernard Fingleton 1 Causation & Prediction 2 Causation One of the main difficulties in the social sciences is estimating whether a variable has a true causal effect Data
More informationPOLSCI 702 Non-Normality and Heteroskedasticity
Goals of this Lecture POLSCI 702 Non-Normality and Heteroskedasticity Dave Armstrong University of Wisconsin Milwaukee Department of Political Science e: armstrod@uwm.edu w: www.quantoid.net/uwm702.html
More informationCHAPTER 5. Outlier Detection in Multivariate Data
CHAPTER 5 Outlier Detection in Multivariate Data 5.1 Introduction Multivariate outlier detection is the important task of statistical analysis of multivariate data. Many methods have been proposed for
More informationCross-Sectional Regression after Factor Analysis: Two Applications
al Regression after Factor Analysis: Two Applications Joint work with Jingshu, Trevor, Art; Yang Song (GSB) May 7, 2016 Overview 1 2 3 4 1 / 27 Outline 1 2 3 4 2 / 27 Data matrix Y R n p Panel data. Transposable
More informationMS&E 226: Small Data
MS&E 226: Small Data Lecture 6: Bias and variance (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 49 Our plan today We saw in last lecture that model scoring methods seem to be trading off two different
More informationRegional economic growth and environmental efficiency in greenhouse emissions: A conditional directional distance function approach
MPRA Munich Personal RePEc Archive Regional economic growth and environmental efficiency in greenhouse emissions: A conditional directional distance function approach George Halos and Nicolaos Tzeremes
More informationECE 661: Homework 10 Fall 2014
ECE 661: Homework 10 Fall 2014 This homework consists of the following two parts: (1) Face recognition with PCA and LDA for dimensionality reduction and the nearest-neighborhood rule for classification;
More informationMultivariate Lineare Modelle
0-1 TALEB AHMAD CASE - Center for Applied Statistics and Economics Humboldt-Universität zu Berlin Motivation 1-1 Motivation Multivariate regression models can accommodate many explanatory which simultaneously
More information