Normalisation with respect to pattern


Iwona Müller-Frączek
Nicolaus Copernicus University in Toruń, Poland

Author's address: I. Müller-Frączek, Faculty of Economic Sciences and Management, Nicolaus Copernicus University, ul. Gagarina 13a, 87-100 Toruń, Poland, e-mail: muller@umk.pl

Abstract

The article presents a new normalisation method for diagnostic variables: normalisation with respect to the pattern. The normalisation preserves some important descriptive characteristics of variables: kurtosis and, up to sign, skewness and the Pearson correlation coefficients. It is particularly useful in dynamic analysis, when we work with the whole population of objects rather than a sample, for example in regional studies. After the proposed transformation, variables are comparable not only with each other but also across time. They can then be used, for example, to construct composite variables.

Keywords: normalisation, standardisation, composite variable, synthetic measure

1 Introduction

In regional studies we often need to compare regions (objects) with respect to an analyzed complex (or composite) phenomenon. A complex phenomenon is a qualitative phenomenon that is characterized by some quantitative features, called diagnostic variables. Each object is then identified with a point of a multidimensional real space. One of the tools of regional research is the composite variable (or synthetic measure). A composite variable is created to map multidimensional points (objects) into one-dimensional space. Many advanced methods of constructing synthetic variables have been developed; however, the simplest methods are often used in practice. There are many such examples (see [2]); one of them is the very popular Human Development Index (HDI), which ranks countries into four tiers of socio-economic development. Until 2010 the HDI was a uniformly weighted sum of three indicators describing life expectancy, education, and income per capita.

One of the steps in the construction of a synthetic measure is bringing diagnostic variables to comparability, called normalisation or standardisation. Normalisation strips variables of their units and unifies their ranges. There are many normalisation formulas (see [4], [5], [8]). Choosing a proper method is important because normalisation influences the results of object ordering.

The usual stochastic approach can be used to determine the parameters needed for normalisation. We then treat the values of a variable (observations) as a randomly selected sample of the population. This approach should not be used in regional research, where we work with the whole population of objects. In this case we should use a descriptive (deterministic) approach.

Normalisation formulas are most often given for static analysis, that is, for a fixed point in time. A normalisation problem appears when we want to compare the situations of regions at several time points. The variables should then also be comparable across time. To achieve this effect in the stochastic approach, one can use all values of the variable (both over objects and over time) to determine the parameters needed for normalisation. However, this solution is controversial in the descriptive approach (see [9]); in addition, it requires incessant recalculation of results when later observations arrive. In this case we should rather use only current observations, but then, after the usual normalisation, variables are not comparable across time. We cannot compare the values of synthetic measures; we can only compare rankings. To solve this problem in the aforementioned Human Development Index, the parameters of feature scaling are fixed at levels that are not related to the distribution of the variable. The levels are justified on substantive grounds. For example, the age of 85 was established as the maximum life expectancy at birth.

The article proposes a new method of feature normalisation: normalisation with respect to the pattern, or pattern normalisation for short. The name was inspired by Hellwig's paper (see [3], [1]). The method is consistent with the static approach, but it can be used to compare objects at different time points. The method meets the requirements for normalisation suggested in the literature (see [4], [6]). It preserves kurtosis and the absolute value of skewness. Moreover, the absolute values of the Pearson correlation coefficients are not changed by normalisation.

In the first step of pattern normalisation, the nature of the variable is determined in the context of the analyzed complex phenomenon. We distinguish stimulants and destimulants. A stimulant is a diagnostic variable that has a positive impact on the analyzed complex phenomenon, while a destimulant has a negative one. In regional research, determining the nature of variables is natural. Most often, before normalisation, destimulants are turned into stimulants using their inverse values. Unfortunately, the converted variables lose their interpretation and their distributions change. In the presented method, we do not convert destimulants before normalisation. Destimulants and stimulants are normalised in different ways.

Determining the nature of a variable allows us to choose the most beneficial observation among all values of the variable: the maximum for a stimulant and the minimum for a destimulant. We call this value the pattern. Next we convert all values with respect to this pattern. After the transformation we get comparable variables. All of them are destimulants with a clear interpretation.

Pattern normalisation can be used in the common construction of composite variables instead of other normalisation methods. A possible application is shown in [7].

2 Definition of pattern normalisation

Suppose that a complex phenomenon observed for $n$ regions is analyzed. Assume that we cannot measure this phenomenon directly, but we know a collection of measurable diagnostic variables that characterize it. Assume that the diagnostic variables meet both substantive and statistical requirements (for more details see, for example, [9]). Let us consider one such variable $x = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n$, which is either a stimulant (we then write $x \in S$, where $S$ denotes the set of stimulants) or a destimulant ($x \in D$, analogously).

In the first step we choose a pattern: the most beneficial of all values of the variable $x$. The pattern is the same for all objects and is given by the formula:

$$x^+ = \begin{cases} \max_i x_i & \text{if } x \in S, \\ \min_i x_i & \text{if } x \in D. \end{cases} \qquad (1)$$

After specifying the pattern $x^+$ we can consider, instead of the variable $x$, a new variable $u^+$ given by:

$$u_i^+ = \frac{|x_i - x^+|}{\sum_{j=1}^{n} |x_j - x^+|} = \begin{cases} \dfrac{x^+ - x_i}{\sum_{j=1}^{n} (x^+ - x_j)} & \text{if } x \in S, \\[2ex] \dfrac{x_i - x^+}{\sum_{j=1}^{n} (x_j - x^+)} & \text{if } x \in D. \end{cases} \qquad (2)$$
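Formulas (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's implementation: the function names `pattern` and `pattern_normalise`, the data values, and the handling of a constant variable (for which the denominator of (2) is zero, a case the article does not discuss) are assumptions made here.

```python
from typing import List, Sequence


def pattern(x: Sequence[float], stimulant: bool) -> float:
    """Formula (1): the most beneficial observed value of x."""
    return max(x) if stimulant else min(x)


def pattern_normalise(x: Sequence[float], stimulant: bool) -> List[float]:
    """Formula (2): share of each object's distance to the pattern
    in the total distance of all objects from the pattern."""
    x_plus = pattern(x, stimulant)
    distances = [abs(x_i - x_plus) for x_i in x]
    total = sum(distances)
    if total == 0:
        # Assumption: a constant variable is not covered by the source;
        # formula (2) would divide by zero, so we refuse to normalise it.
        raise ValueError("all values equal the pattern; formula (2) is undefined")
    return [d / total for d in distances]


# Hypothetical data for four regions.
income = [2.0, 4.0, 6.0, 8.0]        # stimulant, pattern = 8
unemployment = [5.0, 3.0, 9.0, 7.0]  # destimulant, pattern = 3

u_income = pattern_normalise(income, stimulant=True)
u_unemployment = pattern_normalise(unemployment, stimulant=False)
```

In both cases the region that attains the pattern gets the value 0 and the shares sum to 1, so the two normalised variables can be read on the same scale even though one started as a stimulant and the other as a destimulant.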

Formula (2) defines a transformation of the initial variable $x = (x_1, x_2, \ldots, x_n)$ into a new variable $u^+ = (u_1^+, u_2^+, \ldots, u_n^+)$. We call it normalisation with respect to the pattern. After this transformation, the new variable describes the same aspect of the complex phenomenon as $x$ does, so $u^+$ is a diagnostic variable of this phenomenon.

Pattern normalisation (2) is not just a technical procedure. The new variable has a clear interpretation: $u_i^+$ specifies the share of the distance between the $i$-th object and the pattern in the total distance of all objects from the pattern. The lower the value of $u_i^+$, the better the situation of the $i$-th object.

The values of the variable $u^+$ characterize the positions of objects in the whole system. This is the same as for other forms of normalisation, but the system is specified in a different way. In pattern normalisation the system is represented by the sum of distances between the objects and the pattern, while in common normalisations descriptive characteristics of the distribution of $x$ are used for this purpose.

3 Properties of variable after normalisation

A quantitative description of an immeasurable qualitative phenomenon is obtained using synthetic measures. Bringing diagnostic variables to comparability is the first step in the construction of such a measure. Pattern normalisation can be used for this purpose. Assume that the diagnostic variables are transformed with respect to their patterns. Then the new set of variables has the advantages expected when creating synthetic variables. These properties and some proofs are presented below.

A. Basic properties

A1. All variables after pattern normalisation are unitless, non-negative and limited to the interval $[0, 1]$. Because of that, the new set of diagnostic variables contains comparable elements.

A2. Irrespective of its initial nature, a variable after pattern normalisation becomes a destimulant. This means that the lower the value $u_i^+$, the better the situation of the $i$-th object. In this sense pattern normalisation unifies the nature of diagnostic variables.

A3. The transformation of variables does not affect the ordering of objects.

B. Extreme values after pattern normalisation

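B1. The variable $u^+$ takes the value zero only for the pattern object. Since the pattern is chosen among the values of the variable $x$, the value zero is always attained.

$$u_i^+ = 0 \iff x_i = x^+.$$

Proof.

$$u_i^+ = 0 \iff \frac{|x_i - x^+|}{\sum_{j=1}^{n} |x_j - x^+|} = 0 \iff |x_i - x^+| = 0 \iff x_i = x^+.$$

B2. The value $u_i^+$ equals 1 when all objects except the $i$-th are patterns. This situation is rather unrealistic.

$$u_i^+ = 1 \iff \forall_{j \neq i}\; x_j = x^+.$$

Proof.

$$u_i^+ = 1 \iff \frac{|x_i - x^+|}{\sum_{j=1}^{n} |x_j - x^+|} = 1 \iff |x_i - x^+| = \sum_{j=1}^{n} |x_j - x^+| \iff \sum_{j \neq i} |x_j - x^+| = 0 \iff \forall_{j \neq i}\; x_j = x^+.$$

B3. The maximum value of $u^+$ depends on the nature of the variable $x$ and is given by:

$$\max_i u_i^+ = \begin{cases} \dfrac{\max_i x_i - \min_i x_i}{\sum_{j=1}^{n} (\max_i x_i - x_j)} & \text{if } x \in S, \\[2ex] \dfrac{\max_i x_i - \min_i x_i}{\sum_{j=1}^{n} (x_j - \min_i x_i)} & \text{if } x \in D. \end{cases}$$

Proof. If $x \in S$, then:

$$\max_i u_i^+ = \frac{\max_i (x^+ - x_i)}{\sum_{j=1}^{n} (x^+ - x_j)} = \frac{x^+ - \min_i x_i}{\sum_{j=1}^{n} (x^+ - x_j)} = \frac{\max_i x_i - \min_i x_i}{\sum_{j=1}^{n} (\max_i x_i - x_j)}.$$

If $x \in D$, then:

$$\max_i u_i^+ = \frac{\max_i (x_i - x^+)}{\sum_{j=1}^{n} (x_j - x^+)} = \frac{\max_i x_i - x^+}{\sum_{j=1}^{n} (x_j - x^+)} = \frac{\max_i x_i - \min_i x_i}{\sum_{j=1}^{n} (x_j - \min_i x_i)}.$$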
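Properties B1 to B3 can be illustrated on a small made-up example. The sketch below uses exact rational arithmetic from Python's `fractions` module so that the equalities hold exactly; the data and the helper name `pattern_normalise` are assumptions introduced for illustration.

```python
from fractions import Fraction


def pattern_normalise(x, stimulant):
    """Formula (2) in exact rational arithmetic."""
    x_plus = max(x) if stimulant else min(x)
    distances = [abs(Fraction(x_i) - x_plus) for x_i in x]
    total = sum(distances)
    return [d / total for d in distances]


x = [2, 4, 6, 8]  # hypothetical stimulant, pattern = 8
u = pattern_normalise(x, stimulant=True)

# B1: zero is attained exactly at the pattern object.
zero_positions = [i for i, u_i in enumerate(u) if u_i == 0]

# B3 for a stimulant: (max x - min x) / sum_j (max x - x_j).
b3_value = Fraction(max(x) - min(x), sum(max(x) - x_j for x_j in x))

# B2: only one object differs from the pattern, so its share equals 1.
u_degenerate = pattern_normalise([8, 8, 5, 8], stimulant=True)
```

Here the distances to the pattern are 6, 4, 2 and 0, so the normalised values are 1/2, 1/3, 1/6 and 0, and the largest of them coincides with the B3 expression.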

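C. Descriptive characteristics of normalised variables

C1. The mean value of $u^+$ depends only on the number of objects and is inversely proportional to this number. It is given by:

$$\bar{u}^+ = \frac{1}{n}.$$

Proof.

$$\bar{u}^+ \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} u_i^+ = \frac{1}{n} \sum_{i=1}^{n} \frac{|x_i - x^+|}{\sum_{j=1}^{n} |x_j - x^+|} = \frac{1}{n} \cdot \frac{\sum_{i=1}^{n} |x_i - x^+|}{\sum_{j=1}^{n} |x_j - x^+|} = \frac{1}{n}.$$

C2. The variance of $u^+$ is given by:

$$S^2(u^+) = \frac{S^2(x)}{n^2 (x^+ - \bar{x})^2}.$$

Proof. By definition and C1,

$$S^2(u^+) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \left( u_i^+ - \bar{u}^+ \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{|x_i - x^+|}{\sum_{j=1}^{n} |x_j - x^+|} - \frac{1}{n} \right)^2.$$

If $x \in S$, then $\sum_{j=1}^{n} (x^+ - x_j) = n (x^+ - \bar{x})$ and:

$$S^2(u^+) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x^+ - x_i}{n (x^+ - \bar{x})} - \frac{1}{n} \right)^2 = \frac{1}{n^3} \sum_{i=1}^{n} \left( \frac{x^+ - x_i}{x^+ - \bar{x}} - 1 \right)^2 = \frac{1}{n^3} \sum_{i=1}^{n} \left( \frac{\bar{x} - x_i}{x^+ - \bar{x}} \right)^2 = \frac{S^2(x)}{n^2 (x^+ - \bar{x})^2}.$$

The proof is similar when $x \in D$.

C3. The standard deviation of $u^+$ depends on the nature of the variable $x$ and is given by:

$$S(u^+) \overset{\text{def}}{=} \sqrt{S^2(u^+)} = \begin{cases} \dfrac{S(x)}{n (x^+ - \bar{x})} & \text{if } x \in S, \\[2ex] \dfrac{S(x)}{n (\bar{x} - x^+)} & \text{if } x \in D. \end{cases}$$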
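A quick numerical check of C1 to C3 may help when implementing the method. The following sketch compares the mean, variance and standard deviation computed directly from the normalised values with the closed-form expressions above. The data are hypothetical, the helper names are introduced here, and the variance uses the divisor $n$ (descriptive approach), as in the definitions above.

```python
from math import isclose, sqrt


def pattern_normalise(x, stimulant):
    """Formula (2): share of each distance to the pattern in the total distance."""
    x_plus = max(x) if stimulant else min(x)
    distances = [abs(x_i - x_plus) for x_i in x]
    total = sum(distances)
    return [d / total for d in distances]


def mean(v):
    return sum(v) / len(v)


def variance(v):
    """Descriptive (population) variance with divisor n, as in C2."""
    m = mean(v)
    return sum((v_i - m) ** 2 for v_i in v) / len(v)


x = [5.0, 3.0, 9.0, 7.0]  # hypothetical destimulant, pattern = min(x) = 3
n = len(x)
x_plus = min(x)
u = pattern_normalise(x, stimulant=False)

mean_u = mean(u)                                                 # C1: 1/n
var_formula = variance(x) / (n ** 2 * (x_plus - mean(x)) ** 2)   # C2
sd_formula = sqrt(variance(x)) / (n * (mean(x) - x_plus))        # C3, x in D
```

For these data the mean of `u` is 1/4, and both the directly computed variance and the C2 expression equal 5/144.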

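C4. The coefficient of variation of $u^+$ is given by:

$$CV(u^+) \overset{\text{def}}{=} \frac{S(u^+)}{\bar{u}^+} = \begin{cases} \dfrac{S(x)}{x^+ - \bar{x}} & \text{if } x \in S, \\[2ex] \dfrac{S(x)}{\bar{x} - x^+} & \text{if } x \in D. \end{cases}$$

C5. The third central moment of $u^+$ is given by:

$$\mu_3(u^+) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \left( u_i^+ - \bar{u}^+ \right)^3 = \frac{\mu_3(x)}{n^3 (\bar{x} - x^+)^3}.$$

Proof.

$$\mu_3(u^+) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{|x_i - x^+|}{\sum_{j=1}^{n} |x_j - x^+|} - \frac{1}{n} \right)^3.$$

If $x \in S$, then:

$$\mu_3(u^+) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x^+ - x_i}{n (x^+ - \bar{x})} - \frac{1}{n} \right)^3 = \frac{1}{n^4} \sum_{i=1}^{n} \left( \frac{x^+ - x_i}{x^+ - \bar{x}} - 1 \right)^3 = \frac{1}{n^4} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\bar{x} - x^+} \right)^3 = \frac{\mu_3(x)}{n^3 (\bar{x} - x^+)^3}.$$

The proof is similar when $x \in D$.

C6. The absolute value of the coefficient of skewness does not change after pattern normalisation:

$$A(u^+) \overset{\text{def}}{=} \frac{\mu_3(u^+)}{S^3(u^+)} = \begin{cases} -A(x) & \text{if } x \in S, \\ A(x) & \text{if } x \in D. \end{cases}$$

C7. The fourth central moment of $u^+$ is given by:

$$\mu_4(u^+) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \left( u_i^+ - \bar{u}^+ \right)^4 = \frac{\mu_4(x)}{n^4 (x^+ - \bar{x})^4}.$$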

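Proof.

$$\mu_4(u^+) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{|x_i - x^+|}{\sum_{j=1}^{n} |x_j - x^+|} - \frac{1}{n} \right)^4.$$

If $x \in S$, then:

$$\mu_4(u^+) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x^+ - x_i}{n (x^+ - \bar{x})} - \frac{1}{n} \right)^4 = \frac{1}{n^5} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\bar{x} - x^+} \right)^4 = \frac{\mu_4(x)}{n^4 (\bar{x} - x^+)^4} = \frac{\mu_4(x)}{n^4 (x^+ - \bar{x})^4}.$$

The proof is similar when $x \in D$.

C8. The kurtosis of $u^+$ does not change after pattern normalisation:

$$K(u^+) \overset{\text{def}}{=} \frac{\mu_4(u^+)}{S^4(u^+)} = K(x).$$

D. Linear relation between variables after normalisation

Assume that two diagnostic variables $x_1$, $x_2$ are transformed with respect to their patterns $x_1^+$, $x_2^+$. Denote by $u_1^+$ and $u_2^+$ the variables after normalisation, and by $x_{i1}$, $x_{i2}$ the values of $x_1$, $x_2$ for the $i$-th object.

D1. The covariance between $u_1^+$ and $u_2^+$ equals:

$$\operatorname{cov}(u_1^+, u_2^+) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \left( u_{i1}^+ - \bar{u}_1^+ \right) \left( u_{i2}^+ - \bar{u}_2^+ \right) = \begin{cases} \dfrac{\operatorname{cov}(x_1, x_2)}{n^2\, |x_1^+ - \bar{x}_1|\, |x_2^+ - \bar{x}_2|} & \text{if } x_1, x_2 \in S \text{ or } x_1, x_2 \in D, \\[2ex] -\dfrac{\operatorname{cov}(x_1, x_2)}{n^2\, |x_1^+ - \bar{x}_1|\, |x_2^+ - \bar{x}_2|} & \text{otherwise.} \end{cases}$$

Proof. By formula (2) and C1,

$$\operatorname{cov}(u_1^+, u_2^+) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{|x_{i1} - x_1^+|}{\sum_{j=1}^{n} |x_{j1} - x_1^+|} - \frac{1}{n} \right) \left( \frac{|x_{i2} - x_2^+|}{\sum_{j=1}^{n} |x_{j2} - x_2^+|} - \frac{1}{n} \right).$$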

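Assume that $x_1$ and $x_2$ are stimulants. The proof in the other cases is similar.

$$\operatorname{cov}(u_1^+, u_2^+) = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_1^+ - x_{i1}}{\sum_{j=1}^{n} (x_1^+ - x_{j1})} - \frac{1}{n} \right) \left( \frac{x_2^+ - x_{i2}}{\sum_{j=1}^{n} (x_2^+ - x_{j2})} - \frac{1}{n} \right)$$

$$= \frac{1}{n^3} \sum_{i=1}^{n} \left( \frac{x_1^+ - x_{i1}}{x_1^+ - \bar{x}_1} - 1 \right) \left( \frac{x_2^+ - x_{i2}}{x_2^+ - \bar{x}_2} - 1 \right) = \frac{1}{n^3} \sum_{i=1}^{n} \frac{\bar{x}_1 - x_{i1}}{x_1^+ - \bar{x}_1} \cdot \frac{\bar{x}_2 - x_{i2}}{x_2^+ - \bar{x}_2}$$

$$= \frac{\frac{1}{n} \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2)}{n^2 (x_1^+ - \bar{x}_1)(x_2^+ - \bar{x}_2)} = \frac{\operatorname{cov}(x_1, x_2)}{n^2 (x_1^+ - \bar{x}_1)(x_2^+ - \bar{x}_2)}.$$

D2. The absolute value of the Pearson correlation coefficient of diagnostic variables is preserved after the normalisation:

$$\operatorname{corr}(u_1^+, u_2^+) \overset{\text{def}}{=} \frac{\operatorname{cov}(u_1^+, u_2^+)}{S(u_1^+)\, S(u_2^+)} = \begin{cases} \operatorname{corr}(x_1, x_2) & \text{if } x_1, x_2 \in S \text{ or } x_1, x_2 \in D, \\ -\operatorname{corr}(x_1, x_2) & \text{otherwise.} \end{cases}$$

E. Dynamic approach

Assume that the diagnostic variable $x$ is observed in two periods of time; we then write $x^1$ and $x^2$ respectively. For each period we choose a pattern and transform $x^1$ and $x^2$ into $u^{1+}$ and $u^{2+}$ according to formula (2).

E1. The values of the variables $u^{1+}$ and $u^{2+}$ are comparable.

Substantiation. The system is characterized by the sum of distances between the objects and the pattern. It changes over time. For a given object, if the value of the transformed variable increases over time, the share of the distance from this object to the pattern in the sum of all distances increases, so the situation of this object worsens in comparison with the situations of the other objects.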
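The shape properties C6, C8 and D2, and the dynamic comparison in E1, can be explored numerically. The sketch below is illustrative only: the two-period data for five regions are invented, and the helper names are assumptions introduced here. All moments use the divisor $n$.

```python
from math import isclose, sqrt


def pattern_normalise(x, stimulant):
    """Formula (2): share of each distance to the pattern in the total distance."""
    x_plus = max(x) if stimulant else min(x)
    distances = [abs(x_i - x_plus) for x_i in x]
    total = sum(distances)
    return [d / total for d in distances]


def central_moment(v, k):
    """k-th central moment with divisor n (descriptive approach)."""
    m = sum(v) / len(v)
    return sum((v_i - m) ** k for v_i in v) / len(v)


def skewness(v):
    return central_moment(v, 3) / central_moment(v, 2) ** 1.5


def kurtosis(v):
    return central_moment(v, 4) / central_moment(v, 2) ** 2


def correlation(a, b):
    m_a, m_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((a_i - m_a) * (b_i - m_b) for a_i, b_i in zip(a, b)) / len(a)
    return cov / sqrt(central_moment(a, 2) * central_moment(b, 2))


# Hypothetical data: five regions, two periods.
income_t1 = [2.0, 3.0, 5.0, 6.0, 10.0]       # stimulant, period 1
unemployment_t1 = [9.0, 8.0, 4.0, 5.0, 2.0]  # destimulant, period 1
income_t2 = [3.0, 3.0, 7.0, 9.0, 12.0]       # stimulant, period 2

u_inc_t1 = pattern_normalise(income_t1, stimulant=True)
u_une_t1 = pattern_normalise(unemployment_t1, stimulant=False)
u_inc_t2 = pattern_normalise(income_t2, stimulant=True)

# E1: change of each region's share of the total distance between periods.
change = [later - earlier for earlier, later in zip(u_inc_t1, u_inc_t2)]
```

In this example the second region has the same income (3.0) in both periods, yet its share of the total distance to the pattern rises from 7/24 to 9/26, so its relative position worsens, which is the kind of reading E1 describes. Comparing `skewness`, `kurtosis` and `correlation` before and after normalisation shows the sign changes stated in C6 and D2 and the invariance stated in C8.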

4 Summary

The normalisation of diagnostic variables described by formula (2) plays a double role in the construction of a synthetic measure. First, it unifies the nature of variables (A2). Secondly, it brings variables to comparability (A1). So, after pattern normalisation, diagnostic variables become comparable destimulants.

Normalisation with respect to the pattern preserves two important characteristics of the distribution of diagnostic variables: skewness, up to sign (C6), and kurtosis (C8). Moreover, this conversion does not disrupt the linear relation between variables: the absolute value of the Pearson correlation coefficient is not changed (D2). These advantages are expected of normalisations used for bringing variables to comparability.

Unlike other methods, pattern normalisation is not just a technical procedure; it has a clear interpretation. However, its major advantage over other normalisation methods appears in the dynamic approach. Although only current data are used to convert the variables, the transformed variables are comparable over time (E1).

Normalisation with respect to the pattern seems to be a useful tool in multidimensional comparative analysis. It can be applied whenever variables need to be comparable, for example in the synthetic analysis of a complex phenomenon. The proposed construction admits various modifications; for example, we can change the measure of distance or the method of choosing the pattern.

References

[1] FANCHETTE, S. 1972, "Synchronic and diachronic approaches in the Unesco project on human resources indicators - Wroclaw taxonomy and bivariate diachronic analysis", UNESCO document, SHS/WS/209, Paris.

[2] FREUDENBERG, M. 2003, "Composite Indicators of Country Performance: A Critical Assessment", OECD Science, Technology and Industry Working Papers, No. 2003/16, OECD Publishing, Paris.

[3] HELLWIG, Z. 1968, "Procedure of Evaluating High-Level Manpower Data and Typology of Countries by Means of the Taxonomic Method", unpublished UNESCO working paper, COM/WS/91, Paris.

[4] JAJUGA, K., WALESIAK, M. 2000, "Standardisation of Data Set Under Different Measurement Scales", in Classification and Information Processing at the Turn of the Millennium. Studies in Classification, Data Analysis, and Knowledge Organization, eds. Decker R., Gaul W., Springer-Verlag, Berlin, Heidelberg, 105-112.

[5] MILLIGAN, G.W., COOPER, M.C. 1988, "A Study of Standardization of Variables in Cluster Analysis", Journal of Classification 5, 181-204.

[6] MŁODAK, A. 2006, "Multilateral Normalisations of Diagnostic Features", Statistics in Transition 7(5), 1125-1139.

[7] MÜLLER-FRĄCZEK, I. 2017, "Propozycja miary syntetycznej" [Proposition of Synthetic Measure], Przegląd Statystyczny, 64(4), 413-428.

[8] STEINLEY, D. 2004, "Standardizing Variables in K-means Clustering", in Classification, Clustering, and Data Mining Applications. Studies in Classification, Data Analysis, and Knowledge Organisation, eds. Banks D., McMorris F.R., Arabie P., Gaul W., Springer, Berlin, Heidelberg.

[9] ZELIAŚ, A. 2002, "Some Notes on the Selection of Normalisation of Diagnostic Variables", Statistics in Transition 5(5), 787-802.