REGRESSION WITH TRUNCATED DATA


ROBUST 2004 © JČMF 2004

Marek Brabec

Keywords: Regression, truncated data, spectral estimation.

Abstract: In this paper, we present an example of the analysis of historical height data. First, we discuss some conceptual problems related to the fact that no representative surveys are available for the historical periods of interest (18th and 19th centuries). Then we state a statistical model that can be used to correct the available data (measurements of Swedish soldiers) for their selectivity with respect to the height distribution of the general population. The model is based on truncated normal regression. We concentrate on periodicity in the height data, whose time series spans more than a century. In addition to presenting some exploratory spectral estimates, we discuss some problems and features related to maximum likelihood estimation in the model.

1 Introduction

In this paper, we present an interesting practical problem encountered by anthropologists and historians. The goal is to estimate the dynamics of mean human population heights in the past from historical data (18th and 19th centuries). As one can expect, the task is complicated by the fact that informative data of high quality are difficult to obtain. In particular, no reasonably representative height surveys were organized then. Small samples are available from historical records here and there (e.g. family records of aristocrats), but their relevance for the population height distribution is dubious, to say the very least. On the other hand, large amounts of systematically and rather precisely measured data are available from military records. Since the soldiers came from various social strata, geographical areas etc., historians think that military drafts spatially covered the contemporary (healthy adult male) population to a reasonable extent, [8], [9]. Even if we take this for granted, one substantial problem remains, however.
To illustrate it, consider the fact that while recent adult population heights are known to follow the normal distribution rather closely, and there is no substantial reason why their historical counterparts should not behave similarly, the military sample shows consistently substantial positive skewness. Boxplots in Figure 1 demonstrate the situation for one particular subgroup of soldiers (born in Swedish rural areas) from the dataset that was collected by Richard Steckel and Lars Sandberg, [7], and which we analyze subsequently. Why is this happening? The answer is simple: the military sample, as it stands, is not representative of the healthy male population as a whole. It is highly selective in the sense that the army disliked short men. Its preference for taller soldiers was embodied in a simple policy: men shorter than a prescribed value (minimum height requirement, MHR) should not be drafted. Therefore, only those at or above the MHR should appear in the military sample, theoretically.

Figure 1: Height of rural-born soldiers, distributional summaries by birth year. (Axes: birth year; height in cm.)

Comparing the boxplots in Figure 1 with a presupposed MHR of 165.76 cm (horizontal line), we can see that while the policy really led to the apparent right skew, there are some data below the MHR as well. We checked with Prof. Komlos (an economic historian from the University of Munich, who introduced us to the substantive problem of historical height distribution estimation) and made sure that these are not coding errors. Although it was educational to learn that (even in the Kingdom of Sweden...) a few individuals managed to get into the draft while being (even seriously) below the MHR, one quickly realizes that while data obtained from these individuals can tell something about how effectively the policy was implemented, they do not tell much about the general male population distribution. Therefore, it became customary among historians to discard them and analyze only those above the MHR, [6].

While saturated (i.e. year-by-year) analyses for this data type have been attempted, [6] (including analyses of this particular dataset, [7]), the current interest of economic and anthropometric historians focuses on the analysis of some generalizable long-term properties of the yearly (indexed by birth year) height time series (like trends, periodicity etc.). Here, we focus on the periodicity properties (after some simple trend and inter-regional corrections). This was the task brought to us by our customer (Prof. Komlos), who was interested in getting a picture of the periodicity properties in an exploratory style. This interest is connected to the general attention that economic historians pay to height (and other anthropological variables that they see as possible indicators of the "biological standard of living"), see e.g. [11]. The ultimate idea is that biological changes (like height fluctuations) might (to some extent) reflect changes in economic conditions (e.g. through food prices and hence food availability) and hence can potentially be used as economic indicators more relevant to human well-being than traditional econometric characteristics like GDP. Although one can be a bit skeptical about this upshot, many economic historians take this route, see e.g. [4]. The present investigation was motivated by the somewhat more modest goal of comparing the periodic properties of the height series with those of some economic (e.g. food price) series, as a rough check of the sensibility of previous attempts to use heights as indicators (if those attempts are sensible, the periodicity properties should, for instance, be roughly similar).

Obviously, such comparisons lead straightforwardly to the need to estimate the spectra. Nevertheless, it is immediately clear that the fact that all the heights below the MHR are discarded precludes standard statistical/spectral analysis and calls for a model that corrects for this complication. We formulate one such model in Section 2.

2 Model

The available data consist of about 17 thousand measurements of adult Swedish army soldiers, 22 years or older at the time of measurement, born between 1711 and 1864. To assess the periodicity properties of the Swedish healthy male population height time series (indexed by birth year), we outlined the following simple (linear) model (1) with normally distributed errors.
Its form was proposed after a certain amount of data exploration and discussion with experts in anthropometric history:

$$Y_{tij} = \mu + \alpha_i + \beta t + \sum_{k=1}^{F} \left( \delta_{1k} \cos(2\pi t f_k) + \delta_{2k} \sin(2\pi t f_k) \right) + \epsilon_{tij} \qquad (1)$$

where:

$Y_{tij}$ is the height of the $j$-th man from the general Swedish population, born in year $t$ at the $i$-th birth location.

Time is indexed by birth year, $t = 1$ corresponds to 1711.

Birth location is indexed as $i = 0$ for unknown, $i = 1$ for rural, $i = 2$ for urban.

$\epsilon_{tij} \sim N(0, \sigma^2)$, independently across $t$, $i$, $j$.
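Because model (1) is linear in its parameters, each man contributes one row of regressors. The following is a minimal sketch of how such a row could be built, not the author's code. The names `frequency_grid` and `design_row` are hypothetical and the column order is arbitrary. The grid reads the frequencies that the paper lists for this model as k/75 with k = 1, 1.5, 2, 3, ..., 37, assuming that the elided middle entries are the integer multiples; this reading gives the 38 frequencies referred to in the results. Unknown birth location serves as the baseline, so it gets no dummy column.

```python
import math


def frequency_grid():
    """Frequencies k/75 for k = 1, 1.5, 2, 3, ..., 37 (assumed reading of the list)."""
    multipliers = [1.0, 1.5] + [float(k) for k in range(2, 38)]
    return [m / 75.0 for m in multipliers]


def design_row(t, location, freqs):
    """Regressors for one man born in year index t (t = 1 is 1711).

    location: 0 = unknown (baseline), 1 = rural, 2 = urban.
    Columns: intercept, rural dummy, urban dummy, linear trend,
    then a (cos, sin) pair for every frequency in freqs.
    """
    row = [1.0,
           1.0 if location == 1 else 0.0,
           1.0 if location == 2 else 0.0,
           float(t)]
    for f in freqs:
        row.append(math.cos(2.0 * math.pi * t * f))
        row.append(math.sin(2.0 * math.pi * t * f))
    return row


freqs = frequency_grid()
row = design_row(t=1, location=1, freqs=freqs)
print(len(freqs), len(row))
```

Stacking such rows for all men gives the regression matrix. With untruncated, population-representative data the coefficients could then be estimated by ordinary least squares.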

Due to its linear (additive) structure, the interpretation of the model is rather simple. It tries to assess the amount of variability associated with periodic movements of various frequencies $f_1, \ldots, f_F$ after correcting for possible birth-location differences and for a possible (common) linear trend (as a simple form of non-stationarity). It is precisely this correction, together with the fact that the data are not balanced (having different sample sizes for different birth years), which calls for a formulation in regression style and which precludes straightforward periodogram estimation based on standard estimators, [2].

Note that in its non-trigonometric part, the model resembles analysis of covariance. It fits a common linear trend shifted up or down differently at different birth locations, allowing for different average heights at rural/urban locations, a phenomenon which has been well documented for both historical and recent heights [5]. We use a particular parametrization with $\alpha_0 = 0$, which means that the mean height of men born at an unknown location (either rural or urban) is given by $\mu$ plus the linear and trigonometric terms in time. $\alpha_1$ corresponds to the difference between the mean heights of men born in the same year at rural and unknown locations. Similarly, $\alpha_2$ corresponds to the difference between urban and unknown locations.

In order to roughly mimic periodogram estimation of variability at Fourier frequencies, we choose the frequencies $f_k = \frac{1}{75}, \frac{1.5}{75}, \frac{2}{75}, \ldots, \frac{37}{75}$, which are close to what the Fourier frequencies would be if we had a single series of length 150 (with frequency ... added, based on preliminary data explorations).

Model (1) is nicely interpretable and would be easily fitted to data from historical population-representative surveys of Swedish male height. The only flaw is that, unfortunately, no such surveys were performed. The available military data are non-representative of the general male population due to the MHR enforcement (and the discarding of below-MHR measurements, see Section 1).
Nevertheless, the mis-representation can be corrected rather easily if we think of the military sample as a sample from the general population that is left-truncated at the MHR. Then, for the available military data $\tilde{Y}_{tij}$, we get the truncated regression model (2) from the original OLS model (1):

$$
\begin{aligned}
&\tilde{Y}_{tij} \text{ remains unobserved if } Y_{tij} < \tau_t, \\
&\tilde{Y}_{tij} = Y_{tij} \text{ if } Y_{tij} \geq \tau_t, \\
&Y_{tij} = \mu + \alpha_i + \beta t + \sum_{k=1}^{F} \left( \delta_{1k} \cos(2\pi t f_k) + \delta_{2k} \sin(2\pi t f_k) \right) + \epsilon_{tij}, \\
&\epsilon_{tij} \sim N(0, \sigma^2), \qquad \alpha_0 = 0. \qquad (2)
\end{aligned}
$$

Note that the MHR ($\tau_t$) can generally vary in time. Ideally, it should be known from military regulations. In practice, it is not completely known and has been expertly estimated by historians (Prof. Komlos). We have used both constant and time-varying MHR estimates and found that, in terms of the main goal of the analysis, i.e. a rough assessment of spectral shape, there are no substantial differences. The more careful version with time-varying MHR was nevertheless used to get the results in Section 3.
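Under left truncation, an observed height y at or above the MHR tau contributes the normal density of y divided by the probability that a general-population height reaches tau. The sketch below writes this log-likelihood down and maximizes it in the simplest possible setting. It is an illustration under stated assumptions, not the S-PLUS routine used in the paper: the name `truncated_loglik` is hypothetical, the data are synthetic, only a single mean is estimated with sigma held fixed, and a crude grid search stands in for a Newton-Raphson-like optimizer.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def truncated_loglik(heights, means, taus, sigma):
    """Log-likelihood of left-truncated normal observations.

    heights[n] is an observed height (at or above taus[n]), means[n] is the
    mean of the untruncated population for that man, sigma the common SD.
    """
    total = 0.0
    for y, m, tau in zip(heights, means, taus):
        if y < tau:
            raise ValueError("heights below the truncation point must be dropped first")
        z = (y - m) / sigma
        log_density = -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
        # log P(height >= tau) in the untruncated population; 1 - Phi(a) = Phi(-a).
        log_p_above = math.log(STD_NORMAL.cdf((m - tau) / sigma))
        total += log_density - log_p_above
    return total


# Synthetic illustration: all numbers are hypothetical except the 165.76 cm MHR.
rng = random.Random(0)
TRUE_MEAN, SIGMA, MHR = 170.0, 6.0, 165.76
sample = [h for h in (rng.gauss(TRUE_MEAN, SIGMA) for _ in range(2000)) if h >= MHR]
taus = [MHR] * len(sample)

# Grid search over the mean (160.0 to 180.0 in steps of 0.1), sigma fixed.
grid = [160.0 + 0.1 * step for step in range(201)]
mu_hat = max(grid, key=lambda m: truncated_loglik(sample, [m] * len(sample), taus, SIGMA))
naive_mean = sum(sample) / len(sample)
print(round(mu_hat, 1), round(naive_mean, 2))
```

The plain average of the truncated sample overestimates the population mean, while the maximizer of the truncated likelihood corrects for the missing lower tail. In the full model, `means` would hold the regression mean of each man and `taus` the MHR for his birth year.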

We use the maximum likelihood approach to estimate the unknown parameters $(\mu, \alpha_1, \alpha_2, \beta, \sigma^2, \delta_{11}, \ldots, \delta_{1F}, \delta_{21}, \ldots, \delta_{2F})$ of model (2). Because of truncation, the model becomes nonlinear and no explicit formulas for the MLEs are available. Therefore, we maximize the log-likelihood (which is still rather easy to write down) numerically, using a Newton-Raphson-like routine. We use S-PLUS, especially the censorreg environment, [10], for the necessary computations.

3 Results and discussion

Before discussing the periodicity properties, we did some tests of model (2). Namely, we used the asymptotic likelihood ratio test (LRT) to check whether: i) birth-location specific intercepts are necessary ($p <$ ... for $H_0: \alpha_1 = \alpha_2 = 0$), ii) a linear time trend is necessary ($p < 0.0001$ for $H_0: \beta = 0$). No opening for a substantial simplification of the model structure was detected here.

To assess periodicity, one can look at the ML estimates $\hat{\delta}_{1k}, \hat{\delta}_{2k}$, $k = 1, 2, \ldots, 38$, or, more conveniently, at $\hat{\gamma}_k = \hat{\delta}_{1k}^2 + \hat{\delta}_{2k}^2$, $k = 1, 2, \ldots, 38$ (or at their simple transformation to decibels, $10 \log_{10}(\hat{\gamma}_k)$). They are analogous to the periodogram estimate of the raw spectrum. Figure 2 compares the raw $\hat{\gamma}_k$'s (dots) with their smoothed versions (solid line for the smooth, dotted lines for pointwise 95% confidence interval limits). Smoothing is based on loess (locally linear) regression, namely its robust version, [3], with unequal weighting of the $\hat{\gamma}_k$'s according to their asymptotic variances, obtained as a byproduct of the MLE fitting procedure.

Figure 2: Spectrum estimate. (Axes: frequency; smoothed spectrum in dB.)

From there, we can see that the raw estimates fall reasonably within the confidence limits (although these are necessarily too narrow to be thought of as simultaneous confidence bands, since their construction guarantees only pointwise and not simultaneous coverage), except for estimates at two high frequencies. The overall shape of the spectrum is non-trivial. Low frequencies are especially prominent (most likely remnants of some long-term trend that is not estimated separately in model (2) and hence is confounded with the long-period trigonometric part). Frequencies around 0.25 (periods of about four years) seem to contribute less to the changes in the height series. On the other hand, the high-frequency part of the spectrum is substantially represented (especially periods shorter than, say, 3 years). This is interesting, since this picture resembles the results of [11], where spectral analysis was carried out on different height data (not truncated and collected in a completely different way, at different times and locations).

To get additional checks of the results concerning the overall spectral shape, we re-estimated it under several alternative model modifications, in the style of a sensitivity analysis.
We have tried: i) both time-varying and constant expert estimates of the MHR, ii) free and $\sigma^2$-restricted models ($\sigma^2 = \sigma_0^2$ with an externally expert-supplied $\sigma_0^2$, an approach that has been advocated in the past, [1]) as a way to circumvent the correlation between mean-related and scale parameter estimates introduced by truncation, iii) a combination of left truncation and additional interval censoring that can be suspected in connection with rounding, iv) simultaneous left and right truncation (to assess the possibility of over-representation of extremely tall men in the military sample in addition to the MHR complication), v) addition of a quadratic trend, vi) omission of any trend whatsoever. Variants i) through vi) influenced the non-trigonometric part of the model and the absolute values of the spectrum to various extents. Nevertheless, the spectrum shape remained remarkably similar and insensitive to the model perturbations considered.

In general, we note that while intercept-like coefficients ($\mu$, $\alpha_1$, $\alpha_2$) are rather difficult to estimate precisely, the slope-type coefficients ($\beta$, $\delta$'s) are estimated much more precisely. This is because the likelihood surface has a near-ridge, e.g. in the $\mu$-$\sigma^2$ plane (see Figure 3 for the situation with one particular birth year, 1751). Consequently, periodicity properties can be appreciated much better than the intercepts determining average height per se. The vertical shift is more uncertain, and hence it is much harder to answer questions about absolute height than questions about the shape of its changes in time.

Figure 3: Likelihood surface example. (Axes: mu; sigma.)

Apart from the overall (smoothed) spectrum estimate (which was required by the customer for exploratory purposes), one can think of testing the contributions of individual harmonic terms (e.g. $H_0: \delta_{1k} = \delta_{2k} = 0$). For simplicity, the screening tests can be performed as Wald tests (using the asymptotic variance-covariance matrix of the parameter estimates). Only a few components were significant, namely those corresponding to periods of 75, 50, 25 and ... years. The resulting reduced model (with the full non-trigonometric part and four harmonic components only) can be tested against the original model via the LRT, which yields $p = 0.47$. From this, one would get the impression that the original model can be dramatically simplified. We do not recommend such a simplification, however. Coefficient estimates for different sine and cosine terms are correlated here (unlike in classical Fourier analysis, [2]). This lack of orthogonality is due to the fact that the data are unbalanced, that we do not work exactly with Fourier frequencies, and to the truncation.

It is of practical interest that the smooth spectral density estimate $\hat{s}(f)$ can be integrated as

$$I_{(p_i, p_{i+1})} = \int_{1/p_{i+1}}^{1/p_i} \hat{s}(f)\, df$$

to get some idea about the amount of periodic variability in various period intervals $(p_i, p_{i+1})$. These can be standardized to get proportions. We have estimated the proportions of variability in five sub-intervals of the (2, 15) years interval (which has been intensively investigated in connection with the business cycle in the past, [11]). They are: $I_{(2,3)} = 0.58$, $I_{[3,5)} = 0.20$, $I_{[5,7)} = 0.09$, $I_{[7,10)} = 0.07$, $I_{[10,15)} = 0.06$, corresponding rather nicely to the proportions estimated in [11] by different methods under different circumstances.
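The standardized band integrals can be illustrated with a short numerical sketch. It assumes the spectrum is available as a function of frequency (in cycles per year) and uses a plain midpoint rule. The names `band_integral` and `band_proportions` are hypothetical, and the flat spectrum below is only a reference case, not the estimate from the paper.

```python
def band_integral(spectrum, period_low, period_high, steps=1000):
    """Midpoint-rule integral of spectrum(f) for f between 1/period_high and 1/period_low."""
    f_low, f_high = 1.0 / period_high, 1.0 / period_low
    width = (f_high - f_low) / steps
    return sum(spectrum(f_low + (n + 0.5) * width) for n in range(steps)) * width


def band_proportions(spectrum, period_edges):
    """Share of the integrated spectrum in each consecutive period interval."""
    integrals = [band_integral(spectrum, low, high)
                 for low, high in zip(period_edges[:-1], period_edges[1:])]
    total = sum(integrals)
    return [value / total for value in integrals]


# Period intervals (2,3), [3,5), [5,7), [7,10), [10,15), in years.
edges = [2, 3, 5, 7, 10, 15]
flat = band_proportions(lambda f: 1.0, edges)  # hypothetical flat spectrum
print([round(p, 3) for p in flat])
```

For a flat spectrum the shares are proportional to the widths of the frequency bands, so the (2, 3) band receives 5/13, about 0.38, of the total. This gives a simple reference point against which estimated proportions can be read.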

References

[1] A'Hearn B. (2004). A restricted maximum likelihood estimator for truncated height samples. Economics and Human Biology 2, 1, 5–19.
[2] Brockwell P.J., Davis R.A. (1991). Time series: Theory and methods. Springer, New York.
[3] Cleveland W.S., Devlin S.J. (1988). Locally-weighted regression: An approach to regression analysis by local fitting. J. Am. Statist. Assoc. 83, ...
[4] Easterlin R.A. (2000). The worldwide standard of living since ... J. Econ. Perspect. 14, 7–26.
[5] Floud R., Wachtel K., Gregory A. (1990). Height, health and history. Nutritional status in the United Kingdom, ... Cambridge University Press, Cambridge.
[6] Heintel M. (1996). Historical height samples with shortfall, a computational approach. History and Computing 8, 1, ...
[7] Heintel M., Sandberg L., Steckel R. (1998). Swedish historical heights revisited: New estimation techniques and results. Proceedings of the conference The biological standard of living in comparative perspective. Komlos J., Baten J. (eds.) Franz Steiner Verlag, Stuttgart.
[8] Komlos J. (1989). Nutrition and economic development in the eighteenth century Habsburg monarchy: An anthropometric history. Princeton University Press, Princeton, NJ.
[9] Komlos J. (1993). The secular trend in the biological standard of living in the United Kingdom, ... Economic History Review 46, ...
[10] Meeker W.Q., Duke S.D. (1981). CENSOR - A user-oriented computer program for life data analysis. The American Statistician 35, 2, 112.
[11] Woitek U. (2003). Height cycles in the 18th and 19th centuries. Economics and Human Biology 2, ...

Address: M. Brabec, Department of Biostatistics and Computing Services, National Institute of Public Health, Šrobárova 48, Praha 10, Czech Republic
mbrabec@szu.cz
