A longitudinal model for litter size with variance components and random dropouts


A longitudinal model for litter size with variance components and random dropouts

Claus Dethlefsen & Erik Jørgensen

Dina Research Report no. 50, 1996

A longitudinal model for litter size with variance components and random dropouts

Claus Dethlefsen*    Erik Jørgensen†

July 31, 1996

Abstract

In this case study, a model for longitudinal litter size data is considered. Data consist of up to 18 measurements of litter size for 7899 sows in 12 Danish herds. The litter sizes for a sow are modelled as a multivariate Gaussian response when selection is disregarded. The expected litter size at different parities for a specific herd is modelled as a parametric, nonlinear curve. The covariance of measurements within a sow is modelled with random effects, serial correlation and measurement error. Sows are subject to selection bias (culling), identified as random dropout. The dropout process is modelled as a Markov process with logistic regression on past observations. The chosen observation period introduces (non-informative) censoring. Parameter estimates are obtained by maximum likelihood estimation, which however gives computational problems. A simulation study is performed to investigate the effect of culling. It is concluded that the herds generally differ with respect to the mean, covariance and dropout structure.

Key words: Nonlinear mean; Random effects; Serial correlation; Normal distribution; Selection bias; Censoring; Culling; Logistic regression; Maximum likelihood estimation; Likelihood ratio test

* Department of Biometry and Informatics, Research Centre Foulum, PO Box 39, DK-8830 Denmark. E-mail: dethlef@iesdaucdk, WWW: ~dethlef
† Department of Biometry and Informatics, Research Centre Foulum, PO Box 39, DK-8830 Denmark. E-mail: ej@dinaspdk, WWW: ~ej

Preface

This paper is a product of a research project at Research Centre Foulum during the spring of 1996. The data have been extracted from the database of the Danish Applied Pig Research Scheme by the National Committee of Pig Breeding, Health and Production [Pedersen et al., 1995]. Herds are numbered randomly to ensure anonymity to the pig producers.

Acknowledgements: Statisticians at Research Centre Foulum and Aalborg University have participated in fruitful discussions of the work. Especially Søren Lundbye-Christensen deserves thanks for many useful comments during the writing of this paper.

1 Introduction

With today's large herds and competition in pig production, it is of great importance for the pig producers to make the right decisions at the right time to ensure a high economic return. This has created a need for systems which can aid in decision-making. Such systems can be incorporated within the framework of the Danish Integrated Farm Management System [Andersen et al., 1992].

In order to maximize the economic return, a key production trait is the number of pigs produced, i.e. the litter sizes of the sows in the production. One instrument to increase the litter size is to replace (cull) a sow with a younger sow when this is judged to yield a more feasible production. The problem of choosing the optimal strategy is called the animal replacement problem and is discussed in [Kristensen, 1993], where dynamic programming techniques are applied to a dairy cow replacement model under a milk quota. An application of dynamic programming to sow replacement is found in [Huirne et al., 1988].

Figure 1: A diagram of the states (introduction, mating, pregnancy, farrowing, culling) and their relations for a sow in pig production.

The lifetime of a sow in the pig production is sketched in Figure 1. The normal cycle is a repetition of the states mating, pregnancy and farrowing until the sow is culled. After an introduction, a sow enters this cycle, and when she shows signs of oestrus, the sow is mated and the pig producer determines whether or not the sow is pregnant. Approximately 20% of the sows do not get pregnant and, if detected, the pig producer decides whether to cull the sow at this point or whether to remate her. During the pregnancy (lasting approximately 115 days) only a few sows are culled. At the end of this period, the sow is placed in the farrowing department where the sow farrows. The pigs suckle for approximately 28 days, after which they are weaned. Another 5 days later, the sow can return to oestrus and will then be ready to re-enter the normal cycle. At this final point, it is decided whether or not to replace the sow. If the sow is culled, it does not return to the pig production; hence culling is an absorbing state. Examples of factors upon which the decision to cull a sow can depend are: the litter size at the last farrow as well as previous litter sizes, the age of the sow, and reproductive failure (i.e. no oestrus or no pregnancy).

Finally, the size of the herd or the overall age of the herd can affect whether a sow is culled.

The main goal of this investigation is, in a parsimonious way, to describe the evolution of litter sizes. The description must take into account that the average litter size will be biased due to the selection based on the sows' results. From this, a dynamic programming model can be used to suggest an optimal culling strategy so that the average litter size for a herd is maximized. This framework is thus one approach to solve the estimation problem mentioned in [Jørgensen, 1992, p. 19].

2 Initial data analysis

In the following the data are described. After a description of general features of the data, the exploration focuses on describing the mean, covariance and dropout/censoring structure, which will be the basis for the model stated in the next section. This model is analyzed in the succeeding sections.

2.1 Data Description

The data set consists of litter sizes (number of piglets born alive) observed at different parities (number of farrows) for sows in the production of pigs. The data are from 12 different pig production herds in Denmark, and the following attributes have been observed: day of farrowing, sow number, parity number and litter size. Some features of the herds are shown in Table 1. From this it is noted that the data set consists of many short time series, i.e. between 421 and 1188 sows in each herd and on average 3.4 observations per sow. Therefore we classify the data as longitudinal data. The two main reasons for the few observations per sow are high culling rates (40-50% per year, [Jørgensen, 1992]) and relatively short periods of observation. Although the periods range from 1.3 to 7.9 years, the duration of one cycle (farrowing interval) is approximately 148 days, so not all sows have all their cycles reflected in the observations.

Figure 2 displays the raw data. For each herd, the litter sizes for the sows are plotted against the corresponding parities. Data points are shown as dots, and data from certain sows are connected with lines. These sows are selected as having the median and the highest sum of litter sizes in the herd. Since the scales are discrete, a small random quantity has been added in both directions to each point in order to give an impression of the density.

Table 1: Summary statistics of the 12 herds: the herd number, the number of sows observed in the observation period, the number of observations, the average number of observations per sow, the length of the observation period (years) and the average litter size for the herd, with a total over all herds. Note that the number of sows is not the number of sows present at a given time, but the number observed during the entire observation period.

Figure 2: Plots of litter size versus parity for all sows in each herd (herds 1-12). Sows with median and maximum sum of litter sizes in each herd are shown with connected lines.

Note that the density of points is low for parities greater than 10.

We will assume that the observed litter sizes for sows at given parities are realizations of normally distributed random variables. The normal plots for herd 5 in Figure 3 show that we can assume the litter sizes for sows at a given parity to be random samples from normal distributions. Normal plots for other herds are similar.

Figure 3: Normal plot for each parity (1-12) of data from herd 5. Superimposed are the lines through the 1st and 3rd quartiles of the data and the corresponding quantiles of the standard normal distribution.

Independence between herds can be assumed, and all sows within a herd are of the same breed. We will assume independence between sows, but this can be questioned, since sows may belong to the same family, they may be mated by the same boar, and finally, since they are placed in the same sheds, they may have behavioural effects on each other. We expect correlation between measurements on the same sow, and homogeneity of the covariance structure will be assumed for sows in the same herd.

The sows in a herd are subject to a biased selection: the pig producer keeps the sows with the larger litter sizes in the production while culling the less productive sows. This influences the sample litter mean at a given parity, which will tend to be higher than the sample mean would be without the selection. The expected litter size when culling is not present will be called the true mean, and one aim of this analysis is to describe the evolution of the true mean as the parity increases.

The time series for a sample of 21 sows from herds 1 and 9 are illustrated in Figure 4. The plots on the left hand side show the litter size plotted against the day of observation, and those on the right hand side against parity. This corresponds to a real-time overview of the time series as opposed to a comparison of litter sizes at the same parity, the latter being the way we view the data. The average time distance between neighbouring litters is around 148 days and denotes the duration of one cycle (farrowing interval).

Figure 4: Time series for a sample of 21 sows from herds 1 and 9. The plots on the left show litter sizes plotted against the day of farrowing; the plots on the right show litter sizes for the same sows plotted against parity.

The time between two parities is not always the same, since the sow might not be detected as non-pregnant and so will re-enter the cycle, for example, 300 days after her last litter. However, equidistance will be assumed, as parity will be used as the time scale instead of days. The variation of litter size with calendar time is not considered an important source of variation.

2.2 Covariance structure

To get an impression of the dependency structure, the sample covariance matrices for the herds have been calculated. The entries are generally based on different numbers of observations, so corrections have been made to handle this. The sample covariance matrix for litter sizes at parities 1-10 in herd 1, with correlations displayed below the diagonal, has been computed. We observe that the variances in the diagonal are almost equal. By averaging the correlations in the subdiagonals, we find that the correlations decrease with the lag, indicating that an autoregressive term contributes to the variation. However, it is not obvious whether, e.g., an intra-class variance component is present. The sample covariance matrices for all herds have the same features as the one for herd 1. This makes it reasonable to assume that the covariance matrices for all sows in a herd can be described by the same structure.

Under the assumptions of normality and independence, Bartlett's test can be performed to test hypotheses concerning homogeneity of variance across parities. Although these assumptions are not fulfilled, we use the test as a guideline, knowing that Bartlett's test is very sensitive to deviations from the normality assumption. The hypotheses of equal variances, tested for one herd at a time, are generally rejected at a 5% significance level. A further examination reveals that the variance of the measurements at parity 1 is lower than the rest, and the hypotheses are generally accepted when these measurements are excluded. The hypothesis of homogeneity of variance across the herds is rejected, even when observations at parity 1 are excluded. Nevertheless, we will assume variance homogeneity.
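
The pairwise handling of entries based on different numbers of observations can be illustrated with the following sketch. It is not the authors' code (the original analysis was carried out in Matlab); the sows-by-parities array layout, with NaN marking unobserved litters, and the function names are assumptions made here.

```python
import numpy as np

def pairwise_cov(y):
    """Sample covariance matrix of litter size by parity, where each (j, k)
    entry uses only the sows observed at both parities (pairwise deletion).
    y: array of shape (n_sows, n_parities) with np.nan for missing litters."""
    n_par = y.shape[1]
    S = np.full((n_par, n_par), np.nan)
    for j in range(n_par):
        for k in range(j, n_par):
            both = ~np.isnan(y[:, j]) & ~np.isnan(y[:, k])
            if both.sum() >= 2:
                S[j, k] = S[k, j] = np.cov(y[both, j], y[both, k])[0, 1]
    return S

def with_correlations_below_diagonal(S):
    """Display helper: variances/covariances on and above the diagonal,
    correlations below it, as in the matrix discussed for herd 1."""
    sd = np.sqrt(np.diag(S))
    out = S.copy()
    rows, cols = np.tril_indices_from(S, k=-1)
    out[rows, cols] = S[rows, cols] / (sd[rows] * sd[cols])
    return out
```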

When considering repeated measurements, at least three sources of variation arise [Diggle et al., 1994, p. 79]. Let Y_ij be the litter size for sow i at parity t_j = j in a given herd. We assume that Y_ij can be decomposed as Y_ij = μ_ij + ε_ij, where μ_ij is a deterministic trend to be explored in Section 2.3, and ε_ij is a random, zero-mean component, described in the following. Suppose that ε_ij can be written as a sum of three independent terms,

    ε_ij = A_i + M_i(t_j) + Z_ij.    (1)

Then the three sources of variation can be interpreted as follows.

Random effects: Effects which arise from the characteristics of individual sows, combining the effect of genetic level and permanent environmental effect. We assume that A_i ~ N(0, ν²), i.i.d., and ν² thereby measures the variation between sows.

Serial correlation: Measurements taken close together in time are typically more strongly correlated than those taken further apart in time. We assume that this within-sow variation can be represented by an autoregressive process, where {M_i(t_j)} has autocovariance function γ(u) = σ² exp(−αu), with u the time difference. This corresponds to assuming {M_i(t_j)} to be AR(1). The autoregressive processes for sows within a herd are assumed to share the parameters α and σ².

Measurement error: This source of variation arises from short-term random influences, and we assume that Z_ij ~ N(0, τ²), i.i.d.

To examine whether these variance components are present, we use variograms; for an introduction to variograms, we refer to [Diggle et al., 1994, p. 51]. The theoretical variogram for the ith stationary process {ε_ij} in (1) can be expressed as

    V(u) = ½ E[{ε_i,j − ε_i,j−u}²] = σ²{1 − exp(−αu)} + τ²,

and the process variance is

    Var(ε_ij) = ν² + σ² + τ².

The sample variogram for herd 5, with the sample mean at each parity subtracted, is shown in Figure 5 and gives the impression that measurement error and serial correlation are present, while the presence of random effects can be questioned. Note, however, that due to the dropout process the response process is not stationary; therefore the variogram does not directly reflect the variance components.

Figure 5: Sample variogram for herd 5 with the sample mean for each parity subtracted.

This may explain why the sample variograms for some of the other herds are decreasing and cross the process variance. Due to this uncertainty, we choose to include all three components in the model stated in Section 3.

2.3 Exploration of mean structure

We will use the term sample mean curve for the sample mean of the litter sizes at each parity for one herd. Figure 6 shows the sample mean curves for each of the 12 herds. Generally, the curves are close together until parity 10, after which they fluctuate widely, a result of only a few observations beyond this point (3% of the total number of observations). Although the sample curves are decreasing towards the end of the observation period, they will tend to be higher than the true curves (without culling). This happens because the sows producing small litter sizes are replaced and do not contribute to the sample mean.

The sample mean curves all increase to a maximum at parity 4-5, after which a steady decrease is revealed. A polynomial fitted to the values would give high leverage to points with high parities, so the measurements at parities above 10 would have a strong influence on the shape of the polynomial. This can be avoided if a suitable nonlinear curve is used. Another feature of polynomials is that extrapolation beyond the limits of estimation is often very inadequate. In Section 3, we will use a nonlinear curve to describe the mean. This curve has the property that extrapolation slightly beyond the working range is consistent with the estimated curve in the working range.
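
Both the sample mean curves of this section and the sample variogram of Section 2.2 can be computed directly from such a sows-by-parities array. The following is a minimal sketch under the same assumed data layout as above; it is illustrative only and not the authors' implementation.

```python
import numpy as np

def sample_mean_curve(y):
    """Sample mean of litter size at each parity for one herd (ignoring NaNs)."""
    return np.nanmean(y, axis=0)

def sample_variogram(y):
    """Sample variogram of the residuals after subtracting the parity means:
    v(u) = average over sows and parity pairs at lag u of (r_ij - r_ik)^2 / 2."""
    resid = y - sample_mean_curve(y)
    n_par = y.shape[1]
    sums = np.zeros(n_par)
    counts = np.zeros(n_par)
    for j in range(n_par):
        for k in range(j + 1, n_par):
            u = k - j
            half_sq = 0.5 * (resid[:, j] - resid[:, k]) ** 2
            ok = ~np.isnan(half_sq)
            sums[u] += half_sq[ok].sum()
            counts[u] += ok.sum()
    with np.errstate(invalid="ignore"):
        return sums / counts      # entry u is the variogram at lag u; entry 0 is unused
```

The process variance can be estimated analogously as half the average squared difference between residuals from different sows, so that a plot like Figure 5 shows the variogram levelling off towards that value.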

Figure 6: Sample mean curves for each herd. The herd numbers are attached at the endpoints.

2.4 Missing values

We use the term missing values whenever some of the litter sizes of a sow have not been observed from parity 1 to parity 18, which is the period of interest. No sows have observations through the entire period. Figure 7 shows that for higher parities, fewer observations are present. Missing values occur in the data set for the following two reasons.

The term dropout is used in [Diggle and Kenward, 1994] when a subject leaves the study prematurely and does not return. However, to distinguish dropouts from censoring (described below), we will use the term synonymously with culling, i.e. a sow drops out if observations are missing due to culling. In [Diggle and Kenward, 1994] three types of dropouts are considered: completely random dropouts (CRD), random dropouts (RD) and informative dropouts (ID). The distinguishing feature is which measurements the probability of dropping out before the next observation is allowed to depend on. In the case of CRD, this probability does not depend on the observations, while in RD dependency on previous observations is incorporated. The most general type of dropout is ID, where the probability of dropout before the next observation can depend on both previous and future observations. In our case we only consider RD, since the culling strategy is expected to depend on previous litter sizes.

Another reason for the presence of missing values is censoring. Here, observations are missing because of the chosen observation period. Censoring can be classified as either left, right or double censoring.

Figure 7: Histogram of all observations grouped by parity. The number of observations decreases heavily with parity.

Left censoring occurs when measurements in the beginning of a time series are missing; these measurements are missing because the observation period begins while the production is already in progress. Right censoring occurs when measurements at the end of a time series are missing; this happens if the observation period ends before the sow is culled. Finally, double censoring occurs if both observations in the beginning and observations in the end are missing. Right censoring is to be distinguished from the case when a sow is subject to culling. However, the data set contains no information about the reason for missing values. Instead, all sows having measurements within the last 150 days of the observation period are considered right censored due to the end of the observation period; all other sows are considered culled. Table 2 shows the number of sows that are subject to the different kinds of censoring. Note that in our case the censoring can be specified as non-informative, since it is random which sows are censored.

We define a sample culling curve as the fraction of sows being culled at parity j+1, given the litter size at parity j, plotted against the litter size at parity j. The sample culling curves for the 12 herds are presented in Figure 8 and indicate that sows tend to be culled if their current litter size is small. Towards the endpoints the numbers of observations are small, and the curves thus fluctuate more. We will assume that the probability of culling at a given parity depends only on the previous observation of litter size (the Markov property).

Table 2: Number of sows from each herd subject to the different kinds of censoring (left, right, double, not censored), with totals.

Figure 8: Sample culling curves for the 12 herds. The number of observations decreases towards the endpoints.
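
A sketch of how the sample culling curve just defined might be computed is given below. The per-sow data structure, and the simplification of counting every observed litter as being at risk of culling before the next parity, are assumptions of this illustration rather than a description of the authors' procedure.

```python
from collections import defaultdict

def sample_culling_curve(litters, dropout_parity):
    """Fraction of sows culled immediately after a litter of a given size.
    litters: list of 1-d sequences; litters[i][j-1] is the litter size of sow i at parity j.
    dropout_parity: list; entry i is d_i, the parity at which sow i dropped out
                    (was culled), or None if the sow was right censored instead.
    Returns a dict mapping litter size to the fraction culled before the next parity."""
    culled = defaultdict(int)
    at_risk = defaultdict(int)
    for y, d in zip(litters, dropout_parity):
        for j, size in enumerate(y, start=1):
            at_risk[size] += 1
            if d is not None and d == j + 1:      # culled between parity j and j+1
                culled[size] += 1
    return {s: culled[s] / at_risk[s] for s in sorted(at_risk)}
```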

2.5 Summary

In the preceding sections, we have made the following observations and assumptions.

- Data consist of up to 18 measurements of litter size for 7899 sows in 12 Danish herds. We view this as many short time series and classify the data set as longitudinal data.
- The sample mean curve for a herd (cf. Figure 6) increases from parity 1, reaches a maximum at parity 4-5 and thereafter decreases steadily.
- The response, litter size, covers positive integers (bounded above), but we assume that the litter size for a sow at a given parity is a realization of a normally distributed random variable.
- The number of measurements per sow is small due to culling and censoring.
- The culling of a sow is a biased selection, and we expect the probability of culling to depend on the previous litter size. Censoring is non-informative.
- We assume that observations of litter sizes for sows in different herds as well as in the same herd are independent.
- We expect the successive observations of litter sizes for one sow to be dependent. The dependency structure is assumed to be the same for all sows within a herd.
- We assume that the variance structure for sows within a herd can be expressed in terms of random effects, serial correlation and measurement error.

3 The model

The model for litter size consists of three parts: a part describing the common mean for a herd, a part describing the common covariance structure for the sows in a herd, and a part describing the dropout process in a herd.

Consider a herd of m sows with up to n measurements per sow. In order to model the dropout process, we introduce a hypothetical variable Y*_ij, denoting the litter size for sow i at parity j which would have been observed if no missing values were present. Then Y*_i is an n-dimensional generic vector of measurements for the ith unit, and we assume that

    Y*_i ~ N_n(μ, V).

Similarly, Y_ij and Y_i denote the actual observations, with missing values coded as 0. Let d_i be the index of the first missing value due to culling in Y_i. Hence, d_i indicates the dropout time if 2 ≤ d_i ≤ n, whereas d_i = n+1 identifies no dropout. If, for example, the ith sow is not subject to censoring, the relationship between Y_ij and Y*_ij can be expressed as Y_ij = Y*_ij for j < d_i and Y_ij = 0 elsewhere. The common (intended) parities are denoted t_j = j, j = 1, ..., n.

The expected litter size for sow i at parity j (without missing values) is μ_ij = E(Y*_ij), which we have called the true mean in previous sections. Note that the sample mean curve for a herd with culling will differ from μ_ij; the sample mean curve will tend to be higher, since sows producing few pigs are likely to be culled. Letting μ_i = E(Y*_i), the trend will be modelled as a nonlinear curve so that

    μ(t) = β_1 exp(−β_2(t² − 1)) + β_3 + β_4 t.    (2)

This curve is suggested in [Jørgensen, 1992] and can be re-parametrized in terms of more easily interpretable parameters [Jørgensen, 1992, p. 10]. The expression in (2) is a combination of a straight line and a Gaussian curve: for certain parameter values, the straight line dominates for large parities, while the Gaussian term causes the curve to bend downwards for values near parity 1. We let β denote the vector of parameters used to describe the mean structure.

The common covariance matrix of Y*_i will be formulated parametrically with the three components described in Section 2.2. The covariance matrix can then be written as

    Var(Y*_i) = V = ν² J + σ² G + τ² I.

Here, J is a matrix with all its elements equal to 1, I is the identity matrix, and G is the matrix with components G_ij = exp(−α|t_i − t_j|). The parameters describing the covariance matrix are collected in the vector θ = (ν², σ², τ², α).

The dropout process will be parametrized as a logistic model of the form

    logit(p_j) = φ_1 + φ_2 y_{j−1},

where p_j is the conditional probability of dropout between parity j−1 and j, given previous observations. Hence, the dropout process is classified as random dropout [Diggle and Kenward, 1994, p. 50] and is assumed to satisfy the Markov property. The vector φ = (φ_1, φ_2) parametrizes the dropout process.
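
For concreteness, the three parts of the model -- the mean curve (2), the covariance matrix V and the logistic dropout probability -- can be written out as in the following sketch. The parameter orderings beta = (β_1, ..., β_4), theta = (ν², σ², τ², α) and phi = (φ_1, φ_2) are choices made here for illustration; the authors' original implementation was in Matlab.

```python
import numpy as np

def mean_curve(beta, t):
    """Nonlinear mean curve (2): a Gaussian-shaped term plus a straight line.
    t: numpy array of parities."""
    b1, b2, b3, b4 = beta
    return b1 * np.exp(-b2 * (t ** 2 - 1)) + b3 + b4 * t

def covariance(theta, t):
    """V = nu^2 J + sigma^2 G + tau^2 I with G_ij = exp(-alpha |t_i - t_j|).
    t: numpy array of parities."""
    nu2, sigma2, tau2, alpha = theta
    G = np.exp(-alpha * np.abs(t[:, None] - t[None, :]))
    return nu2 * np.ones((len(t), len(t))) + sigma2 * G + tau2 * np.eye(len(t))

def dropout_prob(phi, y_prev):
    """p_j: conditional probability of dropout given the previous litter size."""
    phi1, phi2 = phi
    return 1.0 / (1.0 + np.exp(-(phi1 + phi2 * np.asarray(y_prev, dtype=float))))
```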

4 Inference

Inference is based on the likelihood function, which will be derived for the different combinations of culling and censoring. The discrepancy between the maximized values of the likelihood functions for different models will then be used to test whether it is possible to reduce the starting model. To measure the variation of the estimated parameters, we consider the information matrix for β and a confidence band for the mean curve.

4.1 The likelihood function

The likelihood function resembles that of [Diggle and Kenward, 1994], except that censoring has to be considered. Due to the independence of sows, the likelihood function is a product of the individual contributions. Therefore we only need to consider the likelihood function for one sow, and we suppress the index i. The different situations arising from censoring and dropout will be considered separately.

Let f*_j(y_j) denote the univariate Gaussian density of Y*_j and f*(y_k^j) the multivariate Gaussian density of the kth to jth elements of Y*. If H_j = {y_1, ..., y_{j−1}} denotes the measurements up to and including parity j−1, let f*_j(y_j | H_j) denote the conditional univariate Gaussian density of y_j given H_j. Similarly, f_j(y_j), f(y_k^j) and f_j(y_j | H_j) are defined for the observed process, but these are not necessarily Gaussian densities. We note that if the only zeros for a sow are due to the dropout process,

    P(Y_j = 0 | H_j, Y_{j−1} = 0) = 1,
    P(Y_j = 0 | H_j, Y_{j−1} ≠ 0) = p_j(H_j),
    f_j(y_j | H_j) = {1 − p_j(H_j)} f*_j(y_j | H_j),  for Y_j ≠ 0.

In all cases, the likelihood function is calculated as follows. Firstly, the joint density is written as a product of the non-Gaussian univariate densities conditional on the past. Secondly, these conditionals are expressed in terms of the Gaussian conditional univariate densities of the hypothetical stochastic variables; this requires a factor of the probability of no culling. Finally, the Gaussian univariate conditional densities are multiplied to give a multivariate joint distribution. This chain of thought is illustrated in the first case.

No censoring and no culling, i.e. Y = (Y_1, Y_2, ..., Y_n). The sow has survived during the entire observation period.

This case is very unlikely, but it is an illustrative one. The joint density for this sow is given by

    f(y_1^n) = f_1(y_1) ∏_{j=2}^{n} f_j(y_j | H_j)
             = f*_1(y_1) ∏_{j=2}^{n} f*_j(y_j | H_j) {1 − p_j(H_j)}
             = f*(y_1^n) ∏_{j=2}^{n} {1 − p_j(H_j)}.

No censoring, culling, i.e. Y = (Y_1, Y_2, ..., Y_{d−1}, 0, ..., 0). The sow has been observed from the first parity until it was culled between measurements d−1 and d. The joint density for this sow is given by

    f(y_1^n) = f*(y_1^{d−1}) { ∏_{j=2}^{d−1} (1 − p_j(H_j)) } p_d(H_d).

Right censoring, no culling, i.e. Y = (Y_1, Y_2, ..., Y_{k−1}, 0, ..., 0). The case of no censoring and no culling is a special case of this situation. The sow has been observed from the first parity until the end of the observation period and has not been culled during the observations. The joint density for this sow is given by

    f(y_1^n) = f*(y_1^{k−1}) ∏_{j=2}^{k−1} (1 − p_j(H_j)).

Left censoring, no culling, i.e. Y = (0, ..., 0, Y_k, Y_{k+1}, ..., Y_n). At the beginning of the observation period, the previous measurements for the sow in the production have not been registered. The sow is not culled during the observation period. The joint density is

    f(y_k^n) = f*(y_k^n) ∏_{j=k}^{n} (1 − p_j(H_j)).

Since p_j depends on the previous values and these are not observed for all parities, special care must be taken if p_j is to be estimated. The solution we choose is to consider the product from k+1 to n instead. This way we keep as much information as possible. Although it seems odd not to use the same measurements in all parts of the inference, we believe that this approach is justified, since inference for (β, θ) turns out to be independent of inference for φ.

Left censoring and culling, i.e. Y = (0, ..., 0, Y_k, Y_{k+1}, ..., Y_{d−1}, 0, ..., 0). The litter sizes of the sow have not been measured in the beginning of the sow's production. Measurements are registered until the sow is culled.

    f(y_k^n) = f*(y_k^{d−1}) { ∏_{j=k}^{d−1} (1 − p_j(H_j)) } p_d(H_d).

As in the previous item, we choose to start the product at k+1 instead, since the history at parity k has not been observed.

Double censoring, no culling, i.e. Y = (0, ..., 0, Y_k, Y_{k+1}, ..., Y_l, 0, ..., 0). The observation period is short compared to the lifetime of the sow, so that neither the first observations nor the last observations have been registered.

    f(y_k^l) = f*(y_k^l) ∏_{j=k}^{l} (1 − p_j(H_j)).

As in the other cases of left censoring, the product is started at k+1 in our situation.

In all cases, the joint densities can be split into a term depending only on β and θ and a term involving only φ. Therefore, the overall log-likelihood function for a herd can be written as

    ℓ(β, θ, φ) = ℓ_1(β, θ) + ℓ_2(φ),

which is the special case of the situation in [Diggle and Kenward, 1994] arising under the assumption of random dropout.

Estimation of the parameters can be accomplished by maximizing the overall log-likelihood function. This can be done as suggested in [Diggle and Kenward, 1994] by using the simplex algorithm described in [Nelder and Mead, 1965], a version of which is implemented in Matlab. The maximization is simplified by maximizing ℓ_1 and ℓ_2 separately. For technical reasons, we have transformed the parameters ν², σ², τ² and α, since these should be positive. Note also that inference about β and θ (the Y*-process) can be done by using ℓ_1 only.

Tests for reduction of a model are carried out as in [Diggle and Kenward, 1994] by using the likelihood ratio test, −2 log Q ∼ χ²(df) approximately, where df is the difference in dimension of the parameter spaces of the competing models and Q is the ratio of the maximized likelihood functions. Note that we place a dot over a relation symbol, as in ≐, to indicate that the relation only holds approximately.
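
The per-sow likelihood contributions above can be sketched as follows for a sow that is not left censored; for a left-censored sow the same code applies to the observed parities, with the dropout product started one observation later as described in the text. The sketch reuses mean_curve, covariance and dropout_prob from the Section 3 sketch, and the interface is an assumption made here, not the authors' implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sow_loglik(beta, theta, phi, y, parities, culled):
    """Log-likelihood contribution of one sow, split as (l1, l2).
    y:        observed litter sizes (no zeros), in parity order
    parities: the corresponding parities
    culled:   True if the sow dropped out after its last observation,
              False if it was right censored by the end of the observation period."""
    t = np.asarray(parities, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = mean_curve(beta, t)                              # from the Section 3 sketch
    V = covariance(theta, t)
    l1 = multivariate_normal.logpdf(y, mean=mu, cov=V)    # f*(.) for the observed block

    # Dropout part: survive each observed transition, then (if culled) drop out.
    p = dropout_prob(phi, y[:-1])                         # p_j for the transitions into y[1:]
    l2 = np.log(1.0 - p).sum()
    if culled:
        l2 += float(np.log(dropout_prob(phi, y[-1])))     # p_d(H_d)
    return l1, l2
```

The herd log-likelihood is the sum of l1 + l2 over sows; since the two terms share no parameters, ℓ_1 and ℓ_2 can be maximized separately, for example with a derivative-free simplex routine such as scipy.optimize.minimize(..., method="Nelder-Mead"), which plays the role of the Matlab fmins function used by the authors (Section 6).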

4.2 Standard errors for parameters

The Fisher information for β based on all sows is the sum of the Fisher information from each sow. Further, as inference for β (and θ) is independent of inference for φ, we need only consider the multivariate Gaussian factor in the expression of the likelihood function for one sow. Censoring is handled as above by considering only the relevant entries in μ and V for each sow. To simplify notation, we disregard censoring and consider the stochastic vector of litter sizes without culling for one sow, say Y* ~ N_k(μ(β), V(θ)). From this, the relevant part of the log-likelihood function for this sow is given by

    ℓ_1(β, θ) = constant − ½ (Y* − μ(β))' V⁻¹(θ) (Y* − μ(β)),

from which it can be deduced that the Fisher information for this sow is given by

    i(β) = E[−∂²ℓ_1 / (∂β_i ∂β_j)] = D' V⁻¹(θ) D,    (3)

where D is the matrix [∂μ_i/∂β_j]. In our situation this matrix has ith row equal to

    d_i' = [exp(−β_2(t_i² − 1)),  −β_1(t_i² − 1) exp(−β_2(t_i² − 1)),  1,  t_i].

It can be shown that in our case the estimates of β and θ are asymptotically independent, and therefore we do not need to consider the full information matrix for β and θ as described in [McCullagh and Nelder, 1989, p. 472]. The information matrix I for the p-dimensional β based on all sows can be used to calculate the sample covariance matrix for β̂, since the following relation holds approximately [Dobson, 1990, p. 53]:

    β̂ ∼ N_p(β, I⁻¹)  (approximately),    (4)

from which it follows that

    (β̂ − β)' I (β̂ − β) ∼ χ²(p)  (approximately).    (5)

The statistic (β̂ − β_0)' I (β̂ − β_0) is called the Wald test statistic, applicable for testing the hypothesis β = β_0.
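
A sketch of the per-herd Fisher information (3) and the Wald statistic (5), with the derivative rows d_i written out for the mean curve (2); covariance() is reused from the Section 3 sketch, and the code is illustrative rather than the authors' implementation.

```python
import numpy as np

def mean_derivatives(beta, t):
    """Matrix D with rows d_i' = [dmu/dbeta_1, ..., dmu/dbeta_4] at parities t."""
    b1, b2, _, _ = beta
    e = np.exp(-b2 * (t ** 2 - 1))
    return np.column_stack([e, -b1 * (t ** 2 - 1) * e, np.ones_like(t), t])

def fisher_information(beta, theta, parity_sets):
    """Sum over sows of the contributions D' V^{-1} D in (3).
    parity_sets: list of arrays, the parities actually observed for each sow."""
    p = len(beta)
    info = np.zeros((p, p))
    for parities in parity_sets:
        t = np.asarray(parities, dtype=float)
        D = mean_derivatives(beta, t)
        V = covariance(theta, t)                  # from the Section 3 sketch
        info += D.T @ np.linalg.solve(V, D)
    return info

def wald_statistic(beta_hat, beta0, info):
    """(beta_hat - beta0)' I (beta_hat - beta0); approximately chi^2(p) under H0, cf. (5)."""
    d = np.asarray(beta_hat, dtype=float) - np.asarray(beta0, dtype=float)
    return float(d @ info @ d)
```

The pointwise standard error of the fitted mean curve at parity t_i, needed for the reference curves in Section 4.3, is then obtained as the square root of d_i' I⁻¹ d_i.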

4.3 Reference curves

One application of the sample covariance matrix for the estimates of the mean parameters is the construction of confidence limits around the estimated mean curve. This can be accomplished by the following first-order Taylor expansion of the mean curve around the true value,

    μ(β̂; t_i) ≐ μ(β; t_i) + Σ_{k=1}^{p} (∂μ_i/∂β_k)(β̂_k − β_k) = μ(β; t_i) + d_i'(β̂ − β),

defining d_i' as the row vector [∂μ_i/∂β']. Then, using (4),

    E[μ(β̂; t_i)] ≐ μ(β; t_i),    Var[μ(β̂; t_i)] ≐ d_i' I⁻¹ d_i,

from which an approximate confidence band of the form μ(β̂; t_i) ± k (d_i' I⁻¹ d_i)^{1/2} follows, with k a suitable standard normal quantile.

5 Simulation study

A simulation study has been performed to investigate how culling affects the sample mean curves and the variograms, and to see if the estimation method gives the correct parameter estimates. Three culling strategies were chosen, and for each strategy 50 simulated herds with 500 sows and up to 15 measurements per sow were constructed. The simulated herds have not been subject to any kind of censoring, since censoring is classified as non-informative in our situation and thereby does not affect the parameter estimates. The mean and variance parameters in all the performed simulations are similar to those estimated from the real herds, and we have rounded the simulated litter sizes to integers to enhance the similarity with the real herds:

    β = (3.37, 0.13, 13.39, 0.36)',    θ = (0.35, 1.68, 5.28, 0.10)'.

In the first two culling strategies, the parameters for the culling curves were chosen to show the extremes of no culling and heavy culling. The latter refers to the simple culling strategy where a given litter size is the turning point between culling and no culling; thereby only little randomness is incorporated into the culling strategy. In addition to the two extreme culling policies, a moderate culling strategy is included to reflect a situation similar to the real herds. The parameters are chosen as follows:

- No culling.
- Heavy culling, φ = (19, −2.2).
- Moderate culling, φ = (0.12, −0.15).

After estimation, three herds have been picked out to act as representatives of the three groups. The results from these three simulated herds are depicted in Figure 9.

Figure 9: Estimated (solid lines), true (dotted lines) and sample (dots) mean curves, variograms and culling curves for three simulated herds with different rates of culling (no culling, heavy culling, moderate culling).

The best resemblance between the true (dotted lines) and estimated (solid lines) curves is in the cases of no culling and moderate culling. In the case of heavy culling, the mean curve and variogram are overestimated, while the estimated culling curve is very close to the true one. The dots are the sample versions of the curves, and especially in the case of heavy culling the effect of culling is clear. Here, the sample mean curve is higher than it would have been without culling, as only the highly productive sows are kept. A mean curve obtained by ordinary least squares estimation would lie very close to the sample curve and thereby overestimate the mean curve (see also [Diggle and Kenward, 1994]).

The overall impression of all the simulations is that the case of no culling always leads to parameter estimates close to the true values. When any form of culling is present, the estimated culling curves and mean curves are mostly close to the true curves, though the mean curve in some herds is overestimated. Wald's test in (5) has been applied in each of the 150 herds to test whether the parameter estimates for the mean equal the true mean parameters. All these tests yielded p-values above a 5% significance level. From this we conclude that the estimation method does not introduce bias when estimating the parameters for the mean.

The variograms, however, fluctuate quite a bit when culling is present. Although the process variance generally is well estimated, the parameters describing the individual sources of variation are not well estimated in all cases. Figure 10 shows extreme examples of the bad fit in cases of culling. Since the variance components are well estimated in the case of no culling, the deviations from the true values must be explained by a combination of the complexity of the covariance structure and the relatively small number of measurements when culling is present. The process variance is well estimated in all cases, so the difficulty lies in the separation of the variation into the three components.

The large number of simulations makes it possible to estimate the true covariance matrix of β̂ by forming the empirical covariance matrix of the estimated parameters. This can then be compared with the inverse information matrix, obtained by summing the contributions from the individual sows in (3). As an example, the empirical matrix S and the inverse information matrix I⁻¹ have been compared for the simulated herd with moderate culling from Figure 9, with correlations displayed below the diagonal. Comparing these matrices, we see that the correlations are of the same magnitude, although the covariances differ quite a bit. The approximation of the covariance matrix for β̂ can thus only be used as a guideline to see whether correlation between parameters is present. The high correlations are induced by the nonlinear curve chosen to model the mean; possibly a reparametrization as in [Jørgensen, 1992, p. 10] could reduce the correlations.

The simulation study has revealed that the estimation method yields parameter estimates for the mean and culling curve that are very close to the true parameters.

Figure 10: Estimated (solid lines), true (dotted lines) and sample (dots) mean curves, variograms and culling curves for three simulated herds with different rates of culling (heavy culling, heavy culling, moderate culling). These are extreme examples of bad fit.
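
The simulation set-up can be sketched along the following lines: litter sizes for one herd are generated from the Gaussian model of Section 3 (random sow effect, stationary AR(1) serial component and measurement error), rounded to integers, and a logistic culling rule removes sows between parities; no censoring is applied. The random-number details and the function interface are assumptions of this sketch, which reuses mean_curve and dropout_prob from the Section 3 sketch.

```python
import numpy as np

def simulate_herd(beta, theta, phi, n_sows=500, n_parities=15, seed=0):
    """Simulate one herd; returns an (n_sows, n_parities) array of litter sizes
    with np.nan after dropout. phi=None means no culling."""
    rng = np.random.default_rng(seed)
    nu2, sigma2, tau2, alpha = theta
    t = np.arange(1, n_parities + 1, dtype=float)
    mu = mean_curve(beta, t)                           # from the Section 3 sketch
    rho = np.exp(-alpha)                               # AR(1) correlation at lag 1
    y = np.full((n_sows, n_parities), np.nan)
    for i in range(n_sows):
        a = rng.normal(0.0, np.sqrt(nu2))              # random sow effect A_i
        m = rng.normal(0.0, np.sqrt(sigma2))           # stationary start of M_i
        for j in range(n_parities):
            if j > 0:                                  # AR(1) update keeps Var(M_i) = sigma^2
                m = rho * m + rng.normal(0.0, np.sqrt(sigma2 * (1.0 - rho ** 2)))
            y[i, j] = np.round(mu[j] + a + m + rng.normal(0.0, np.sqrt(tau2)))
            if phi is not None and rng.random() < float(dropout_prob(phi, y[i, j])):
                break                                  # sow culled before the next parity
    return y
```

Calling this with the culling parameters listed above, and recomputing sample mean curves, variograms and culling curves from the output, gives simulated herds of the kind summarized in Figures 9 and 10.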

The parameters used to describe the variance components are mainly estimated close to the true values, but may fluctuate when culling is present. The difficulty lies in separating the different components, as the process variance is generally well estimated. Culling affects the sample mean curves and sample variograms in the way that these sample curves tend to lie above the true curves.

6 Data analysis

We have estimated various models of the form described in Section 3. This has essentially been done by using the function fmins in Matlab to maximize the log-likelihood functions with the simplex method. Functions from the library OSWALD for S-Plus as well as SAS procedures either fail or do not have the features for simultaneously handling the nonlinear mean structure, the complicated covariance structure and the large fraction of missing values. Further, an implementation in BUGS revealed that no convergence was reached after 1500 iterations (15 hours on a Sun Sparc 10 station) on one herd, and therefore this approach was considered too time consuming. The current implementation in Matlab is very flexible, as it is possible to specify almost any mean structure and covariance structure. However, the maximization of the log-likelihood function is very slow (around 1 hour per herd per model on a Sun Sparc 10 station). This is partially caused by the complexity of the problem and the large data set, but also by the method of maximization: the simplex method is robust but not very efficient. Other methods are described in [Press et al., 1988, ch. 10] in terms of implementation in C and might speed up the estimation process.

The starting model is the one described in Section 3, assigning a set of parameters to each herd. Thus one herd has 4 parameters to describe the mean, 4 to describe the variance components and 2 to describe the Markovian dropout process. The result of the parameter estimation is presented in Table 3, and the curves for three selected herds are shown in Figure 11.

The parameter estimates roughly divide the herds into three groups; the three herds in Figure 11 represent these groups. Herd 1 shows the characteristics of the group consisting of herds 1, 2, 5, 8 and 11. These herds have a relatively large number of observations per sow, so the confidence bands around the estimated mean curve are narrow. The estimated variograms show that the variance components for these herds are separated into a large measurement error, a correlation between measurements on the same sow that decreases slowly with the lag, and random effects varying in size from herd to herd. The second group includes herds 3, 4, 6, 7, 10 and 12. These have relatively few measurements per sow, and the variance components consist of a small measurement error, a rapidly decreasing correlation between measurements on the same sow, and random effects.

Figure 11: Estimated (solid lines) and sample (dots) mean curves, variograms and culling curves for herds 1, 3 and 9. Confidence bands are superimposed around the estimated mean curves.

Table 3: Parameter estimates for the 12 herds: β are the parameters for the mean, θ = (ν², σ², τ², α) the parameters for the variance components and φ the parameters for the dropout process.

The last group has the sole element of herd 9, which has extremely few measurements per sow. Serial correlation is not present in the variogram, and the confidence band around the mean curve is very wide. The approximate covariance matrix for the mean parameters for herd 1, with correlations below the diagonal, has also been computed. It is similar to the covariance matrices examined in Section 5, and again we see a high correlation between the parameters.

We have estimated various submodels in order to test whether a reduction of the starting model is possible. As explained in Section 4.1, the log-likelihood function splits into a term involving only (β, θ) and a term depending only on φ. Therefore, we have considered submodels for these two terms separately. As to the first term, we have estimated models with combinations of common/individual β and θ parameters and models without serial correlation. For the second term, we have considered models with combinations of common φ_1 and φ_2 parameters and models without φ_2.

All likelihood ratio tests for such reductions of the starting model resulted in p-values of zero. We conclude that the starting model is also the final model.

When the number of observations per sow is small, difficulties arise especially in estimating the variance components. This may be the reason why the herds roughly separate into three groups on the basis of their estimated variance parameters. From Table 1 we see that, with a few exceptions, the groups can also be classified according to the number of measurements per sow. This indicates that the number of observations per sow affects how well the variance parameters are estimated.

7 Conclusion

The mean curves for the 12 herds differ, and 4 parameters are used to describe the evolution of the expected litter size for each herd. All three variance components, described by 4 parameters, are present, and the herds have different covariance matrices. Finally, the dropout process for each herd is described by 2 parameters and is a Markov process. We find that this gives an adequate, parsimonious description of the evolution of litter sizes for a herd, and, as this description copes with the biased selection, we believe that the goal of the investigation has been reached. For future applications, we suggest that the method of maximizing the likelihood is improved to speed up the estimation.

These results are based on the assumption that the model described in Section 3 is true. This has been verified by superimposing the fitted curves on plots of the sample curves as in Figure 11. These plots have been very difficult to interpret due to culling, and therefore a simulation study has been performed. This revealed that culling affects the sample mean curve so that it lies above the estimated curve. On this background we have accepted the starting model, which also turned out to be the final model, as the various hypotheses of reduction were rejected.

It is assumed that the observations of a sow are multivariate normally distributed, independent of measurements on other sows. The litter sizes are measured on a discrete scale, but are nevertheless assumed to be realizations of (continuous) normally distributed random variables. The mean is common for the sows in the same herd and can be described by the nonlinear parametric curve in (2). This curve is chosen empirically and induces high correlation between the parameters, cf. the approximate covariance matrices examined in Sections 5 and 6. Three sources of variation are considered: random effects, serial correlation and measurement error. All components are needed to describe the common covariance matrix for sows in a herd, though the estimated parameters vary from herd to herd. This can possibly be due to the culling and censoring that cause the number of observations per sow to be low in some herds.

The simulation study indicated that difficulties in estimating the variance parameters arise when culling is present. The dropout process is modelled as a Markov process, so that the probability that a sow drops out at a given parity depends on the litter size at the previous parity. The low number of observations per sow implies that existing statistical procedures in S-Plus, SAS and BUGS cannot estimate the desired parameters. Therefore the parameters have been estimated by use of Matlab, although the maximization of the likelihood function is very slow. Finally, the simulation study revealed that the estimation method does not introduce bias in the parameters for the mean.

We find that this paper raises the following statistical problems, which could be the subject of further research.

- Exchange the multivariate Gaussian density function in Section 4.1 with other densities. Since the response variable is measured on a positive discrete scale, it may be natural to use, e.g., a Poisson distribution. However, this will raise new problems, for example how to model the dependency structure.

- Examine further the relation between the relative number of observations and the difficulty in estimating the variance parameters. We have indicated that when many missing values are present, it is difficult to estimate the variance parameters; possibly this could be stated more precisely.

- Calculate predicted values for A_i and M_i(t_j) in order to establish residuals r_ij = y_ij − (μ̂_ij + Â_i + M̂_i(t_j)) and enable further model checking; one possible construction is sketched below.

- Find a reparametrization of the nonlinear mean curve in (2) so that the correlation between the parameters for the mean is reduced.
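
For the third item above, one possible construction (not part of the original analysis) is to predict A_i and M_i(t_j) by their conditional means given the observed vector under the fitted Gaussian model, and to form the residuals from these predictions. The following sketch assumes the fitted parameters and reuses mean_curve from the Section 3 sketch.

```python
import numpy as np

def predict_components(beta, theta, y, parities):
    """Conditional means E[A_i | y] and E[M_i(t_j) | y] for one sow under the
    fitted model, and the residuals r_ij = y_ij - (mu_ij + A_hat + M_hat(t_j))."""
    nu2, sigma2, tau2, alpha = theta
    t = np.asarray(parities, dtype=float)
    y = np.asarray(y, dtype=float)
    mu = mean_curve(beta, t)                            # from the Section 3 sketch
    G = np.exp(-alpha * np.abs(t[:, None] - t[None, :]))
    V = nu2 * np.ones_like(G) + sigma2 * G + tau2 * np.eye(len(t))
    w = np.linalg.solve(V, y - mu)                      # V^{-1} (y - mu)
    a_hat = nu2 * w.sum()                               # Cov(A_i, Y) V^{-1} (y - mu)
    m_hat = sigma2 * (G @ w)                            # Cov(M_i(t_j), Y) V^{-1} (y - mu)
    return a_hat, m_hat, y - (mu + a_hat + m_hat)
```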

References

[Andersen et al., 1992] E. Andersen, H. P. Bay, N. Bloch, P. M. Dyrvig, N. B. Jensen, J. J. Jørgensen, E. Maegaard, and C. Gottlieb-Petersen. Integrated Farm Management Systems in Denmark: developed by the agricultural advisory service and used by farmers and advisers. In Proceedings of the 4th International Congress for Computer Technology in Agriculture, Paris-Versailles, 1st-3rd of June 1992, 1992.

[Diggle and Kenward, 1994] P. Diggle and M. G. Kenward. Informative dropout in longitudinal data analysis. Applied Statistics, 43:49-93, 1994.

[Diggle et al., 1994] P. Diggle, Kung-Yee Liang, and Scott L. Zeger. Analysis of Longitudinal Data. Oxford Science Publications, 1994.

[Dobson, 1990] A. J. Dobson. An Introduction to Generalized Linear Models. Chapman & Hall, 1990.

[Huirne et al., 1988] R. B. M. Huirne, A. A. Dijkhuizen, G. W. J. Giesen, and Th. H. B. Hendriks. Economic optimization of sow replacement decisions by the method of stochastic dynamic programming. Journal of Agricultural Economics, 39, 1988.

[Jørgensen, 1992] Erik Jørgensen. Sow replacement: Reduction of state space in the dynamic programming model and evaluation of benefit from using the model. Dina Research Report 6, 21 pp., Department for Research in Pigs and Horses, National Institute of Animal Science, Research Centre Foulum, 1992.

[Kristensen, 1993] A. R. Kristensen. Bayesian updating in hierarchic Markov processes applied to the animal replacement problem. European Review of Agricultural Economics, 20, 1993.

[McCullagh and Nelder, 1989] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, 2nd edition, 1989.

[Nelder and Mead, 1965] J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308-313, 1965.

[Pedersen et al., 1995] B. K. Pedersen, V. Ruby, and E. Jørgensen. Organization and application of research and development in commercial pig herds: The Danish approach. In D. P. Hennessy and P. D. Cranwell, editors, Manipulating Pig Production V. Proceedings of the Australasian Pig Science Association (APSA), Canberra, November 26 to 29, 1995, 1995.

[Press et al., 1988] William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling. Numerical Recipes in C. Cambridge University Press, 1988.


More information

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω

TAKEHOME FINAL EXAM e iω e 2iω e iω e 2iω ECO 513 Spring 2015 TAKEHOME FINAL EXAM (1) Suppose the univariate stochastic process y is ARMA(2,2) of the following form: y t = 1.6974y t 1.9604y t 2 + ε t 1.6628ε t 1 +.9216ε t 2, (1) where ε is i.i.d.

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

Module 6: Model Diagnostics

Module 6: Model Diagnostics St@tmaster 02429/MIXED LINEAR MODELS PREPARED BY THE STATISTICS GROUPS AT IMM, DTU AND KU-LIFE Module 6: Model Diagnostics 6.1 Introduction............................... 1 6.2 Linear model diagnostics........................

More information

A. Motivation To motivate the analysis of variance framework, we consider the following example.

A. Motivation To motivate the analysis of variance framework, we consider the following example. 9.07 ntroduction to Statistics for Brain and Cognitive Sciences Emery N. Brown Lecture 14: Analysis of Variance. Objectives Understand analysis of variance as a special case of the linear model. Understand

More information

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley

Time Series Models and Inference. James L. Powell Department of Economics University of California, Berkeley Time Series Models and Inference James L. Powell Department of Economics University of California, Berkeley Overview In contrast to the classical linear regression model, in which the components of the

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation Biost 58 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 5: Review Purpose of Statistics Statistics is about science (Science in the broadest

More information

Econ 423 Lecture Notes: Additional Topics in Time Series 1

Econ 423 Lecture Notes: Additional Topics in Time Series 1 Econ 423 Lecture Notes: Additional Topics in Time Series 1 John C. Chao April 25, 2017 1 These notes are based in large part on Chapter 16 of Stock and Watson (2011). They are for instructional purposes

More information

Do not copy, post, or distribute

Do not copy, post, or distribute 14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

More information

LECTURE NOTES FYS 4550/FYS EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO

LECTURE NOTES FYS 4550/FYS EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO LECTURE NOTES FYS 4550/FYS9550 - EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I PROBABILITY AND STATISTICS A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO Before embarking on the concept

More information

Testing for Regime Switching in Singaporean Business Cycles

Testing for Regime Switching in Singaporean Business Cycles Testing for Regime Switching in Singaporean Business Cycles Robert Breunig School of Economics Faculty of Economics and Commerce Australian National University and Alison Stegman Research School of Pacific

More information

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I.

Vector Autoregressive Model. Vector Autoregressions II. Estimation of Vector Autoregressions II. Estimation of Vector Autoregressions I. Vector Autoregressive Model Vector Autoregressions II Empirical Macroeconomics - Lect 2 Dr. Ana Beatriz Galvao Queen Mary University of London January 2012 A VAR(p) model of the m 1 vector of time series

More information

Advanced Herd Management Probabilities and distributions

Advanced Herd Management Probabilities and distributions Advanced Herd Management Probabilities and distributions Anders Ringgaard Kristensen Slide 1 Outline Probabilities Conditional probabilities Bayes theorem Distributions Discrete Continuous Distribution

More information

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science. Texts in Statistical Science Generalized Linear Mixed Models Modern Concepts, Methods and Applications Walter W. Stroup CRC Press Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint

More information

Modeling and Performance Analysis with Discrete-Event Simulation

Modeling and Performance Analysis with Discrete-Event Simulation Simulation Modeling and Performance Analysis with Discrete-Event Simulation Chapter 9 Input Modeling Contents Data Collection Identifying the Distribution with Data Parameter Estimation Goodness-of-Fit

More information

A short introduction to INLA and R-INLA

A short introduction to INLA and R-INLA A short introduction to INLA and R-INLA Integrated Nested Laplace Approximation Thomas Opitz, BioSP, INRA Avignon Workshop: Theory and practice of INLA and SPDE November 7, 2018 2/21 Plan for this talk

More information

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements:

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements: Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis Anna E. Barón, Keith E. Muller, Sarah M. Kreidler, and Deborah H. Glueck Acknowledgements: The project was supported

More information

Sample size calculations for logistic and Poisson regression models

Sample size calculations for logistic and Poisson regression models Biometrika (2), 88, 4, pp. 93 99 2 Biometrika Trust Printed in Great Britain Sample size calculations for logistic and Poisson regression models BY GWOWEN SHIEH Department of Management Science, National

More information

Topic 4 Unit Roots. Gerald P. Dwyer. February Clemson University

Topic 4 Unit Roots. Gerald P. Dwyer. February Clemson University Topic 4 Unit Roots Gerald P. Dwyer Clemson University February 2016 Outline 1 Unit Roots Introduction Trend and Difference Stationary Autocorrelations of Series That Have Deterministic or Stochastic Trends

More information

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY

Time Series Analysis. James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY Time Series Analysis James D. Hamilton PRINCETON UNIVERSITY PRESS PRINCETON, NEW JERSEY & Contents PREFACE xiii 1 1.1. 1.2. Difference Equations First-Order Difference Equations 1 /?th-order Difference

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA

CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA Examples: Multilevel Modeling With Complex Survey Data CHAPTER 9 EXAMPLES: MULTILEVEL MODELING WITH COMPLEX SURVEY DATA Complex survey data refers to data obtained by stratification, cluster sampling and/or

More information

Open Problems in Mixed Models

Open Problems in Mixed Models xxiii Determining how to deal with a not positive definite covariance matrix of random effects, D during maximum likelihood estimation algorithms. Several strategies are discussed in Section 2.15. For

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Multivariate statistical methods and data mining in particle physics

Multivariate statistical methods and data mining in particle physics Multivariate statistical methods and data mining in particle physics RHUL Physics www.pp.rhul.ac.uk/~cowan Academic Training Lectures CERN 16 19 June, 2008 1 Outline Statement of the problem Some general

More information

Non-Stationary Time Series and Unit Root Testing

Non-Stationary Time Series and Unit Root Testing Econometrics II Non-Stationary Time Series and Unit Root Testing Morten Nyboe Tabor Course Outline: Non-Stationary Time Series and Unit Root Testing 1 Stationarity and Deviation from Stationarity Trend-Stationarity

More information

Discrete Response Multilevel Models for Repeated Measures: An Application to Voting Intentions Data

Discrete Response Multilevel Models for Repeated Measures: An Application to Voting Intentions Data Quality & Quantity 34: 323 330, 2000. 2000 Kluwer Academic Publishers. Printed in the Netherlands. 323 Note Discrete Response Multilevel Models for Repeated Measures: An Application to Voting Intentions

More information

Tutorial 4: Power and Sample Size for the Two-sample t-test with Unequal Variances

Tutorial 4: Power and Sample Size for the Two-sample t-test with Unequal Variances Tutorial 4: Power and Sample Size for the Two-sample t-test with Unequal Variances Preface Power is the probability that a study will reject the null hypothesis. The estimated probability is a function

More information

Computer Science, Informatik 4 Communication and Distributed Systems. Simulation. Discrete-Event System Simulation. Dr.

Computer Science, Informatik 4 Communication and Distributed Systems. Simulation. Discrete-Event System Simulation. Dr. Simulation Discrete-Event System Simulation Chapter 8 Input Modeling Purpose & Overview Input models provide the driving force for a simulation model. The quality of the output is no better than the quality

More information

On a multivariate implementation of the Gibbs sampler

On a multivariate implementation of the Gibbs sampler Note On a multivariate implementation of the Gibbs sampler LA García-Cortés, D Sorensen* National Institute of Animal Science, Research Center Foulum, PB 39, DK-8830 Tjele, Denmark (Received 2 August 1995;

More information

Ch3. TRENDS. Time Series Analysis

Ch3. TRENDS. Time Series Analysis 3.1 Deterministic Versus Stochastic Trends The simulated random walk in Exhibit 2.1 shows a upward trend. However, it is caused by a strong correlation between the series at nearby time points. The true

More information

Markov Decision Processes: Biosens II

Markov Decision Processes: Biosens II Markov Decision Processes: Biosens II E. Jørgensen & Lars R. Nielsen Department of Genetics and Biotechnology Faculty of Agricultural Sciences, University of Århus / 008 : Markov Decision Processes Examples

More information

EC821: Time Series Econometrics, Spring 2003 Notes Section 9 Panel Unit Root Tests Avariety of procedures for the analysis of unit roots in a panel

EC821: Time Series Econometrics, Spring 2003 Notes Section 9 Panel Unit Root Tests Avariety of procedures for the analysis of unit roots in a panel EC821: Time Series Econometrics, Spring 2003 Notes Section 9 Panel Unit Root Tests Avariety of procedures for the analysis of unit roots in a panel context have been developed. The emphasis in this development

More information

Lecture 32: Infinite-dimensional/Functionvalued. Functions and Random Regressions. Bruce Walsh lecture notes Synbreed course version 11 July 2013

Lecture 32: Infinite-dimensional/Functionvalued. Functions and Random Regressions. Bruce Walsh lecture notes Synbreed course version 11 July 2013 Lecture 32: Infinite-dimensional/Functionvalued Traits: Covariance Functions and Random Regressions Bruce Walsh lecture notes Synbreed course version 11 July 2013 1 Longitudinal traits Many classic quantitative

More information

Circle the single best answer for each multiple choice question. Your choice should be made clearly.

Circle the single best answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 6, 2017 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. There are 32 multiple choice

More information

STATISTICAL ANALYSIS WITH MISSING DATA

STATISTICAL ANALYSIS WITH MISSING DATA STATISTICAL ANALYSIS WITH MISSING DATA SECOND EDITION Roderick J.A. Little & Donald B. Rubin WILEY SERIES IN PROBABILITY AND STATISTICS Statistical Analysis with Missing Data Second Edition WILEY SERIES

More information

Spatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields

Spatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields Spatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields 1 Introduction Jo Eidsvik Department of Mathematical Sciences, NTNU, Norway. (joeid@math.ntnu.no) February

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Approximate Bayesian Computation

Approximate Bayesian Computation Approximate Bayesian Computation Michael Gutmann https://sites.google.com/site/michaelgutmann University of Helsinki and Aalto University 1st December 2015 Content Two parts: 1. The basics of approximate

More information

When is a copula constant? A test for changing relationships

When is a copula constant? A test for changing relationships When is a copula constant? A test for changing relationships Fabio Busetti and Andrew Harvey Bank of Italy and University of Cambridge November 2007 usetti and Harvey (Bank of Italy and University of Cambridge)

More information

Applied Microeconometrics (L5): Panel Data-Basics

Applied Microeconometrics (L5): Panel Data-Basics Applied Microeconometrics (L5): Panel Data-Basics Nicholas Giannakopoulos University of Patras Department of Economics ngias@upatras.gr November 10, 2015 Nicholas Giannakopoulos (UPatras) MSc Applied Economics

More information

Repeated Records Animal Model

Repeated Records Animal Model Repeated Records Animal Model 1 Introduction Animals are observed more than once for some traits, such as Fleece weight of sheep in different years. Calf records of a beef cow over time. Test day records

More information

Stochastic Processes

Stochastic Processes Stochastic Processes Stochastic Process Non Formal Definition: Non formal: A stochastic process (random process) is the opposite of a deterministic process such as one defined by a differential equation.

More information

COMPUTER ALGEBRA DERIVATION OF THE BIAS OF LINEAR ESTIMATORS OF AUTOREGRESSIVE MODELS

COMPUTER ALGEBRA DERIVATION OF THE BIAS OF LINEAR ESTIMATORS OF AUTOREGRESSIVE MODELS COMPUTER ALGEBRA DERIVATION OF THE BIAS OF LINEAR ESTIMATORS OF AUTOREGRESSIVE MODELS Y. ZHANG and A.I. MCLEOD Acadia University and The University of Western Ontario May 26, 2005 1 Abstract. A symbolic

More information

Network Simulation Chapter 5: Traffic Modeling. Chapter Overview

Network Simulation Chapter 5: Traffic Modeling. Chapter Overview Network Simulation Chapter 5: Traffic Modeling Prof. Dr. Jürgen Jasperneite 1 Chapter Overview 1. Basic Simulation Modeling 2. OPNET IT Guru - A Tool for Discrete Event Simulation 3. Review of Basic Probabilities

More information

Do Markov-Switching Models Capture Nonlinearities in the Data? Tests using Nonparametric Methods

Do Markov-Switching Models Capture Nonlinearities in the Data? Tests using Nonparametric Methods Do Markov-Switching Models Capture Nonlinearities in the Data? Tests using Nonparametric Methods Robert V. Breunig Centre for Economic Policy Research, Research School of Social Sciences and School of

More information

Quantitative characters - exercises

Quantitative characters - exercises Quantitative characters - exercises 1. a) Calculate the genetic covariance between half sibs, expressed in the ij notation (Cockerham's notation), when up to loci are considered. b) Calculate the genetic

More information

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness

A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A Bayesian Nonparametric Approach to Monotone Missing Data in Longitudinal Studies with Informative Missingness A. Linero and M. Daniels UF, UT-Austin SRC 2014, Galveston, TX 1 Background 2 Working model

More information

Statistics & Data Sciences: First Year Prelim Exam May 2018

Statistics & Data Sciences: First Year Prelim Exam May 2018 Statistics & Data Sciences: First Year Prelim Exam May 2018 Instructions: 1. Do not turn this page until instructed to do so. 2. Start each new question on a new sheet of paper. 3. This is a closed book

More information

Estimating Variances and Covariances in a Non-stationary Multivariate Time Series Using the K-matrix

Estimating Variances and Covariances in a Non-stationary Multivariate Time Series Using the K-matrix Estimating Variances and Covariances in a Non-stationary Multivariate ime Series Using the K-matrix Stephen P Smith, January 019 Abstract. A second order time series model is described, and generalized

More information

Joint Estimation of Risk Preferences and Technology: Further Discussion

Joint Estimation of Risk Preferences and Technology: Further Discussion Joint Estimation of Risk Preferences and Technology: Further Discussion Feng Wu Research Associate Gulf Coast Research and Education Center University of Florida Zhengfei Guan Assistant Professor Gulf

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

FULL LIKELIHOOD INFERENCES IN THE COX MODEL October 20, 2007 FULL LIKELIHOOD INFERENCES IN THE COX MODEL BY JIAN-JIAN REN 1 AND MAI ZHOU 2 University of Central Florida and University of Kentucky Abstract We use the empirical likelihood approach

More information

multilevel modeling: concepts, applications and interpretations

multilevel modeling: concepts, applications and interpretations multilevel modeling: concepts, applications and interpretations lynne c. messer 27 october 2010 warning social and reproductive / perinatal epidemiologist concepts why context matters multilevel models

More information

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Linear Classification. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Linear Classification CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Example of Linear Classification Red points: patterns belonging

More information

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples

Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples Bayesian inference for sample surveys Roderick Little Module : Bayesian models for simple random samples Superpopulation Modeling: Estimating parameters Various principles: least squares, method of moments,

More information

Genetic Parameters for Stillbirth in the Netherlands

Genetic Parameters for Stillbirth in the Netherlands Genetic Parameters for Stillbirth in the Netherlands Arnold Harbers, Linda Segeren and Gerben de Jong CR Delta, P.O. Box 454, 68 AL Arnhem, The Netherlands Harbers.A@CR-Delta.nl 1. Introduction Stillbirth

More information

Non-Stationary Time Series and Unit Root Testing

Non-Stationary Time Series and Unit Root Testing Econometrics II Non-Stationary Time Series and Unit Root Testing Morten Nyboe Tabor Course Outline: Non-Stationary Time Series and Unit Root Testing 1 Stationarity and Deviation from Stationarity Trend-Stationarity

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

Maternal Genetic Models

Maternal Genetic Models Maternal Genetic Models In mammalian species of livestock such as beef cattle sheep or swine the female provides an environment for its offspring to survive and grow in terms of protection and nourishment

More information

HITTING TIME IN AN ERLANG LOSS SYSTEM

HITTING TIME IN AN ERLANG LOSS SYSTEM Probability in the Engineering and Informational Sciences, 16, 2002, 167 184+ Printed in the U+S+A+ HITTING TIME IN AN ERLANG LOSS SYSTEM SHELDON M. ROSS Department of Industrial Engineering and Operations

More information

UNIVERSITÄT POTSDAM Institut für Mathematik

UNIVERSITÄT POTSDAM Institut für Mathematik UNIVERSITÄT POTSDAM Institut für Mathematik Testing the Acceleration Function in Life Time Models Hannelore Liero Matthias Liero Mathematische Statistik und Wahrscheinlichkeitstheorie Universität Potsdam

More information

Model selection and checking

Model selection and checking CHAPTER 6 Model selection and checking In the basic HMM with m states, increasing m always improves the fit of the model (as judged by the likelihood). But along with the improvement comes a quadratic

More information

Animal Model. 2. The association of alleles from the two parents is assumed to be at random.

Animal Model. 2. The association of alleles from the two parents is assumed to be at random. Animal Model 1 Introduction In animal genetics, measurements are taken on individual animals, and thus, the model of analysis should include the animal additive genetic effect. The remaining items in the

More information

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective Second Edition Scott E. Maxwell Uniuersity of Notre Dame Harold D. Delaney Uniuersity of New Mexico J,t{,.?; LAWRENCE ERLBAUM ASSOCIATES,

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

CHAPTER 3 DYNAMIC PRODUCTION MONITORING IN PIG HERDS I: MODELING AND MONITORING LITTER SIZE AT HERD AND SOW LEVEL

CHAPTER 3 DYNAMIC PRODUCTION MONITORING IN PIG HERDS I: MODELING AND MONITORING LITTER SIZE AT HERD AND SOW LEVEL CHAPTER 3 DYNAMIC PRODUCTION MONITORING IN PIG HERDS I: MODELING AND MONITORING LITTER SIZE AT HERD AND SOW LEVEL Claudia Bono, Cécile Cornou and Anders Ringgaard Kristensen Published in Livestock Science

More information

Prof. Dr. Roland Füss Lecture Series in Applied Econometrics Summer Term Introduction to Time Series Analysis

Prof. Dr. Roland Füss Lecture Series in Applied Econometrics Summer Term Introduction to Time Series Analysis Introduction to Time Series Analysis 1 Contents: I. Basics of Time Series Analysis... 4 I.1 Stationarity... 5 I.2 Autocorrelation Function... 9 I.3 Partial Autocorrelation Function (PACF)... 14 I.4 Transformation

More information