Seminar on Longitudinal Analysis


1 Seminar on Longitudinal Analysis. James Heckman, University of Chicago. This draft, May 20. 1 / 191

2 Review of Data Collection Techniques Before addressing issues regarding the modeling and estimation of individual histories, we first review the type of data available on processes such as fertility, employment and unemployment, and other state transition processes. For illustration, suppose we examine the characterization of a two-state unemployment/employment process over continuous time with asynchronous switching. In the next figure, let t = 0 be the starting date and T_0 be the termination date. Individual I experiences three transitions in this time period, while II makes only two. 2 / 191

3 Two-State Unemployment/Employment [Figure: sample paths for individuals I and II between states 0 (unemployed) and 1 (employed) from t = 0 to T_0] 3 / 191

4 Event history data Event history data provides the analyst with information on: 1 all of the switches of each individual 2 length of time spent in each state This data is extremely rare. A few examples of such data sets: Fertility studies of the Bicol region of the Philippines Another from Denmark Seattle employment study These studies are available only for a finite length of time. Thus, the ending date of the survey censors the final observation: the duration of the individual's stay in the last state is unknown. 4 / 191

5 CPS data collection The CPS data collection technique actually provides exceedingly little information for longitudinal analysis despite its breadth. The CPS survey follows a group of individuals in the following fashion: 1 At time t = 0 an initial survey is performed. 2 After 9 months, a second survey is made. 3 After 11 months, a followup is performed. The following four case histories, each represented as a single broken line (line for unemployment and break for employment), illustrate how choppy the CPS data is. 5 / 191

6 CPS Survey [Figure: four case histories A, B, C, D drawn on a timeline with marks at t = 0, 2 months, T = 9, and T = 11] 6 / 191

7 The questioning about past history of unemployment spells is such that: A is recorded as one uninterrupted spell of unemployment B is also recorded as an uninterrupted spell C is a spell of known duration and ending date D is known only as lasting between 2 and 9 months and is right-censored. Such data doesn't even allow computation of an unbiased mean duration of unemployment spells, let alone allowing us to answer questions regarding the dependence of spell length on past history of the individual or on calendar time. 7 / 191

8 Prospective Point Sampling A good deal of occupational information is collected on this basis, sometimes (misleadingly) called continuous history data. Compared with the CPS framework, this scheme is very simple. At chosen sampling intervals, the individual reports the state he or she currently occupies. The detail of such surveys depends greatly on the chosen sample interval and the characteristics of the transition process. 8 / 191

9 Prospective Point Sampling One-year intervals may be sufficiently narrow for a fertility series, but too wide for employment transitions. Multiple transitions may occur within the interval. Unless retrospective information is gathered at the sampling dates to fill in such gaps, this method may convey too little information regarding the time dynamics of the underlying process. 9 / 191

10 Event Count Data Such surveys record the number of switches between states only. There is no information on duration, incidence rates or spacing. 10 / 191

11 Partial observability This is a situation arising when the frequency of observation is not the same across variables. In some surveys, earnings are recorded on a yearly basis while unemployment is followed at three-month intervals. Methods to study correlation between such disparate series will be addressed. 11 / 191

12 Partial observability The moral of this review for those involved in data collection design is: 1 collect all retrospective data; 2 the wider the chosen sampling intervals, the noisier the recording of the underlying process, although the surveying may be less costly; 3 the more characteristics regarding each individual are collected, the easier it is to control for individual errors in reporting. 12 / 191

13 Duration Models Standard hazard specifications A duration model characterizes the probability of an event occurring as a function of elapsed time, when the waiting time is governed by some random distribution. Let T be the waiting time until the occurrence of an event.

S(t) = exp[ −∫_0^t h(u) du ] = the probability that T exceeds some time span t

h(t) = −d/dt [log S(t)] = g(t)/S(t),

where g(t) is the density associated with G(t) = 1 − S(t). 13 / 191

14 S(t) is called the survivor function and h(t) the hazard function, and G(t) is the distribution of spell lengths. Note that when discussing a given population of individuals and a single event which each individual experiences only once, the functions have the following interpretations:
S(t) = the fraction of people who have not yet experienced the event
g(t) = rate of occurrence per unit time
h(t) = rate of occurrence per individual still to experience the event, or at risk
14 / 191

15 A crude approximation to the survivor function S(t) is a step function giving the number surviving as a percent of the total starting population. Each time an event occurs, the number of survivors falls in a discrete fashion. 15 / 191

16 Survivor Step Function [Figure: step-function approximation to S(t), starting at 1 and falling toward 0 as t increases] 16 / 191

17 An interesting transformation of the S(t) function is

−log S(t) = ∫_0^t h(u) du.

The curvature of this integrated hazard may be read from the next graph: in case A the rate of occurrence is slowing down with time, and so is the number of individuals who leave; in case B the rate of occurrence is speeding up; in case C the rate of occurrence changes with time, as does the number of individuals who leave. 17 / 191

18 Curvature of Integrated Hazard [Figure: −log S(t) plotted against t for the three cases A, B, and C] 18 / 191

19 Giving a functional form to the specification of the hazard is a problem of balancing flexibility to incorporate all of cases A, B and C against convenience: finding a parametrization which is easy to work with. Six types of hazard function are given below (a short numerical sketch of several of them follows the list). 1 The simplest is of course a constant hazard over time: h(t) = C for all t > 0. Somewhat restrictive. 2 The Weibull family of distributions offers simplicity and more flexibility: h(t) = αt^{α−1}, α > 0. If α = 1, this yields a constant hazard; if 0 < α < 1, h(t) is decreasing in t; if α > 1, h(t) is increasing in t. 19 / 191

20 3 The Gompertz distributions also offer flexibility and convenience, with a problem when the parameter γ is negative: h(t) = exp(γt). For γ > 0, h(t) is increasing. For γ < 0, h(t) is decreasing, but the distribution function is defective: the probability P(T > t) = S(t) does not go to zero as t goes to infinity. For γ < 0,

∫_0^t h(u) du = ∫_0^t exp(γu) du = (1/γ)[exp(γt) − 1],

S(t) = exp[ −(1/γ)(exp(γt) − 1) ],

lim_{t→∞} S(t) = exp(1/γ) ≠ 0. 20 / 191

21 4 The Box-Cox hazard is a commingling of the preceding two:

h(t) = exp[ γ (t^λ − 1)/λ ].

As λ → 0, h(t) approaches the Weibull form; λ = 1 gives the Gompertz; γ = 0 gives a constant hazard. 21 / 191

22 5 The unimodal parametrization allows h(t) to be initially increasing and subsequently decreasing. The inflection point of curve C is the mode of the hazard function.

h(t) = λα(λt)^{α−1} / [1 + (λt)^α]

If 0 < α < 1, h(t) is decreasing from +∞; if α = 1, h(t) is decreasing from λ; if α > 1, h(t) is unimodal with mode at (α − 1)^{1/α}/λ. 22 / 191

23 6 Finally, a more complex hazard function with a concrete basis for the parametrization is the following. Let t = elapsed calendar time and x = age of an individual female. A hazard specification for the fertility of women who have had a child might be as follows:

h(t, x) = ρ η(x) (η(x) t)^{α−1} exp[ −η(x) t ] 23 / 191

24 Notice the relation between this function and the unimodal hazard of 5; here the parameter which governs the mode is η(x). By specifying that η(x) declines with the age of the female, the declining fertility of older women is incorporated into the hazard. An example of η(x) would be

η(x) = 0 if x < 14; 1 if 14 ≤ x < 25; [1 − (x − 25)/τ]^2 if 25 ≤ x < the upper age cutoff; 0 thereafter. 24 / 191

25 [Figure: η(x) plotted against age x] 25 / 191
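
To make the shapes of these families concrete, the following short Python sketch (not part of the original notes; all parameter values are made up) evaluates the constant, Weibull, Gompertz, and unimodal hazards from the list above on a grid, and checks the location of the unimodal mode. The log-logistic form used for the unimodal family is the one written above.

```python
import numpy as np

t = np.linspace(0.1, 5.0, 50)   # avoid t = 0, where the Weibull hazard can blow up

def weibull_hazard(t, alpha):
    return alpha * t ** (alpha - 1)

def gompertz_hazard(t, gamma):
    return np.exp(gamma * t)

def unimodal_hazard(t, lam, alpha):
    # log-logistic form assumed for the "unimodal parametrization" above
    return lam * alpha * (lam * t) ** (alpha - 1) / (1.0 + (lam * t) ** alpha)

print("constant hazard  :", np.full_like(t, 0.5)[:3])
print("Weibull, a=0.5   :", weibull_hazard(t, 0.5)[:3])   # decreasing
print("Weibull, a=2.0   :", weibull_hazard(t, 2.0)[:3])   # increasing
print("Gompertz, g=-0.3 :", gompertz_hazard(t, -0.3)[:3]) # decreasing, defective S
print("unimodal, a=2    :", unimodal_hazard(t, 1.0, 2.0)[:3])
# mode of the unimodal hazard should be near (alpha - 1)**(1/alpha) / lam = 1.0
print("argmax of unimodal hazard:", t[np.argmax(unimodal_hazard(t, 1.0, 2.0))])
```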

26 Mixtures of constant hazard rates Even in the simple model in which all individuals have some constant hazard rate, θ, and these rates are distributed across the population according to some mixing density, m(θ), problems of identification arise. First, we show that such a mixing model always generates a declining proportional hazard rate, and then discuss distinguishing such a model from one in which each individual has a decreasing hazard rate. 26 / 191

27 This form of heterogeneity, a mixture of constant hazard rates, is common in the duration analysis literature.

S(t) = ∫_0^∞ exp(−θt) dµ(θ),

where dµ(θ) is equivalent to m(θ) dθ, with m(θ) = mixing density of θ in the population and θ = unobserved heterogeneous hazard rate. 27 / 191

28 For a population with such a combination of individuals, the proportional, or population, hazard rate is always decreasing over time:

h(t) = −d/dt [ln S(t)] = ∫_0^∞ θ e^{−θt} dµ(θ) / ∫_0^∞ e^{−θt} dµ(θ)

d/dt h(t) = { [∫_0^∞ θ e^{−θt} dµ(θ)]^2 − ∫_0^∞ e^{−θt} dµ(θ) · ∫_0^∞ θ^2 e^{−θt} dµ(θ) } / [∫_0^∞ e^{−θt} dµ(θ)]^2

By the Cauchy-Schwarz inequality for L^2, the sign of dh(t)/dt is negative.^1

^1 (∫ x(s) y(s) ds)^2 ≤ ∫ x^2(s) ds · ∫ y^2(s) ds. Let x^2(s) ds correspond to e^{−θt} dµ(θ) and y^2(s) ds to θ^2 e^{−θt} dµ(θ). 28 / 191
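
A quick numerical illustration of this result, under an assumed two-point mixture (a Python sketch; the hazard values and weights are arbitrary): every individual has a constant hazard, yet the population hazard computed from the mixed survivor function declines.

```python
import numpy as np

# Two-point mixture: half the population has hazard 0.5, half has hazard 2.0.
thetas = np.array([0.5, 2.0])
weights = np.array([0.5, 0.5])

t = np.linspace(0.0, 5.0, 501)
S = (weights * np.exp(-np.outer(t, thetas))).sum(axis=1)   # mixed survivor function

# Population hazard h(t) = -d log S(t) / dt, by numerical differentiation.
h = -np.gradient(np.log(S), t)

print("h near 0 :", h[0])     # close to the mean hazard, 1.25
print("h near 5 :", h[-1])    # close to the smallest hazard, 0.5
print("monotone decreasing:", bool(np.all(np.diff(h) <= 1e-9)))
```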

29 This result is quite general but the converse is not true. One restriction that helps but does not completely resolve the difficulty of distinguishing between mixtures of constant hazards and mixtures of decreasing hazards is that of complete monotonicity. That is, mixtures of constant hazards must have derivatives which alternate in sign:

(−1)^n d^n S(t)/dt^n ≥ 0 for all t ≥ 0 and all positive integers n.

If the data are fine enough to support differentiation, then a simple sufficient condition (see Feller 1971, Vol. II) for this property to be violated is

−h''(t_0) + 3 h(t_0) h'(t_0) − h^3(t_0) > 0. 29 / 191

30 However, it may be that an alternative specification with decreasing hazards,

S(t) = ∫_0^∞ e^{−θ t^α} dµ(θ), 0 < α < 1, (1)

is identical to a mixture-of-constant-hazards model

S(t) = ∫_0^∞ e^{−θ t} dµ*(θ), (2)

where the new mixture dµ* is properly defined. Both (1) and (2) are completely monotone, and of course observationally equivalent. 30 / 191

31 This underidentification can be overcome only by further a priori assumptions. If only densities with finite means are allowed, then (2) may be ruled out: for example, if dµ(θ) is a gamma density, then dµ*(θ) must have a fat tail. Alternatively, if there is a theoretical restriction on the time dependence, which is presumed the same for all individuals, i.e., α is known, then the mixture dµ(θ) is identified. 31 / 191

32 Actuarial Estimators Non-parametric estimates of S(t) and ∫_0^t h(u) du will involve maximum likelihood estimators over a function space. No distributional assumptions regarding m(θ) are made. The first example of such a likelihood is an actuarial estimator on a single-state process. 32 / 191

33 The actuarial estimator is based on the following observation: if failure times across a population are governed by the same distribution, but the information on the process gives only the number of survivors at arbitrary time points t_1 < t_2 < t_3 < ⋯ < t_k delimiting intervals I_1, I_2, I_3, …, then the survivor function evaluated at each one of the time points is

S(t_k) = P(T > t_k) = P(T > t_1, T > t_2, …, T > t_k)
= P(T > t_k | T > t_1, …, T > t_{k−1}) · P(T > t_{k−1} | T > t_1, …, T > t_{k−2}) ⋯ P(T > t_1)
= P(T > t_k | T > t_{k−1}) · P(T > t_{k−1} | T > t_{k−2}) ⋯ P(T > t_1). 33 / 191

34 Notice that if T > t_{k−1}, then T > t_j for all j < k − 1. If there are η_j individuals at risk at the beginning of time interval I_j and one failure occurs in that interval, then an estimated probability of surviving the interval is P̂(T > t_j | T > t_{j−1}) = 1 − 1/η_j. The estimated survivor function is

Ŝ(t_k) = ∏_{j=1}^{k} (1 − 1/η_j). 34 / 191

35 If the number of failures within an interval is greater than one, then the actuarial estimator may be modified. Let d_i be the number of terminations in interval I_i; then

Ŝ(t_k) = ∏_{j=1}^{k} (1 − d_j/η_j).

The Kaplan-Meier non-parametric maximum likelihood estimator is an extension of this actuarial construct:

Ŝ(t) = ∏_{i: t_i < t} (1 − d_i/η_i).

Here t_1 < t_2 < ⋯ are the actual times at which individuals experience the event:
d_i = number of individuals exiting at the i-th event time
η_i = number of individuals at risk at the i-th event time 35 / 191
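
A compact implementation of this product-limit estimator (a sketch; the function name, data layout, and the small worked example are ours, not from the notes):

```python
import numpy as np

def kaplan_meier(times, censored):
    """Product-limit estimator S_hat(t) = prod over event times t_i <= t of (1 - d_i/eta_i).

    times    : observed durations (event or censoring times)
    censored : 1 if the observation is censored, 0 if the event was observed
    Returns the distinct event times and the estimated survivor function there.
    """
    times = np.asarray(times, dtype=float)
    censored = np.asarray(censored, dtype=int)
    event_times = np.unique(times[censored == 0])
    S, surv = 1.0, []
    for t in event_times:
        eta = np.sum(times >= t)                       # number still at risk at t
        d = np.sum((times == t) & (censored == 0))     # number of events at t
        S *= 1.0 - d / eta
        surv.append(S)
    return event_times, np.array(surv)

# Small worked example: eight durations with two censored observations.
t_obs = [2, 3, 3, 5, 7, 8, 8, 11]
cens  = [0, 0, 1, 0, 1, 0, 0, 0]
print(kaplan_meier(t_obs, cens))
```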

36 Given the non-parametric likelihood, in order to perform hypothesis testing we need a standard deviation as a function of time. Greenwood's formula gives

γ(t) = Ŝ(t) Σ_{j=1}^{k_t} d_j / [η_j (η_j + 1)],

where k_t is the value of k such that t ∈ [t_{(k)}, t_{(k+1)}). 36 / 191

37 The Aalen estimator of the integrated hazard is a good descriptive device; it uses the same technique used in procedures for estimating multi-state transition rates, which may involve complicated time dependence. The survivor function is

S(t) = exp( −∫_0^t h(u) du ).

A procedure to estimate the integrated hazard sums the ratio of exiting individuals to the number remaining at risk at each event time:

∫_0^t ĥ(u) du = Σ_{i: t_i < t} d_i/η_i. 37 / 191

38 Let Ŝ(t) be the Kaplan-Meier estimator. The relation of the Aalen estimate to Ŝ(t) may be seen by the following transformation:

−ln Ŝ(t) = −Σ_{i: t_i < t} ln(1 − d_i/η_i) ≈ ∫_0^t h(u) du. 38 / 191

39 Each term of this Kaplan-Meier integrated hazard, when written as a series expansion, is

−ln(1 − d_i/η_i) = d_i/η_i + d_i^2/(2η_i^2) + d_i^3/(3η_i^3) + ⋯.

The Aalen estimator ignores the higher order terms of this expansion. If the number at risk is large, most of the weight in this series does indeed fall on the first term alone. The Aalen estimator also tends to correct the bias introduced by the nonlinear transformation ln(S(t)). 39 / 191
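
The following sketch (ours) computes the Aalen estimate Σ d_i/η_i alongside −Σ ln(1 − d_i/η_i) on simulated exponential data, illustrating that the two agree closely when the risk sets are large relative to the d_i; the simulated rates and the evaluation point are arbitrary.

```python
import numpy as np

def integrated_hazard(times, censored, tau):
    """Nelson-Aalen sum d_i/eta_i and the Kaplan-Meier based -sum log(1 - d_i/eta_i),
    both taken over event times t_i <= tau."""
    times = np.asarray(times, dtype=float)
    censored = np.asarray(censored, dtype=int)
    na, neg_log_km = 0.0, 0.0
    for t in np.unique(times[(censored == 0) & (times <= tau)]):
        eta = np.sum(times >= t)                      # at risk just before t
        d = np.sum((times == t) & (censored == 0))    # events at t
        na += d / eta
        neg_log_km -= np.log1p(-d / eta)
    return na, neg_log_km

rng = np.random.default_rng(1)
true_hazard = 0.5
durations = rng.exponential(1.0 / true_hazard, size=500)
censoring = rng.exponential(4.0, size=500)
obs, cens = np.minimum(durations, censoring), (censoring < durations).astype(int)

tau = 3.0
na, neg_log_km = integrated_hazard(obs, cens, tau)
print(na, neg_log_km, "true value:", true_hazard * tau)   # all three should be close
```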

40 In failure time models, once an individual experiences the event, he is out of the pool of individuals at risk for the rest of the survey. In the following example taken from economics, transitions between unemployment, employment and being out of the labor force may be repeated any number of times. The Aalen estimator may be used in this context as well in testing a hypothesis regarding the transition rates between states. 40 / 191

41 Given three employment states, there are six possible transition rates. [Figure: transition diagram among the states U (unemployed), E (employed), and O (out of the labor force)] 41 / 191

42 Define the following event rates:
r_{U,E}(t) = expected number of unemployment to employment transitions per unit time, per individual at risk
r_{O,E}(t) = expected number of transitions from out of the labor force to employment, per unit time, per individual at risk.
Flinn and Heckman pose the question of whether unemployment and out of the labor force should be designated as separate classifications. 42 / 191

43 To test this, they examine the hypothesis H_0: r_{U,E}(t) = r_{O,E}(t). First, the assumption that transitions depend on past history is made. Suppose the data available give a counting process over the entire population. The six cumulative counts are

N(t) = (N_{U,E}(t), N_{O,E}(t), N_{E,U}(t), …).

Note that individuals may appear in these counts more than once. 43 / 191

44 The Aalen estimator for the integrated transition rate is

∫_0^t r̂_{U,E}(u) du = Σ_{k: 0 ≤ t_k < t} d_k / Y_U(t^{U,E}_{(k)}),

where Y_U(t) = the number of individuals in state U at time t. Although Y_U need not decline monotonically with multistate flows, it is still the number of individuals at risk in state U. 44 / 191

45 The test relies on the query: how frequently do events occur per time period per individual? The event times of the two transition patterns, s_1 < s_2 < ⋯ < s_k for O → E and t_1 < t_2 < ⋯ < t_k for U → E, do not have to match, but having complete event history data is crucial. The test asks whether

∫_0^t r̂_{U,E}(u) du = Σ_{k: 0 ≤ t_k < t} d_k / Y_U(t^{U,E}_{(k)})

is equal to

∫_0^t r̂_{O,E}(u) du = Σ_{k: 0 ≤ s_k < t} d_k / Y_O(s^{O,E}_{(k)}). 45 / 191

46 If event history data were not available, and instead prospective point data were, then multiple intermediate transitions would be unobservable. To infer what jumps occurred between observed points, one might try to fit a Markov or semi-Markov process. 46 / 191

47 Estimation of Separable Hazard Models In this section, the problem of estimating the form of time dependence is addressed. Specifically, the sensitivity of the time dependence estimate to the form of the distribution of unobserved heterogeneity assumed and to the parametrization of the time dependency is examined. Finally, the nonparametric method of the EM algorithm is presented as an alternative to the standard maximum likelihood methods. 47 / 191

48 If the survivor function is of the separable form

S(t | x) = [S_0(t)]^{exp(x′β)},

then partial likelihood estimation will yield a β̂ estimate, and parametric specifications of the time dependence, such as h_0(t) = αt^{α−1} or h_0(t) = e^{αt}, where

S_0(t) = exp( −∫_0^t h_0(u) du ),

may be estimated with standard techniques. Given the variety of functional forms that might be chosen for the time dependence, one would perhaps like the data to speak for themselves, in the absence of any theory on time dependence. 48 / 191

49 A nonparametric approach would proceed in the following fashion. Choose a priori a set of time intervals that need not correspond to jump times associated with occurrences of transitions; they must satisfy the condition that they are long enough to be non-empty of events. Then set

h_0(t) = e^{α_1} for 0 ≤ t < t_1; e^{α_2} for t_1 ≤ t < t_2; …; e^{α_k} for t_{k−1} ≤ t < t_k. 49 / 191

50 This places no restrictions of monotonicity, nor on the number of modes, of the hazard function. The integrated hazard at t_k will be

∫_0^{t_k} h_0(u) du = Σ_{j=1}^{k} e^{α_j} (t_j − t_{j−1})

for a stepwise hazard, as in the graph below: 50 / 191

51 Stepwise Hazard Function [Figure: piecewise-constant hazard over the chosen intervals] 51 / 191

52 The estimated hazards are

α̂_k = ln d_k − ln[ Σ_{i ∈ R_k} δ_i exp(x_i′β̂) ],

where
d_k = number of individuals who experience the event of interest in the half-open interval [t_{k−1}, t_k)
R_k = set of individuals at risk in the interval [t_{k−1}, t_k)
δ_i = 1 if the individual is uncensored.
Again, δ_i points out that non-censored individuals provide information regarding time dependence. 52 / 191

53 If the estimated ĥ_0(t) is of the form in graph (A), which appears unimodal, this would support a specification of

h_0(t) = λα(λt)^{α−1} / [1 + (λt)^α] for α > 1. 53 / 191

54 (A) [Figure: estimated ĥ_0(t) against t, unimodal] 54 / 191

55 (B) [Figure: estimated ĥ_0(t) against t] 55 / 191

56 Survivor Analysis and GMLE A brief look at the theory underlying generalized maximum likelihood follows. (See Kiefer and Wolfowitz.) Define dµ(x) to be a dominating measure if, for dP_0(x) = f(x) dµ(x), µ(A) = 0 implies that P_0(A) = 0. Let P be the class of all probability measures. For every pair of probability measures P_1 and P_2 in the class P, define

f(x; P_1, P_2) = dP_1(x) / d(P_1 + P_2)(x). 56 / 191

57 This is a Radon-Nikodym derivative. Here, f(x; P_1, P_2) plays the role of the likelihood. The measure P̂ is a generalized maximum likelihood estimator if

f(x; P̂, P) ≥ f(x; P, P̂) for every P in the class,

or

dP̂(x)/d(P̂ + P)(x) ≥ dP(x)/d(P̂ + P)(x). 57 / 191

58 Let T 1, T 2,..., T n be the times an event of interest occurs in a population surveyed. To extend the analysis, we now allow a censoring of the data in a random manner. Let C 1, C 2,..., C n be the censoring times: this is equivalent to an individual dropping out of a sample population before experiencing the event in question. 58 / 191

59 The observable is Y_i = min(T_i, C_i). At each time Y_i, an individual either makes the transition out of the state, having experienced the event, or drops out from the sample's observed population prematurely. Let the δ_i variable indicate whether y_i is censored or not: δ_i = 0 if the individual is censored, 1 if not. Consider the probability distributions on x = ((y_1, δ_1), (y_2, δ_2), …, (y_n, δ_n)). 59 / 191

60 Finding a generalized maximum likelihood estimator of the hazard rate is the same as finding a P(x) such that the observed events are given the maximum probability:

P(x) = ∏_{i=1}^{n} [Pr(T = y_{(i)})]^{δ_i} [Pr(T > y_i)]^{1−δ_i}.

The first bracketed term is the probability of an event occurring at exactly time T. It is given weight only if the event is uncensored, i.e., when δ_i = 1. The second term is the probability that the event occurs any time after the time T, which is the most information that a censoring at T yields. It receives weight only if δ_i = 0. 60 / 191

61 More succinctly, the problem is to

max ∏_{i=1}^{n} p_i^{δ_i} ( Σ_{j: y_j > y_i} p_j )^{1−δ_i} for p_1, p_2, …, p_n ≥ 0.

When the number of individuals is finite, the sum of the number of events and censorings must also be finite. (The two are equal.) 61 / 191

62 When the time partitioning is fine enough so that no two events occur exactly simultaneously, then the solution to the maximum problem is exactly the Kaplan-Meier estimate:

P̂_i = [δ_i/(η_i + 1)] ∏_{j=1}^{i−1} (1 − δ_j/(η_j + 1)).

A comparison of time dependent rate estimates depends on the assumption of a homogeneous population. Finding differences across covariates requires further investigation. 62 / 191

63 Duration Models with Covariates Complicating the underlying process by postulating that it is affected by some covariates (which we assume are observable, for now) leads to gains in estimation efficiency, given some specification of the form of duration dependence. The issue which arises, on which statisticians and social scientists are divided, is whether to model the durations (i.e., waiting times) themselves, or to model the rates of exit (the hazard function). Application of a standard regression framework to durations implies complex hazard specifications. Statisticians tend to favor fitting curves to the hazard itself with more convenient parameterizations. 63 / 191

64 Regression Framework In order to apply linear regression to waiting times, a standard technique is to take a log transformation, mapping non-negative waiting times onto the entire real line. Then a symmetric disturbance term ε_i may be applied to a regression of log durations:

ln t_i = β_0 + Σ_{j=1}^{k} β_j x_{ji} + ε_i,

where x_{ji} = the value of the j-th covariate for individual i and ε_i is iid Φ(0, σ^2), with Φ the normal cdf. 64 / 191

65 The theorizing here is at the level of linking the covariates to the expected value of log duration. The hazard is subsumed in this specification, and is worth examining. Although the waiting times are straightforward, the hazard is complex (read: "crazy"). 65 / 191

66 Let

ln T = β_0 + X′β + ε, so T = exp(β_0 + X′β + ε).

The conditional survivor function is

P(T > t | X) = P( exp(β_0 + X′β) exp(ε) > t )
= P( exp(ε) > t exp(−β_0 − X′β) )
= P( ε > ln t − β_0 − X′β )
= 1 − Φ( ln t − β_0 − X′β ). 66 / 191

67 The survivor function is

S(t) = 1 − Φ(ln t − β_0 − X′β) = exp[ −∫_0^t h(u) du ].

The hazard may be retrieved as well by differentiation:

∫_0^t h(u) du = −ln[ 1 − Φ(ln t − β_0 − X′β) ],

h(t) = (1/t) φ(ln t − β_0 − X′β) / [1 − Φ(ln t − β_0 − X′β)].

Another drawback of this method is having to know the completed waiting times t_i. Traditionally, one may know only transition times censored in some fashion. 67 / 191
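
As an illustration of how awkward this implied hazard is, the following sketch (ours, with made-up coefficients) evaluates S(t) and h(t) for one individual under the log-normal regression model above; the hazard first rises and then falls, a shape no monotone Weibull or Gompertz baseline can reproduce.

```python
import numpy as np
from scipy.stats import norm

beta0, beta = 0.5, np.array([0.3, -0.2])   # made-up regression coefficients
x = np.array([1.0, 2.0])                   # covariates for one individual
mu = beta0 + x @ beta                      # mean of log duration for this individual

def survivor(t):
    return 1.0 - norm.cdf(np.log(t) - mu)

def hazard(t):
    z = np.log(t) - mu
    return norm.pdf(z) / (t * (1.0 - norm.cdf(z)))

t = np.array([0.25, 0.5, 1.0, 2.0, 4.0, 8.0])
print("S(t):", survivor(t))
print("h(t):", hazard(t))    # non-monotone: rises and then falls over this grid
```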

68 Consider a More General Approach This approach postulates a general separable hazard specification where the time dependent portion is multiplicatively separable from the portion which varies with X, some vector of covariates, i.e.,

h(t | X) = ψ(t) u(X),

where ψ(t) and u(X) are non-negative valued functions. 68 / 191

69 An example of such a hazard is the Cox specification:

h(t | X) = ψ(t) exp(X′β),

or, equivalently,

exp( −∫_0^t h(u | X) du ) = [S_0(t)]^{exp(X′β)}, where S_0(t) = exp( −∫_0^t ψ(u) du ).

A full maximum likelihood estimator of the hazard and the coefficients on the covariates is constructed as follows. 69 / 191

70 Define
D_i = indicator for individuals who experience an event of interest at time i
C_i = indicator for censored individuals: those who drop out of the sample before they experience an event.
By looking at the conditional survivor function, we find the contribution to the likelihood for an individual experiencing the event is

[S_0(t_{(i)})]^{exp(x_i′β)} − [S_0(t^+_{(i)})]^{exp(x_i′β)},

where t^+_{(i)} denotes the time just after t_{(i)}. 70 / 191

71 For a censored individual, the contribution is [S_0(t^+_{(i)})]^{exp(x_i′β)}. The likelihood is therefore

L = ∏_{l: D_l = 1} { [S_0(t_{(l)})]^{exp(x_l′β)} − [S_0(t^+_{(l)})]^{exp(x_l′β)} } × ∏_{l: C_l = 1} [S_0(t^+_{(l)})]^{exp(x_l′β)}. 71 / 191

72 Following a method suggested by Cox, if one is interested in the covariates, then one might maximize the partial likelihood. This ignores the time-dependent part of the hazard, separating out the changes which enter through the covariates, given the additional requirement that the covariates X be time-invariant (see B. Efron). The Cox proposal is

max_β ∏_{i=1}^{n} exp(x_i′β) / Σ_{l ∈ R(t_{(i)})} exp(x_l′β),

where R(t_{(i)}) is the set of individuals at risk at time t_{(i)}. 72 / 191

73 Heuristically, this term may be related to the full maximum likelihood by the following argument. The conditional probability that person (i) experiences an event at t_{(i)}, given that R(t_{(i)}) individuals are at risk and that exactly one event occurs at time t_{(i)}, is built from the hazard

h(t | X_i) = P(t < T < t + δ | T > t) ∝ φ(t) exp(x_i′β). 73 / 191

74 By the definition of conditional probability, this is

(fraction with covariates x_{(i)} who exit in the interval (t, t + δ)) / (total population at risk as of time t) = φ(t) exp(x_{(i)}′β) / Σ_{l ∈ R(t)} φ(t) exp(x_l′β).

Making the assumption that the nature of time dependence is the same for all individuals, φ(t) cancels out, yielding one element of the Cox partial likelihood objective. 74 / 191
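
A minimal sketch of this partial-likelihood objective (our code; it assumes no tied event times and time-invariant covariates) that can be handed directly to a numerical optimizer:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_partial_likelihood(beta, times, events, X):
    """-sum over events of [ x_i'beta - log sum over the risk set of exp(x_l'beta) ]."""
    eta = X @ np.atleast_1d(beta)
    value = 0.0
    for i in np.flatnonzero(events == 1):
        risk_set = times >= times[i]                  # individuals still at risk at t_i
        value -= eta[i] - np.log(np.exp(eta[risk_set]).sum())
    return value

# Simulated data: exponential durations whose rate depends on one covariate.
rng = np.random.default_rng(2)
n = 300
X = rng.normal(size=(n, 1))
true_beta = np.array([0.8])
T = rng.exponential(1.0 / np.exp(X @ true_beta))
C = rng.exponential(2.0, size=n)
times, events = np.minimum(T, C), (T <= C).astype(int)

fit = minimize(neg_log_partial_likelihood, x0=np.zeros(1),
               args=(times, events, X), method="BFGS")
print("estimated beta:", fit.x)    # should be in the neighbourhood of 0.8
```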

75 A Diagnostic Check for Specific Hazards Given the expense of generalized maximum likelihood estimation, it is useful to have simple diagnostic tests of the specification of the time dependence in a duration model. A test of proportional hazards under Weibull time dependence can be performed by a graphical technique. Let

S(t | x) = P(T > t | X = x) = [S_0(t)]^{exp(x′β)}.

Taking a log transformation twice over, we have

ln S(t | x) = exp(x′β) ln S_0(t),
ln(−ln S(t | x)) = ln(−ln S_0(t)) + x′β. (3) 75 / 191

76 When the time dependence part of the survivor function is of the Weibull family, then S_0(t) = exp(−t^α) and the first term of equation (3) is ln(−ln S_0(t)) = α ln t, so

ln(−ln S(t | x)) = α ln t + x′β. 76 / 191

77 Graphing ln(−ln S(t | x)), a double log transformation of the survivor function, against ln(t) should produce a family of straight lines of the same slope, α. The intercepts should vary according to x′β under this separable specification of S(t | x). If the graph does not conform, then this convenient specification does not apply. 77 / 191

78 Double log Transformation of Survivor Function Against ln(t) [Figure: ln(−ln S(t)) plotted against ln(t); parallel straight lines with slope α] 78 / 191

79 More generally, unless parallel curves are generated by plots of ln[−ln S(t | x)] against ln t for various chosen values of the covariates x, the assumption of separability cannot hold. For an example, see J. Menken, J. Trussell, D. Stempel, and O. Babakol, Demography, Vol. 18, 1981 (on marital dissolution). 79 / 191
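
A small simulation of the graphical check (our code; the Weibull shape, the covariate effect, and the absence of censoring are assumptions made for the illustration): for two covariate groups, the fitted slopes of ln(−ln Ŝ) on ln t should both be near α, with intercepts differing by x′β.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 1.5, 0.7                          # Weibull shape and covariate effect
n = 2000

def simulate(xb, size):
    # S(t|x) = exp(-t^alpha * exp(xb))  =>  T = (E / exp(xb))**(1/alpha), E ~ Exp(1)
    return (rng.exponential(size=size) / np.exp(xb)) ** (1.0 / alpha)

for xb in (0.0, beta):                          # two covariate groups: x'beta = 0 and 0.7
    t = np.sort(simulate(xb, n))
    S = 1.0 - np.arange(1, n + 1) / (n + 1.0)   # empirical survivor function (no censoring)
    keep = slice(10, -10)                       # trim tails where log(-log S) is noisy
    slope, intercept = np.polyfit(np.log(t[keep]), np.log(-np.log(S[keep])), 1)
    print(f"x'beta = {xb:0.1f}: slope ~ {slope:0.2f} (alpha = {alpha}), intercept ~ {intercept:0.2f}")
```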

80 A Duration Model from Economic Theory In a search model of unemployment, the Poisson arrival of new job offers and a reservation wage strategy of income maximizing workers generate observed unemployment spells which have fundamentally non-separable hazards. Heterogeneity across workers is likely to exist either in costs of search or in wage offer distributions. This model is outlined here as an example of a duration model based on optimizing behavior by economic agents. A thorough presentation may be found in Lippman and McCall (1976). 80 / 191

81 Let
λ = Poisson encounter rate with new job offers
V = value of search
rV = reservation wage
c = instantaneous cost of search
r = instantaneous interest rate
F(w) = distribution of wage offers, assumed to have finite mean. 81 / 191

82 Agents maximize income subject to the following scheme: if search cost c is incurred, job offers arrive at rate λ independent of c; wage offers are independently drawn without recall from distribution F(w); agents are infinitely lived and jobs last forever, having present discounted value w/r. The value of search is

V = −cΔt/(1 + rΔt) + [(1 − λΔt)/(1 + rΔt)] V + [λΔt/(1 + rΔt)] E[max(w/r, V)] + o(Δt) if V > 0,
V = 0 otherwise. 82 / 191

83 Passing to the limit, we have

c + rV = (λ/r) ∫_{rV}^{∞} (w − rV) dF(w),

and the reservation strategy is

d = 1 if w > rV (job offer accepted), d = 0 if w ≤ rV (job offer rejected).

Then the probability that an unemployment spell exceeds duration t_u, given the hazard rate of exit (acceptance of a job) h_u = λ(1 − F(rV)), is

Pr(T_u > t_u) = exp( −λ(1 − F(rV)) t_u ). 83 / 191
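
A minimal numerical sketch (ours; the wage grid and all parameter values are made up) that solves c + rV = (λ/r)∫_{rV}(w − rV) dF(w) for the reservation wage on a discrete wage grid and reports the implied exit hazard and mean spell length:

```python
import numpy as np
from scipy.optimize import brentq

lam, r, c = 0.5, 0.05, 0.2                    # offer arrival rate, interest rate, search cost
wages = np.arange(1.0, 11.0)                  # wage offer support
probs = np.full(wages.size, 1.0 / wages.size) # F(w): uniform over the grid

def excess(x):
    # c + rV - (lambda/r) * E[(w - rV)+], evaluated at rV = x; increasing in x
    return x + c - (lam / r) * np.sum(probs * np.maximum(wages - x, 0.0))

rV = brentq(excess, -1000.0, wages.max())     # reservation wage
accept_prob = np.sum(probs[wages > rV])       # 1 - F(rV)
hazard = lam * accept_prob                    # exit rate out of unemployment
print("reservation wage:", round(rV, 3))
print("exit hazard:", round(hazard, 3), " mean spell length:", round(1.0 / hazard, 2))
```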

84 Returning to the entire population, with some observed characteristics x and unobserved characteristics θ, we have

Pr(T_u > t_u | x) = ∫_Θ exp[ −λ(x, θ)(1 − F(rV(x, θ))) t_u ] dµ(θ),

where

Θ = { θ : 0 < (λ(x, θ)/r) ∫_{rV}^{ω̄} (w − rV(θ, x)) dF(w) − c(x, θ) }.

Obviously, there is no general separable hazard specification that emerges. Instead, the hazard and covariates are linked by the solution to the Bellman equation, in which the reservation wage rV is implicitly defined. See Heckman and Singer (1984). 84 / 191

85 Duration Models with Unobservables In most applications, the analyst has data on some process, along with some observable characteristics regarding individuals included in the survey. In addition to these, there may be other characteristics of these individuals which are factors affecting the process but which are unmeasured. For example, in the Stanford study of heart transplant recipients, it is likely that each patient differed in some dimension, call it frailty, which affected their survival times after receiving transplants. Thus, the survivor function includes an unobservable θ:

P(T > t | x, θ) = exp( −H(t) U(x) V(θ) ) = exp( −H(t) exp(x′β + θ) ). 85 / 191

86 The statistician has the following issues to address: what estimation strategy to use to obtain β̂ in the presence of θ, and what estimation can reveal regarding the distribution of θ itself. Even when this distribution of θ is of no interest to the analyst, its presence will affect the consistency of estimation of β. Assuming some form of the mixing distribution of θ, dµ(θ), the data may be confronted with the integrated survivor function, where θ has been integrated out:

P(T > t | x) = ∫ exp( −H(t) exp(x′β + θ) ) dµ(θ). 86 / 191

87 Commonly used functional forms for dµ(θ) are gamma and normal distributions:

Gamma: dµ(θ) = [ b^a θ^{a−1} exp(−bθ) / Γ(a) ] dθ
Normal: dµ(θ) = [ exp(−(θ − a)^2 / (2b)) / √(2πb) ] dθ.

Both of these offer a flexible, analytically tractable and computationally convenient family of distributions. Both are fully described by the specification of the two parameters a, b above. 87 / 191

88 For all the claimed convenience of these specifications, along with lognormal variation, they do not all yield the same qualitative estimates for the coefficients on time dependence and observed covariates. Sensitivity of these estimators is apparent in one study, the Heckman and Singer analysis of labor earnings data, but not in the Manton, Stallard and Vaupel study of mortality risks among the aged. 88 / 191

89 Mortality Risks Among the Aged

I. Weibull Parameter Estimates: φ(t) = αt^{α−1}

Age   Gamma Heterogeneity   Inverse Gaussian   No Heterogeneity
…     … (.05)               5.88 (.08)         5.44 (.04)
…     … (.06)               5.98 (.09)         5.46 (.04)
…     … (.07)               6.35 (.11)         5.69 (.05)

II. Gompertz Parameter Estimates: φ(t) = e^{γt}

Age   Gamma Heterogeneity   Inverse Gaussian   No Heterogeneity
…     ( )                   ( )                ( )
…     ( )                   ( )                ( )
…     ( )                   ( )                ( )

(Standard errors in parentheses.) 89 / 191

90 Although the labor economics study shows qualitative differences in the model and in the duration dependence, the mortality study shows duration dependence that is robust to alternative specifications of the mixing distribution. It has yet to be shown what characterizes a data set which will be robust to alternative heterogeneity specifications. 90 / 191

91 Alternatively, in a study of child mortality, the specification of time dependence is shown to be sensitive to the inclusion of unobservables in the estimation strategy. In Monte Carlo studies, nonparametric maximum likelihood estimation has been shown to give unbiased estimates of covariate coefficients, provided one has chosen a particular form of time dependence. This is where theory must play an important role. 91 / 191

92 [Figure: estimated hazards with unobservables estimated by NPMLE versus unobservables ignored, under Weibull and Gompertz time dependence] 92 / 191

93 Without strong enough theory to suggest the underlying functional form of the time dependence, these studies suggest, the effect of covariates and the effect of time dependence cannot be distinguished, even with the use of a nonparametric estimation strategy. 93 / 191

94 The following digression on an area of statistics concerned with the extraction of true test scores focuses on the issues that underlie why duration models with underlying heterogeneity need strong a priori restrictions from theory to obtain identification. The demands on the data from models of this sort are far greater than those of regression models. 94 / 191

95 True test scores are discussed in Lord and Novick, Statistical Theories of Mental Test Scores (1968). Let
X = observed test score
ξ = error, with density h_ξ
u = unobserved true test score, with density g_u
X = ξ + u
f(X) = ∫ h_ξ(X − t) g_u(t) dt. 95 / 191

96 Notice that a survivor function of the form S(t) = ∫ k(t | θ) dµ(θ) has the same structure as the mixture of h_ξ(X − θ) above. The density of true test scores g_u(t) is the analog of the mixing distribution. The question in both problems is fundamentally the same: when can an observed histogram be decomposed and purged of some noise? J. W. Tukey does so with strong assumptions regarding the densities h_ξ and g_u in Named and Faceless Values, Sankhya. 96 / 191

97 Heuristically, you want to decompose observed from true test scores, something that can be accomplished only with some already known characteristics of the errors:

θ̂_i = X_i + (σ_Z^2 / σ_X^2)(X̄ − X_i).

In this linear model, an empirical Bayes estimator θ̂ uses a variance obtained from a previous study by ETS on errors in test scores to extract an estimate of u. 97 / 191

98 For the survivor analysis, we need a nonlinear version of this. For each individual, identify a normal density with mean θ̂_i and variance σ_i^2, where σ_i^2 is the variance of θ̂_i. The estimated true score distribution must be extracted from an observed histogram, where each portion of that histogram is one observation on an underlying distribution for that one individual. 98 / 191

99 [Figure: histogram of the individual estimates, each carrying a normal density η(θ̂_i, σ_i^2)] 99 / 191


101 The estimated true score distribution is reflated by the normally distributed errors:

ĝ_u(θ) = (1/n) Σ_{i=1}^{n} η(θ̂_i, σ_i^2).

Again, the analytical device most useful depends on the question at hand. If each individual realization is important, the histogram is useful. If the population distribution is under discussion, the histogram must be reflated by the errors. 101 / 191

102 Nonparametric estimation of the mixing distribution Two approaches to extracting estimates of µ(θ), α and β with a nonparametric approach to µ(θ) are: 1 to use a maximum amount of information to obtain the largest number of jump points (if η is the number of points of increase used, then N − η is the number of remaining θ's available to calculate the probabilities along those discrete levels of dµ(θ)); 2 to use a coarse distribution, say with 2 or 3 jumps, and recompute the maximum likelihood for additional jump points until the likelihood no longer increases. 102 / 191

103 The model, with some simplifying notation, for each individual l yields a probability of experiencing the event at time t_l, conditional on the unobservable taking value θ_j, of

f(t_l | θ_j, X_l) = f(t_l; λ_{j,l}) = αt_l^{α−1} λ_{j,l} exp(−t_l^α λ_{j,l}),

and unconditionally

f(t_l | X_l) = Σ_{j=1}^{J} p_j αt_l^{α−1} λ_{j,l} exp(−t_l^α λ_{j,l}) = Σ_{j=1}^{J} p_j f(t_l; λ_{j,l}),

where
λ_{j,l} = exp(X_l′β + θ_j)
J = number of points of increase in the mixing distribution
p_j = probability attached to the j-th point of increase. 103 / 191

104 For censored individuals,

S(t*_l | λ_l) = Σ_{j=1}^{J} p_j exp(−(t*_l)^α λ_{j,l}),

where t*_l is the censoring time. Since θ is unobserved, the p_j's and λ's can't be separated, and the previous method of proportional hazards is invalid. 104 / 191

105 Direct Estimation of Time-Dependent Models with Unobserved Heterogeneity The problems of estimating such a model with no a priori restrictions on heterogeneity, such as whether dµ(θ) is unimodal or has a finite mean or any other moments, are apparent from examining the absolute simplest model conceivable. Imagine there are only two types of individuals. Their values of the unobservable are θ_1 and θ_2. Their proportions in the sample population are p and 1 − p. 105 / 191

106 The data may then be confronted with an integrated survivor function

S(t | X) = p exp( −H(t) exp(X′β + θ_1) ) + (1 − p) exp( −H(t) exp(X′β + θ_2) ).

There are three parameters to be assigned values relating to the heterogeneity: θ_1, θ_2, p. These are in addition to the vector β and whatever α is implicit in the time dependence specifications H(t). 106 / 191

107 [Figure: two-point mixing distribution with mass p at θ_1 and mass 1 − p at θ_2] 107 / 191

108 For three support points, we have

S(t | X) = Σ_{k=1}^{3} p_k exp( −H(t) exp(X′β + θ_k) ).

For any n support points, 2n − 1 free parameters are introduced. 108 / 191

109 It is then no surprise that, in order to distinguish between survivor functions, which are all monotone decreasing functions of time, across separate possible combinations of θ mixtures, Monte Carlo studies indicate that the size of the sample must be on the order of 20,000 observations. One Monte Carlo study used a sample created using a gamma distributed unobservable. The nonparametric maximum likelihood technique was poor at reaching any discrete mixture approaching the shape of the continuous gamma density. For example, the next figure was not forthcoming. 109 / 191

110 [Figure: discrete mixing distribution whose mass points trace out the shape of a gamma density] 110 / 191

111 However, the data did yield estimates of the covariates β which were consistent with the actual ones used to create the sample. 111 / 191

112 The E.M. (Expectation-Maximization) algorithm involves two separate steps.

E-step: p_j^{(m+1)} = (1/N) Σ_{l=1}^{N} p_j^{(m)} f(t_l; λ_{j,l}^{(m)}) / Σ_{j′=1}^{J} p_{j′}^{(m)} f(t_l; λ_{j′,l}^{(m)}) = (1/N) Σ_{l=1}^{N} p_j^{(m)} φ_{j,l}^{(m+1)}

M-step: maximize L = Σ_{l=1}^{N} Σ_{j=1}^{J} ln f(t_l; λ_{j,l}) p_j^{(m)} φ_{j,l}^{(m+1)}. 112 / 191

113 The procedure is as follows:
1 Select starting values: (p_1^{(0)}, …, p_J^{(0)}), (θ_1^{(0)}, …, θ_J^{(0)}), α^{(0)}, β^{(0)}.
2 Form φ_{j,l}^{(1)} = f(t_l; λ_{j,l}^{(0)}) / Σ_{j′=1}^{J} p_{j′}^{(0)} f(t_l; λ_{j′,l}^{(0)}). Substitute into L^{(1)} = Σ_l Σ_j ln f(t_l; λ_{j,l}) φ_{j,l}^{(1)} and maximize L to obtain α^{(1)}, β^{(1)}, θ^{(1)}.
3 Calculate p_j^{(1)}, a weighted average of p_j^{(0)}.
4 Repeat from step 2, with φ_{j,l}^{(2)} and L^{(2)}, to convergence. 113 / 191
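
A stripped-down EM sketch (ours) for the simplest version of this setup: a two-point mixture of exponential durations with no covariates and no censoring, so that only the p_j and θ_j are updated and the M-step has a closed form. It follows the E-step/M-step pattern above but is not the full Weibull-with-covariates procedure.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate a two-type population with constant hazards theta = 0.5 and 3.0.
true_theta, true_p = np.array([0.5, 3.0]), np.array([0.6, 0.4])
types = rng.choice(2, size=3000, p=true_p)
t = rng.exponential(1.0 / true_theta[types])

theta, p = np.array([0.3, 2.0]), np.array([0.5, 0.5])     # crude starting values
for _ in range(500):
    # E-step: posterior probability that observation l comes from support point j.
    dens = p * theta * np.exp(-np.outer(t, theta))        # shape (N, J)
    phi = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed form for exponential components.
    p = phi.mean(axis=0)
    theta = phi.sum(axis=0) / (phi * t[:, None]).sum(axis=0)

print("estimated p    :", p.round(3))      # should be close to (0.6, 0.4)
print("estimated theta:", theta.round(3))  # should be close to (0.5, 3.0)
```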

114 Dempster, Laird and Rubin show this procedure does indeed converge under exceedingly weak assumptions on f. Problems do arise with this technique: the likelihood function tends to have bumps, so the convergence may yield a local rather than a global maximum. This technique can also be very slow to converge over flat portions of the likelihood. 114 / 191

115 In practice, it makes sense to combine this method with a steepest ascent route. Also, this is an unbalanced problem in that one part (time dependence and observed covariates) is parametric and the other part (unobservables) is nonparametric. Consistency of an estimate of m(θ) is available only for huge data sets (10k observations), whereas convergence is quicker for α and β. 115 / 191

116 The recoverability of m(θ) is difficult for even the simplest form of the survivor function. Take

S(t | θ) = exp(−θt).

Observable data is on

S(t) = ∫_0^∞ exp(−θt) dµ(θ). 116 / 191

117 S(t) is the Laplace transform of the probability distribution µ(θ); it is invertible if S(t) is known in continuous time. But very small changes in S(t) will yield large differences in dµ(θ), and the analytical problem yields only approximations of S(t) at some times t. 117 / 191

118 Using the EM algorithm often yields a mixing distribution with only 3-8 points. A spiked estimated mixing distribution is likely an approximation to a true distribution with more variance: the data not being rich enough to distinguish between all such grouped peaks. 118 / 191

119 Given a discrete mixing distribution, there is thus a finite number of duration densities:

g(t | θ) = θ exp(−θt) (exponential)
g(t | θ) = αt^{α−1} exp(θ) exp(−t^α exp(θ)) (Weibull)
ḡ(t) = ∫ g(t | θ) dµ(θ) (observed mixed density). 119 / 191

120 An interesting mathematical result on variation diminishing transformations allows the analyst to infer information regarding the mixing distribution from the histogram of the durations. If g(t | θ) exhibits sign regularity, then the number of times that the mixing distribution crosses an arbitrary constant c must be at least the number of times that the histogram ḡ crosses the same constant function c. [Figure: ḡ(t) and m(θ) plotted against a constant level c] 120 / 191

121 Formally, sign regularity requires that, for all values t_1 < t_2 and θ_1 < θ_2, either

det [ g(t_1 | θ_1)  g(t_1 | θ_2) ; g(t_2 | θ_1)  g(t_2 | θ_2) ] ≥ 0,

or that the same determinant is less than or equal to zero for all t_1 < t_2, θ_1 < θ_2. Every member of the exponential family of functions exhibits this property. The variation diminishing property is

#{ sign changes of m(θ) − c for θ ∈ R } ≥ #{ sign changes of ḡ(t) − c for t ∈ R }. 121 / 191

122 This not only determines the minimum number of modes that m(θ) may have, but gives local information on amplitude, since c may be chosen arbitrarily. 122 / 191

123 Tests which are insensitive to the specification of heterogeneity An important question given these results which show sensitivity of estimators to heterogeneity specification is what simple tests are reliable indicators of the basic structural form of the time dependence in a duration model. 123 / 191

124 For example, one study by Chahnazarian, Menken and Choe shows that a commonly used practice of picking an arbitrary time segmentation on a duration model in order to run logistic regressions on transitions at say 6 month intervals is quite sensitive to that arbitrary choice. They show that covariate coefficients vary qualitatively depending on the choice of time segmentation. The lesson, one might say, is to devote some effort to modeling the underlying dynamic process before seeking to obtain results on what variables affect that process! 124 / 191

125 Thus, some preliminary tests are in order to investigate what forms of switching processes are consistent with the data. First, I discuss properties of Markov switching processes in discrete time. Following this, I present a test which can reject a broad class of models, all based on the notion of partial exchangeability. 125 / 191

126 Finally, this is extended to a continuous time model of a particular form of separability, namely where switching times are given by a Poisson process (i.e. exponential waiting times) but where the switches across states depend only on the current state (Markovian). 126 / 191

127 Discrete time switching models The following discussion of Markov processes leads to an indication of the assumptions that must be made in order to test duration models when only prospective point sampling data is available, not event history data. Let the following diagram represent transitions between two states, represented by the two points 0 and 1. The dashed red line is one individual's history; the solid line is the other's. [Figure: two sample paths switching between states 0 and 1 over time t] 127 / 191

128 The definition of the Markov property is

P{X(kΔ) = i_k | X((k−1)Δ) = i_{k−1}, X((k−2)Δ) = i_{k−2}, …, X(0) = i_0} = P{X(kΔ) = i_k | X((k−1)Δ) = i_{k−1}}, for k = 1, 2, …,

where k = 0 is the first survey. Time homogeneity requires that P{X(kΔ) = i_k | X((k−1)Δ) = i_{k−1}} is the same for all sampling times k. 128 / 191

129 Suppose that one only has two surveys. All that may be estimated is P(X(Δ) = i_1 | X(0) = i_0) = m_{i_0, i_1}. A 2 × 2 transition matrix summarizes the probabilities assigned to the possible transitions (0, 0), (0, 1), (1, 0), (1, 1) at the survey times k = 0 and k = Δ:

P(0, Δ) = M = [ a  1−a ; 1−b  b ]. 129 / 191

130 To examine the properties of such a Markov process, assume that we have a discrete time event model where the transition times are identified with the sampling times. Assume that switches occur just before the sampling times. Let the transition matrix have the actual values

M = [ 1/4  3/4 ; 5/8  3/8 ].

If switches occurred twice as often, say at times 0, Δ/2, and Δ, then in order for the property of time stationarity to hold there must be a new half-interval transition matrix M_0 such that M_0 M_0 = M. 130 / 191

131 Similarly, if time stationarity is to hold with sampling and switching times at 0, Δ/3, 2Δ/3, and Δ, then M_0 M_0 M_0 = M. For the above value of M, the first condition does not hold, but the second does. As the number of subintervals changes, the existence of such roots changes. 131 / 191
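
The existence of such roots is easy to check numerically. The sketch below (ours) attempts to construct a k-th root of M through its eigendecomposition and accepts it only if the result is a genuine transition matrix; with the M above it fails for k = 2 and succeeds for k = 3, as stated.

```python
import numpy as np

# Discrete-time transition matrix over one sampling interval (values from the slide).
M = np.array([[0.25, 0.75],
              [0.625, 0.375]])

def stochastic_root(M, k, tol=1e-10):
    """Try to build a k-th root of M from its eigendecomposition.

    Returns the root if it is real, non-negative, and row-stochastic
    (so that k equally spaced switching dates are compatible with M),
    otherwise returns None.
    """
    eigvals, H = np.linalg.eig(M.astype(complex))
    roots = np.empty_like(eigvals)
    for i, lam in enumerate(eigvals):
        if abs(lam.imag) < tol and lam.real < 0.0:
            if k % 2 == 0:
                return None                              # negative eigenvalue: no real even root
            roots[i] = -abs(lam.real) ** (1.0 / k)       # real odd root of a negative number
        else:
            roots[i] = lam ** (1.0 / k)                  # principal root otherwise
    M0 = (H @ np.diag(roots) @ np.linalg.inv(H)).real
    if (M0 >= -tol).all() and np.allclose(M0.sum(axis=1), 1.0):
        return M0
    return None

print("half-interval root with M0 @ M0 = M      :", stochastic_root(M, 2))
print("third-interval root with M0 @ M0 @ M0 = M:\n", stochastic_root(M, 3))
```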

132 A Continuous Time Switching Process Assume that two rates describe waiting times for the transitions from state 0 to state 1 and vice versa, and that transition probabilities are independent across spells:

P(T_0 > s) = exp(−r_0 s), P(T_1 > s) = exp(−r_1 s). 132 / 191

133 A basic test of this specification, as before, is a graph of the log of the percent of the population surviving against time: one should observe linear functions. [Figure: −ln S(t) against duration of spell in state 0 (slope r_0) and against duration of spell in state 1 (slope r_1)] 133 / 191

134 The transition rates may be summarized by the matrix R:

R = [ −r_0  r_0 ; r_1  −r_1 ].

Reducing R to eigenvalue form,

R = H diag(λ_1, …, λ_s) H^{−1}, exp(ΔR) = H diag(exp(Δλ_1), …, exp(Δλ_s)) H^{−1}.

Here Δ = length of time between observations. 134 / 191

135 Taking logs, ΔR = ln M, or (1/Δ) ln M = R. If the data allow an R matrix with real roots,

R = [ −r_0  r_0 ; r_1  −r_1 ], (4)

then this model has a probabilistic interpretation, with

M = B diag(λ_1, …, λ_s) B^{−1}. 135 / 191

136 In general,

ln M = B diag( ln|λ_j| + i(arg(λ_j) + 2πk) ) B^{−1},

where the right-hand term uses the polar decomposition of the eigenvalues, and where, for

M = [ a  1−a ; 1−b  b ],

ln M = [ ln(a + b − 1) / (a + b − 2) ] · [ a−1  1−a ; 1−b  b−1 ]. 136 / 191

137 If a + b > 1, then the observations are compatible with a unique continuous time transition model with transition rates r_0 and r_1, where

r_0 = (1 − a) ln(a + b − 1) / [Δ(a + b − 2)],
r_1 = (1 − b) ln(a + b − 1) / [Δ(a + b − 2)].

If a + b < 1, then ln(a + b − 1) is complex and there is no probabilistic interpretation of the R matrix. 137 / 191
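
A short sketch (ours; the values of a, b, and Δ are hypothetical) that recovers r_0 and r_1 from an embeddable M and verifies the answer with the matrix exponential and logarithm:

```python
import numpy as np
from scipy.linalg import expm, logm

# Hypothetical two-state transition matrix estimated from two survey waves a span
# Delta apart: a = P(stay in 0), b = P(stay in 1). The numbers are made up.
a, b, Delta = 0.7, 0.8, 1.0
M = np.array([[a, 1 - a],
              [1 - b, b]])

if a + b > 1:
    # Closed-form rates from the slides (embeddable case).
    c = np.log(a + b - 1) / (Delta * (a + b - 2))
    r0, r1 = (1 - a) * c, (1 - b) * c
    R = np.array([[-r0, r0],
                  [r1, -r1]])
    print("r0, r1 =", r0, r1)
    print("exp(Delta * R) reproduces M:", np.allclose(expm(Delta * R), M))
    print("matrix log check           :", np.allclose(logm(M) / Delta, R))
else:
    print("a + b <= 1: no continuous-time Markov model is compatible with M")
```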

138 Suppose that the transition rates are dependent on some observables x:

r_0 = c_0 exp(x′β), r_1 = c_1 exp(x′β).

What happens when this form is used on discrete sampling data?

P(0, Δ) = exp(ΔR) = exp( Δ [ −c_0 exp(x′β)  c_0 exp(x′β) ; c_1 exp(x′β)  −c_1 exp(x′β) ] ) = [ m_00  m_01 ; m_10  m_11 ]. 138 / 191

139 The likelihood is then

L = ∏_{i=1}^{N} m_00^{δ_i^{00}} m_01^{δ_i^{01}} m_10^{δ_i^{10}} m_11^{δ_i^{11}} = ∏_{i=1}^{N} ∏_{j,k} [ exp(ΔR(x_i′β)) ]_{jk}^{δ_i^{jk}},

where δ_i^{jk} = 1 if individual i occupies state j at the first survey and state k at the second. Reference: J. Cohen and B. Singer, Malaria in Nigeria: Constrained continuous-time Markov models for discrete-time longitudinal data, in S. Levin (ed.), Lectures on Mathematics in the Life Sciences, American Mathematical Society, pp. 69-133, 1979. 139 / 191

140 Nonstationary Continuous Time Models and Discrete Time Data Here we ask when the transition matrix M = [ a  1−a ; 1−b  b ], based on observations in two states, is consistent with a non-stationary continuous time model:

1) ∂P(s, t)/∂t = P(s, t) R(t)
2) ∂P(s, t)/∂s = −R(s) P(s, t)
P(t, t) = I
P_{ij}(s, t) = P(X(t) = j | X(s) = i), for s < t

R(t) = [ −r_0(t)  r_0(t) ; r_1(t)  −r_1(t) ]. 140 / 191

141 Nonstationary Continuous Time Models and Discrete Time Data In the R matrix, r(t) can be any non-negative function of t. Solutions to the differential equations (1) and (2) are conditional probabilities. There exists a P(0, Δ) iff (a + b) > 1. Given that this condition holds, a problem with identification still needs to be solved: assumptions or indications from theory need to be applied to the parametrization. Ref.: G. S. Goodman 1970, An Intrinsic Time Model for Nonstationary Markov Chains, Zeitschrift für Wahrscheinlichkeitstheorie. 141 / 191

142 In the case where continuous event history data is available, the nonstationary Markov property can be tested and the nonstationarity may be characterized. First, use the language of a counting process:

N(t) = (N_01(t), N_10(t)) = (cumulative transitions from 0 → 1 in time 0 ≤ s < t, cumulative transitions from 1 → 0 in time 0 ≤ s < t). 142 / 191

143 If the Markov property holds, then by examining the 0 → 1 transitions,

∫_0^t r̂_0(u) du = Σ_{k: t_k ≤ t} 1/Y_0(t_k^{01}),

where Y_0(t) is the number of individuals at risk in state 0 at time t. 143 / 191

144 For separate time periods t_0, t_1, …, this gives rise to an estimated curve of ∫_0^t r_0(u) du, as below, which conforms to a Weibull function. [Figure: estimated integrated transition rate ∫_0^t r_0(u) du plotted at t_0, t_1, t_2, …] 144 / 191

145 If ∫_0^t r_0(u) du = t^α, then r_0(t) = αt^{α−1}. Notice that the estimation above applies the same Aalen theory of counting processes presented at the beginning of these notes. The variance is

Var[ ∫_0^t r(u) du ] = Σ_{k: t_k^{(0,1)} ≤ t} [ 1/Y_0(t_k^{(0,1)}) ]^2.

This technique cannot be applied with point sampling data. 145 / 191

146 An example of the theoretical restriction that must be imposed with point sampling data follows. It can be of a relatively simple form, as in the study of malaria: the climate's effect on the mosquito population led to the restrictive presumption that the winter incidence rates of malaria would be distinctly different from the summer's.

R(t) = R_1 for 0 < t < Δ/2, R_2 for Δ/2 < t < Δ
M_1 = P(0, Δ/2) = exp((Δ/2) R_1), M_2 = P(Δ/2, Δ) = exp((Δ/2) R_2)
M = [ a  1−a ; 1−a  a ] = M_1 M_2 = [ α  1−α ; 1−α  α ] [ β  1−β ; 1−β  β ]. 146 / 191

147 In order for the transitions over a year to have a Markovian probabilistic interpretation, a > 1/2; and for the 6-month seasons, α > 1/2 and β > 1/2. From the condition that M = M_1 M_2, it follows that a = αβ + (1 − α)(1 − β). Graphically, this restriction requires (α, β) to lie on a locus of points for each possible a. With data from point sampling available at six month intervals, the estimated â, α̂, and β̂ can be used to test this restriction. 147 / 191

148 Time Aggregation There are dangers in using comparisons between estimated discrete time Markov transitions and continuous time switching probabilities to draw conclusions regarding the dynamics of the process. For example, comparing the sum of the diagonal elements of the matrix M^k to the sum of the diagonals of P̂(0, kΔ) may be interpreted as an examination of those who stay in the originating state over all time periods k versus those who begin and end in the same state, but who may make transitions in and out in the intervening periods. 148 / 191

149 Since the second condition seems weaker, one might expect the following regularity: tr(M^k) < tr P̂(0, kΔ). This cannot be claimed with certainty, since the Markov model applied to discrete sampling points would not capture asynchronous switching times. Suppose the population were not homogeneous, but a simple mixture of movers and stayers, with stayers making up a proportion s, 0 ≤ s < 1, of the population. 149 / 191


More information

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints

A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Noname manuscript No. (will be inserted by the editor) A Recursive Formula for the Kaplan-Meier Estimator with Mean Constraints Mai Zhou Yifan Yang Received: date / Accepted: date Abstract In this note

More information

Identification of Models of the Labor Market

Identification of Models of the Labor Market Identification of Models of the Labor Market Eric French and Christopher Taber, Federal Reserve Bank of Chicago and Wisconsin November 6, 2009 French,Taber (FRBC and UW) Identification November 6, 2009

More information

Likelihood Construction, Inference for Parametric Survival Distributions

Likelihood Construction, Inference for Parametric Survival Distributions Week 1 Likelihood Construction, Inference for Parametric Survival Distributions In this section we obtain the likelihood function for noninformatively rightcensored survival data and indicate how to make

More information

Physics 509: Bootstrap and Robust Parameter Estimation

Physics 509: Bootstrap and Robust Parameter Estimation Physics 509: Bootstrap and Robust Parameter Estimation Scott Oser Lecture #20 Physics 509 1 Nonparametric parameter estimation Question: what error estimate should you assign to the slope and intercept

More information

Decomposing Duration Dependence in a Stopping Time Model

Decomposing Duration Dependence in a Stopping Time Model Decomposing Duration Dependence in a Stopping Time Model Fernando Alvarez University of Chicago Katarína Borovičková New York University June 8, 2015 Robert Shimer University of Chicago Abstract We develop

More information

Exam C Solutions Spring 2005

Exam C Solutions Spring 2005 Exam C Solutions Spring 005 Question # The CDF is F( x) = 4 ( + x) Observation (x) F(x) compare to: Maximum difference 0. 0.58 0, 0. 0.58 0.7 0.880 0., 0.4 0.680 0.9 0.93 0.4, 0.6 0.53. 0.949 0.6, 0.8

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

In contrast, parametric techniques (fitting exponential or Weibull, for example) are more focussed, can handle general covariates, but require

In contrast, parametric techniques (fitting exponential or Weibull, for example) are more focussed, can handle general covariates, but require Chapter 5 modelling Semi parametric We have considered parametric and nonparametric techniques for comparing survival distributions between different treatment groups. Nonparametric techniques, such as

More information

Semiparametric Generalized Linear Models

Semiparametric Generalized Linear Models Semiparametric Generalized Linear Models North American Stata Users Group Meeting Chicago, Illinois Paul Rathouz Department of Health Studies University of Chicago prathouz@uchicago.edu Liping Gao MS Student

More information

More on Roy Model of Self-Selection

More on Roy Model of Self-Selection V. J. Hotz Rev. May 26, 2007 More on Roy Model of Self-Selection Results drawn on Heckman and Sedlacek JPE, 1985 and Heckman and Honoré, Econometrica, 1986. Two-sector model in which: Agents are income

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Survival Analysis: Weeks 2-3. Lu Tian and Richard Olshen Stanford University

Survival Analysis: Weeks 2-3. Lu Tian and Richard Olshen Stanford University Survival Analysis: Weeks 2-3 Lu Tian and Richard Olshen Stanford University 2 Kaplan-Meier(KM) Estimator Nonparametric estimation of the survival function S(t) = pr(t > t) The nonparametric estimation

More information

Survival Analysis I (CHL5209H)

Survival Analysis I (CHL5209H) Survival Analysis Dalla Lana School of Public Health University of Toronto olli.saarela@utoronto.ca January 7, 2015 31-1 Literature Clayton D & Hills M (1993): Statistical Models in Epidemiology. Not really

More information

Part VII. Accounting for the Endogeneity of Schooling. Endogeneity of schooling Mean growth rate of earnings Mean growth rate Selection bias Summary

Part VII. Accounting for the Endogeneity of Schooling. Endogeneity of schooling Mean growth rate of earnings Mean growth rate Selection bias Summary Part VII Accounting for the Endogeneity of Schooling 327 / 785 Much of the CPS-Census literature on the returns to schooling ignores the choice of schooling and its consequences for estimating the rate

More information

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY Ingo Langner 1, Ralf Bender 2, Rebecca Lenz-Tönjes 1, Helmut Küchenhoff 2, Maria Blettner 2 1

More information

Introduction to Reliability Theory (part 2)

Introduction to Reliability Theory (part 2) Introduction to Reliability Theory (part 2) Frank Coolen UTOPIAE Training School II, Durham University 3 July 2018 (UTOPIAE) Introduction to Reliability Theory 1 / 21 Outline Statistical issues Software

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Robust estimates of state occupancy and transition probabilities for Non-Markov multi-state models

Robust estimates of state occupancy and transition probabilities for Non-Markov multi-state models Robust estimates of state occupancy and transition probabilities for Non-Markov multi-state models 26 March 2014 Overview Continuously observed data Three-state illness-death General robust estimator Interval

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Survival Distributions, Hazard Functions, Cumulative Hazards

Survival Distributions, Hazard Functions, Cumulative Hazards BIO 244: Unit 1 Survival Distributions, Hazard Functions, Cumulative Hazards 1.1 Definitions: The goals of this unit are to introduce notation, discuss ways of probabilistically describing the distribution

More information

Decomposing Duration Dependence in a Stopping Time Model

Decomposing Duration Dependence in a Stopping Time Model Decomposing Duration Dependence in a Stopping Time Model Fernando Alvarez University of Chicago Katarína Borovičková New York University February 5, 2016 Robert Shimer University of Chicago Abstract We

More information

Hakone Seminar Recent Developments in Statistics

Hakone Seminar Recent Developments in Statistics Hakone Seminar Recent Developments in Statistics November 12-14, 2015 Hotel Green Plaza Hakone: http://www.hgp.co.jp/language/english/sp/ Organizer: Masanobu TANIGUCHI (Research Institute for Science &

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Stochastic Modelling Unit 1: Markov chain models

Stochastic Modelling Unit 1: Markov chain models Stochastic Modelling Unit 1: Markov chain models Russell Gerrard and Douglas Wright Cass Business School, City University, London June 2004 Contents of Unit 1 1 Stochastic Processes 2 Markov Chains 3 Poisson

More information

Estimation for Modified Data

Estimation for Modified Data Definition. Estimation for Modified Data 1. Empirical distribution for complete individual data (section 11.) An observation X is truncated from below ( left truncated) at d if when it is at or below d

More information

Proportional hazards regression

Proportional hazards regression Proportional hazards regression Patrick Breheny October 8 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/28 Introduction The model Solving for the MLE Inference Today we will begin discussing regression

More information

A Distributional Framework for Matched Employer Employee Data

A Distributional Framework for Matched Employer Employee Data A Distributional Framework for Matched Employer Employee Data (Preliminary) Interactions - BFI Bonhomme, Lamadon, Manresa University of Chicago MIT Sloan September 26th - 2015 Wage Dispersion Wages are

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Brief Review on Estimation Theory

Brief Review on Estimation Theory Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on

More information

Limited Dependent Variables and Panel Data

Limited Dependent Variables and Panel Data and Panel Data June 24 th, 2009 Structure 1 2 Many economic questions involve the explanation of binary variables, e.g.: explaining the participation of women in the labor market explaining retirement

More information

1 Degree distributions and data

1 Degree distributions and data 1 Degree distributions and data A great deal of effort is often spent trying to identify what functional form best describes the degree distribution of a network, particularly the upper tail of that distribution.

More information

Notes on Heterogeneity, Aggregation, and Market Wage Functions: An Empirical Model of Self-Selection in the Labor Market

Notes on Heterogeneity, Aggregation, and Market Wage Functions: An Empirical Model of Self-Selection in the Labor Market Notes on Heterogeneity, Aggregation, and Market Wage Functions: An Empirical Model of Self-Selection in the Labor Market Heckman and Sedlacek, JPE 1985, 93(6), 1077-1125 James Heckman University of Chicago

More information

Random variables. DS GA 1002 Probability and Statistics for Data Science.

Random variables. DS GA 1002 Probability and Statistics for Data Science. Random variables DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Motivation Random variables model numerical quantities

More information

The Weibull Distribution

The Weibull Distribution The Weibull Distribution Patrick Breheny October 10 Patrick Breheny University of Iowa Survival Data Analysis (BIOS 7210) 1 / 19 Introduction Today we will introduce an important generalization of the

More information

Survival Analysis Math 434 Fall 2011

Survival Analysis Math 434 Fall 2011 Survival Analysis Math 434 Fall 2011 Part IV: Chap. 8,9.2,9.3,11: Semiparametric Proportional Hazards Regression Jimin Ding Math Dept. www.math.wustl.edu/ jmding/math434/fall09/index.html Basic Model Setup

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

STAT 331. Accelerated Failure Time Models. Previously, we have focused on multiplicative intensity models, where

STAT 331. Accelerated Failure Time Models. Previously, we have focused on multiplicative intensity models, where STAT 331 Accelerated Failure Time Models Previously, we have focused on multiplicative intensity models, where h t z) = h 0 t) g z). These can also be expressed as H t z) = H 0 t) g z) or S t z) = e Ht

More information

Survival Analysis. Lu Tian and Richard Olshen Stanford University

Survival Analysis. Lu Tian and Richard Olshen Stanford University 1 Survival Analysis Lu Tian and Richard Olshen Stanford University 2 Survival Time/ Failure Time/Event Time We will introduce various statistical methods for analyzing survival outcomes What is the survival

More information

R. Koenker Spring 2017 Economics 574 Problem Set 1

R. Koenker Spring 2017 Economics 574 Problem Set 1 R. Koenker Spring 207 Economics 574 Problem Set.: Suppose X, Y are random variables with joint density f(x, y) = x 2 + xy/3 x [0, ], y [0, 2]. Find: (a) the joint df, (b) the marginal density of X, (c)

More information

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006 Hypothesis Testing Part I James J. Heckman University of Chicago Econ 312 This draft, April 20, 2006 1 1 A Brief Review of Hypothesis Testing and Its Uses values and pure significance tests (R.A. Fisher)

More information

Multivariate Survival Analysis

Multivariate Survival Analysis Multivariate Survival Analysis Previously we have assumed that either (X i, δ i ) or (X i, δ i, Z i ), i = 1,..., n, are i.i.d.. This may not always be the case. Multivariate survival data can arise in

More information

Computational treatment of the error distribution in nonparametric regression with right-censored and selection-biased data

Computational treatment of the error distribution in nonparametric regression with right-censored and selection-biased data Computational treatment of the error distribution in nonparametric regression with right-censored and selection-biased data Géraldine Laurent 1 and Cédric Heuchenne 2 1 QuantOM, HEC-Management School of

More information

Analysis of Gamma and Weibull Lifetime Data under a General Censoring Scheme and in the presence of Covariates

Analysis of Gamma and Weibull Lifetime Data under a General Censoring Scheme and in the presence of Covariates Communications in Statistics - Theory and Methods ISSN: 0361-0926 (Print) 1532-415X (Online) Journal homepage: http://www.tandfonline.com/loi/lsta20 Analysis of Gamma and Weibull Lifetime Data under a

More information

ECON 721: Lecture Notes on Duration Analysis. Petra E. Todd

ECON 721: Lecture Notes on Duration Analysis. Petra E. Todd ECON 721: Lecture Notes on Duration Analysis Petra E. Todd Fall, 213 2 Contents 1 Two state Model, possible non-stationary 1 1.1 Hazard function.......................... 1 1.2 Examples.............................

More information

GOV 2001/ 1002/ E-2001 Section 10 1 Duration II and Matching

GOV 2001/ 1002/ E-2001 Section 10 1 Duration II and Matching GOV 2001/ 1002/ E-2001 Section 10 1 Duration II and Matching Mayya Komisarchik Harvard University April 13, 2016 1 Heartfelt thanks to all of the Gov 2001 TFs of yesteryear; this section draws heavily

More information

CIMAT Taller de Modelos de Capture y Recaptura Known Fate Survival Analysis

CIMAT Taller de Modelos de Capture y Recaptura Known Fate Survival Analysis CIMAT Taller de Modelos de Capture y Recaptura 2010 Known Fate urvival Analysis B D BALANCE MODEL implest population model N = λ t+ 1 N t Deeper understanding of dynamics can be gained by identifying variation

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Survival Analysis. Stat 526. April 13, 2018

Survival Analysis. Stat 526. April 13, 2018 Survival Analysis Stat 526 April 13, 2018 1 Functions of Survival Time Let T be the survival time for a subject Then P [T < 0] = 0 and T is a continuous random variable The Survival function is defined

More information

Maximum likelihood estimation of a log-concave density based on censored data

Maximum likelihood estimation of a log-concave density based on censored data Maximum likelihood estimation of a log-concave density based on censored data Dominic Schuhmacher Institute of Mathematical Statistics and Actuarial Science University of Bern Joint work with Lutz Dümbgen

More information

Lecture 7. Poisson and lifetime processes in risk analysis

Lecture 7. Poisson and lifetime processes in risk analysis Lecture 7. Poisson and lifetime processes in risk analysis Jesper Rydén Department of Mathematics, Uppsala University jesper.ryden@math.uu.se Statistical Risk Analysis Spring 2014 Example: Life times of

More information

Accelerated Failure Time Models

Accelerated Failure Time Models Accelerated Failure Time Models Patrick Breheny October 12 Patrick Breheny University of Iowa Survival Data Analysis (BIOS 7210) 1 / 29 The AFT model framework Last time, we introduced the Weibull distribution

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.2 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

1 The problem of survival analysis

1 The problem of survival analysis 1 The problem of survival analysis Survival analysis concerns analyzing the time to the occurrence of an event. For instance, we have a dataset in which the times are 1, 5, 9, 20, and 22. Perhaps those

More information

Modelling geoadditive survival data

Modelling geoadditive survival data Modelling geoadditive survival data Thomas Kneib & Ludwig Fahrmeir Department of Statistics, Ludwig-Maximilians-University Munich 1. Leukemia survival data 2. Structured hazard regression 3. Mixed model

More information

Continuous-time Markov Chains

Continuous-time Markov Chains Continuous-time Markov Chains Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester gmateosb@ece.rochester.edu http://www.ece.rochester.edu/~gmateosb/ October 23, 2017

More information

ST745: Survival Analysis: Nonparametric methods

ST745: Survival Analysis: Nonparametric methods ST745: Survival Analysis: Nonparametric methods Eric B. Laber Department of Statistics, North Carolina State University February 5, 2015 The KM estimator is used ubiquitously in medical studies to estimate

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

AGEC 661 Note Fourteen

AGEC 661 Note Fourteen AGEC 661 Note Fourteen Ximing Wu 1 Selection bias 1.1 Heckman s two-step model Consider the model in Heckman (1979) Y i = X iβ + ε i, D i = I {Z iγ + η i > 0}. For a random sample from the population,

More information

Estimation of Quantiles

Estimation of Quantiles 9 Estimation of Quantiles The notion of quantiles was introduced in Section 3.2: recall that a quantile x α for an r.v. X is a constant such that P(X x α )=1 α. (9.1) In this chapter we examine quantiles

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Multi-state Models: An Overview

Multi-state Models: An Overview Multi-state Models: An Overview Andrew Titman Lancaster University 14 April 2016 Overview Introduction to multi-state modelling Examples of applications Continuously observed processes Intermittently observed

More information

One-Parameter Processes, Usually Functions of Time

One-Parameter Processes, Usually Functions of Time Chapter 4 One-Parameter Processes, Usually Functions of Time Section 4.1 defines one-parameter processes, and their variations (discrete or continuous parameter, one- or two- sided parameter), including

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Distribution Fitting (Censored Data)

Distribution Fitting (Censored Data) Distribution Fitting (Censored Data) Summary... 1 Data Input... 2 Analysis Summary... 3 Analysis Options... 4 Goodness-of-Fit Tests... 6 Frequency Histogram... 8 Comparison of Alternative Distributions...

More information

Practice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes:

Practice Exam 1. (A) (B) (C) (D) (E) You are given the following data on loss sizes: Practice Exam 1 1. Losses for an insurance coverage have the following cumulative distribution function: F(0) = 0 F(1,000) = 0.2 F(5,000) = 0.4 F(10,000) = 0.9 F(100,000) = 1 with linear interpolation

More information