Hidden Markov model likelihoods and their derivatives behave like i.i.d. ones: Details.


Hidden Markov model likelihoods and their derivatives behave like i.i.d. ones: Details.

Peter J. Bickel, Ya'acov Ritov, Tobias Rydén

Abstract. We consider the log-likelihood function of hidden Markov models, its derivatives and expectations of these (such as different information functions). We give explicit expressions for these functions and bound them as the size of the chain increases. We apply our bounds to obtain partial second order asymptotics and some qualitative properties of a special model, as well as to extend some results of Petrie (1969).

Key words and phrases: Hidden Markov model, incomplete data, missing data, asymptotic normality. AMS 1991 classification: 62M09.

1 Introduction

In two previous papers, Bickel and Ritov (1996) and Bickel, Ritov and Rydén (1998), we developed the theory of inference for hidden Markov models (HMMs) when the state space of the chain is finite but the hiding mechanism is not finitely-valued. We recall the definition of these models. An HMM is a discrete-time stochastic process {(X_t, Y_t)} such that (i) {X_t} is a Markov chain, and (ii) given {X_t}, {Y_t} is a sequence of conditionally independent random variables, with the conditional distribution of Y_n depending on {X_t} only through X_n. The distribution of {Y_t} is assumed to depend smoothly on a Euclidean parameter θ. Equivalently, an HMM can be thought of as Y_n = h(X_n, ε_n), where {ε_t} is an i.i.d. sequence independent of {X_t}; that is, Y_n is a stochastic function of X_n. The parameter mentioned before labels both the transition probability matrix of the chain, the function h and the distribution of {ε_t}, although in principle the latter can be taken as fixed,

The research of the first two authors was partially supported by NSF grant DMS and by US/Israel Bi-National Science Foundation grant 9-31/2. The third author was supported by a Swedish Research Council grant.
Department of Statistics, University of California, Evans Hall, Berkeley, CA 94720, USA; Department of Statistics, The Hebrew University, Jerusalem 91905, Israel; Centre for Mathematical Sciences, Lund University, Box 118, S-221 00 Lund, Sweden

for example uniform on (0, 1). In this generality, HMMs include state space models, cf. Kalman (1977). We, in this paper as before, restrict ourselves to the case where the state space of {X_t} is finite. HMMs have during the last decade become widespread for modeling sequences of weakly dependent random variables, with applications in areas like speech processing (Rabiner, 1989), neurophysiology (Fredkin and Rice, 1992), and biology (Leroux and Puterman). See also the monograph by MacDonald and Zucchini (1997). Inference for HMMs was first considered by Baum and Petrie, who treated the case when {Y_t} takes values in a finite set. In Baum and Petrie (1966), results on consistency and asymptotic normality of the maximum-likelihood estimator (MLE) are given, and the conditions for consistency are weakened in Petrie (1969). Baum and Petrie, and particularly Petrie, also studied the structure of the map ϑ ↦ E_ϑ log p_ϑ(Y_1 | Y_0, Y_{−1}, ...), the conditional limiting entropy per observation. The results of Bickel et al. (1998) depend critically on bounds on derivatives of the loglikelihood of the observations. Specifically, we showed that under Cramér-type conditions at ϑ_0,

sup { |(∂^k/∂ϑ^k) log p_ϑ(Y_1, ..., Y_n)| : |ϑ − ϑ_0| ≤ ε } ≤ M_k(Y_1, ..., Y_n), (1)

where E_{ϑ_0} M_k(Y_1, ..., Y_n) ≤ C < ∞ for k = 1, 2, as well as quasi-continuity of the map ϑ ↦ (∂²/∂ϑ²) log p_ϑ(Y_1, ..., Y_n) at ϑ_0. The same type of bounds were applied to state space models by Jensen and Petersen (1999). Similar bounds were obtained and used in the case where Y is finitely supported by Baum and Petrie (1966) and Petrie (1969). They were actually able to show that if ϑ is the transition probability matrix of the Markov chain (and when Y is finitely supported this is the most general model) and Θ = {ϑ : ϑ_ij ≥ δ > 0 for all i, j}, then the map ϑ ↦ E_ϑ log p_ϑ(Y_1 | Y_0, Y_{−1}, ...) has a convergent series expansion everywhere on Θ.
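The representation Y_n = h(X_n, ε_n) above can be made concrete in a few lines. Everything in the sketch below (the two-state chain, the transition matrix, the emission function h) is an invented toy choice, used only to illustrate the definition:

```python
import random

# Toy illustration of Y_n = h(X_n, eps_n): {X_t} is a two-state Markov chain
# and each observation is a fixed function h of the current state and an
# independent uniform noise eps_n on (0, 1). All numbers here are made up.

A = [[0.9, 0.1],
     [0.2, 0.8]]           # transition probability matrix of {X_t}
means = [0.0, 3.0]          # emission parameter attached to each state

def h(x, eps):
    # A stochastic function of the state: shift uniform noise by a
    # state-dependent location.
    return means[x] + (eps - 0.5)

def simulate(n, seed=0):
    rng = random.Random(seed)
    x = 0
    xs, ys = [], []
    for _ in range(n):
        xs.append(x)
        ys.append(h(x, rng.random()))    # eps_n i.i.d. uniform on (0, 1)
        x = 0 if rng.random() < A[x][0] else 1
    return xs, ys

xs, ys = simulate(5)
```

Because h only shifts the noise, each observation sits within 1/2 of the mean attached to its hidden state, which is what makes the state "hidden but identifiable" in this toy case.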
Our technical goals are threefold:

(i) To exhibit bounds on the derivatives of the form (1) and to show that, under some conditions, M_k grows no faster than nC^k k!. We shall derive these bounds by a unified argument relying on results of Saulis and Statulevičius (1991) (henceforth referred to as S&S).

(ii) To obtain bounds on

sup_ϑ | (∂^k/∂ϑ^k) E_ϑ ( (∂^m/∂ϑ^m) log p_ϑ(Y_1 | Y_0, Y_{−1}, ...) ) |.

We give conditions under which this expression is bounded by C^{k+m} (k + m)!.

(iii) To establish bounds on

| (∂^k/∂ϑ^k) ( log p_ϑ(Y_1 | Y_0, Y_{−1}, ..., Y_{−t}) − log p_ϑ(Y_1 | Y_0, Y_{−1}, ..., Y_{−s}) ) |

of the form Cρ^{t∧s} for ρ < 1.

Using these results, we shall, under suitable conditions,

(a) Show how to establish stochastic asymptotic expansions for the MLE in terms of derivatives of the loglikelihood at ϑ_0. We sketch how, in conjunction with Edgeworth type expansions for sums of functions of Markov chains, these establish the validity of procedures such as debiasing the MLE and other second order methods as available in the i.i.d. case.

(b) Show that the following functions are analytic: the Fisher information

I(ϑ) = −E_ϑ { (∂²/∂ϑ²) log p_ϑ(Y_1 | Y_0, Y_{−1}, ...) },

the Kullback-Leibler distance

K(ϑ) = E_{ϑ_0} { log ( p_{ϑ_0}(Y_1 | Y_0, Y_{−1}, ...) / p_ϑ(Y_1 | Y_0, Y_{−1}, ...) ) }

and the entropy

H(ϑ) = E_ϑ { log p_ϑ(Y_1 | Y_0, Y_{−1}, ...) }.

(c) Study the behavior of these functions and their derivatives at points ϑ under which the X's are i.i.d. (the transition probability matrix is degenerate). We show that at such points, in principle, these quantities can be computed explicitly.

(d) Show how to use these expansions qualitatively to guess properties of I(ϑ) which can be established in other ways.

2 Assumptions and main results

We observe Y-valued random variables Y_1, ..., Y_n, where Y is a general space, distributed as follows. We let {X_t}_{t=1}^n be a stationary Markov chain with state space {1, 2, ..., R} and transition probability matrix A_ϑ = {α_ϑ(·, ·)}. Then, Y_1, ..., Y_n are conditionally independent given X_1, ..., X_n and the conditional distribution of Y_t depends on X_t only. We

assume that these distributions have densities g_ϑ(y | x), where y ∈ Y and x ∈ {1, ..., R}, with respect to some common σ-finite measure on Y. Thus, given X_t = i, Y_t has conditional density g_ϑ(· | i). We assume that {X_t} is ergodic, so that the stationary distribution exists and is unique. We denote it by π_ϑ(·). The parameter ϑ lies in Θ ⊂ R^d, where Θ is open. All computations will be done under a particular value of ϑ, denoted by ϑ_0. We write Y_s^t for (Y_s, ..., Y_t) and define X_s^t similarly. We can and will embed Y_1, ..., Y_n into {Y_t}_{t=−∞}^∞, related to {X_t}_{t=−∞}^∞ by the mechanism described above. When this is done, we write Y^t for (..., Y_{t−1}, Y_t) and define X^t similarly. Probabilities and expectations will be denoted by P and E, respectively, and the conditional expectation given some random variable Y by E^Y. Likelihood ratios (with respect to ϑ_0) will be denoted by L and loglikelihood ratios by l. Note that we use the same characters to denote different functions; the specific function will be clear from the argument. If the value of the parameter is ϑ_0, we often replace it by 0. Thus L_ϑ(Y_1^n) is the likelihood ratio of Y_1^n calculated at ϑ, whereas l_0(X, Y) is the loglikelihood ratio of X and Y (being some general random variables) calculated at ϑ_0. To further shorten the notation, we let

h_t(ϑ) = l_ϑ(X_t, Y_t | X_1^{t−1}, Y_1^{t−1}) for t ≥ 1.

Thus, by the very definition of an HMM, h_t(ϑ) = log α_ϑ(X_{t−1}, X_t) + log g_ϑ(Y_t | X_t) for t > 1 and h_1(ϑ) = log π_ϑ(X_1) + log g_ϑ(Y_1 | X_1). If a = (a_1, ..., a_d) is a d-dimensional vector with non-negative integer entries, then D^a denotes the corresponding partial derivative ∂^{a_+}/(∂ϑ_1^{a_1} ⋯ ∂ϑ_d^{a_d}), where a_+ = Σ_1^d a_i. We call such an a a multi-index. Furthermore we define

C_k(y) = sup_{ϑ∈V} max_{a_+=k} max_{i,j} { |D^a log α_ϑ(i, j)| + |D^a log g_ϑ(y | i)| + |D^a log π_ϑ(i)| }, (2)

with V being a neighborhood of ϑ_0 and the outer maximum being taken over all multi-indices a with a_+ = k, and

B_k = max { E ( ∏_{t=1}^r ∏_{i=p_{t−1}+1}^{p_t} C_{j_i}(Y_t)/j_i! | X_t = x_t, t = 1, ..., r ) : 1 ≤ m ≤ k, 1 ≤ r ≤ m, j_1, ..., j_m ≥ 1, Σ_i j_i = k, 0 = p_0 < p_1 < ... < p_r = m, x_1, ..., x_r ∈ {1, ..., R} }.

The last quantity measures how big, in expectation, we can make a product of partial derivatives of the h_t by distributing a total of k derivatives and possible time indices over different components of the parameter and time, respectively. In particular, if C̄_k(y) = max{C_m(y) : 1 ≤ m ≤ k}, then

B_k ≤ max { E ( C̄_k(Y_1)^m | X_1 = x ) : x ∈ {1, ..., R}, 1 ≤ m ≤ k } ≤ C^k,

if all C_k(y) ≤ C < ∞. On the other hand, if the Y_t are mutually independent, then

B_k ≤ ∏_{m=1}^k ( 1 ∨ max { E ( C_m(Y_1) | X_1 = x ) : x ∈ {1, ..., R} } ).

We now state the main assumptions used in this paper.

(A1) The entries of the transition probability matrix A_ϑ are bounded away from zero on V.

(A1′) Condition (A1) holds for all ϑ_0.

(A2k) For all i, j, the maps ϑ ↦ log α_ϑ(i, j) and ϑ ↦ log π_ϑ(i) have k continuous derivatives in the neighborhood V of ϑ_0, and for all i and y ∈ Y, ϑ ↦ log g_ϑ(y | i) has k continuous derivatives in the same neighborhood.

(A2) All log α_ϑ(i, j) and log π_ϑ(i) and all their derivatives are uniformly bounded.

(A3k) B_k < ∞.

(A3) All derivatives of log g_ϑ(y | i) are uniformly bounded in y.

We now state our main results.

Theorem 2.1. Assume (A1), (A2k) and (A3k) hold. Then for all multi-indices a with a_+ = k,

E sup_{ϑ∈V} |D^a l_ϑ(Y_1^n)| ≤ C_1 n B_k C_2^k k!

for some C_1 and C_2 that depend on the transition probability matrices A_ϑ, ϑ ∈ V, only.

Theorem 2.2. Let a and b be multi-indices with a_+ = k and b_+ = m, respectively. Then the following assertions hold true.

(i) Under (A1), (A2k) and (A3k),

E |D^a l_0(Y_1 | Y_{−n}^0)| ≤ C_1 B_k C_2^k k!,

where C_1 and C_2 depend on the transition probability matrix A only.

(ii) Under (A1), (A2) and (A3), if n ≥ t then

|D^a P(X_1 = x | Y_{−n}^0) − D^a P(X_1 = x | Y_{−t}^0)| ≤ C_3 ρ^t, ρ < 1,

where C_3 depends on the uniform bound on all derivatives and on A.

(iii) Under (A1), (A2k) and (A3k),

E |D^a l_0(Y_1 | Y_{−n}^0) D^b L_0(Y_{−n}^1)| ≤ C_3 k! m!.

Remarks.

1. The bounds in (i) and (iii) are the same as in the case {Y_i} i.i.d.

2. The bound of (ii) expresses a strong mixing property. We could prove versions of (ii) and (iii) under (A1), (A2k) and (A3k), but the results are technically complicated and we leave them to the reader.

Theorem 2.3. Under (A1), (A2) and (A3) the following assertions hold true.

(i) D^a l_0(Y_1 | Y_{−n}^0) converges P_0-a.s. to D^a l_0(Y_1 | Y_{−∞}^0).

(ii) E_0 ( D^a l_0(Y_1 | Y_{−n}^0) D^b L_0(Y_{−n}^1) ) converges to an appropriate limit as n → ∞.

(iii) n^{−1/2} ( D^a l_0(Y_1^n) − E_0 D^a l_0(Y_1^n) ) converges weakly to a N(0, Var_0(D^a l_0(Y_1 | Y_{−∞}^0))) distribution under P_0.

3 Proofs of main results

Throughout the remainder of the paper we shall assume that the parameter space Θ is one-dimensional, that is, d = 1. This causes no loss of generality, but simplifies the notation as we do not need to work with mixed partial derivatives. Derivatives with respect to ϑ of order k will be denoted with superindex (k), for example L_ϑ^{(k)}. In many of the proofs, cumulants play a major role. Let Z_1, ..., Z_k be k random variables. We denote their cumulant by

Γ(Z_1, ..., Z_k) = (1/ι^k) ( ∂^k/(∂u_1 ⋯ ∂u_k) ) log E e^{ι(u_1 Z_1 + ⋯ + u_k Z_k)} |_{u_1 = ... = u_k = 0}, (3)

where ι = √−1. The cumulant is a multilinear function (that is, it is linear in any of the random variables if all other variables are kept fixed) and, in particular, if Z_1 = ... = Z_k then Γ(Z_1, ..., Z_k) is the standard kth cumulant of Z_1. Finally, if Z = (Z_1, ..., Z_k) we write Γ(Z) = Γ(Z_1, ..., Z_k), and Γ^W(Z) for the cumulant of the conditional distribution of Z given W.

For the proof of Theorem 2.1 we proceed as follows. (i) We write l_0^{(k)}(Y_1^n) as a linear combination of conditional cumulants of Σ_{t=1}^n h_t^{(j)}(ϑ_0), 1 ≤ j ≤ k, by means of a general formula valid for any latent variable model, see (4). (ii) By a generalization of results by S&S we give a bound on the individual conditional cumulants of the form (1) to yield the result.
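Before the cumulant machinery, it may help to recall how the quantity being differentiated, log p_ϑ(Y_1^n), is actually evaluated: the standard forward recursion accumulates the conditional terms log p_ϑ(Y_t | Y_1^{t−1}). The two-state chain with binary emissions below is an invented toy example, not anything from the paper:

```python
import math

# Scaled forward recursion for log p(Y_1,...,Y_n) in a toy HMM: stationary
# initial law pi, transition matrix alpha, and a small discrete emission
# table g[x][y] standing in for the densities g(y | x). Numbers are made up.

alpha = [[0.7, 0.3],
         [0.4, 0.6]]
pi = [4.0 / 7.0, 3.0 / 7.0]        # stationary law of alpha (pi A = pi)
g = [{0: 0.8, 1: 0.2},             # g(y | x = 0)
     {0: 0.3, 1: 0.7}]             # g(y | x = 1)

def loglik(ys):
    # p holds the filter P(X_t = x | Y_1^t) up to normalization; the running
    # sum of log normalizers is log p(Y_1^n) = sum_t log p(Y_t | Y_1^{t-1}).
    p = [pi[x] * g[x][ys[0]] for x in range(2)]
    ll = math.log(sum(p))
    p = [v / sum(p) for v in p]
    for y in ys[1:]:
        q = [sum(p[i] * alpha[i][j] for i in range(2)) * g[j][y]
             for j in range(2)]
        ll += math.log(sum(q))
        p = [v / sum(q) for v in q]
    return ll

ll = loglik([0, 1, 1, 0])
```

The recursion is exactly the marginalization over the hidden path done incrementally, which is why it can be checked against brute-force summation over all state sequences on short data.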

For the proof of Theorem 2.2(i), we use the same formula, expressing l^{(k)}(Y_1 | Y_{−n}^0) as l^{(k)}(Y_{−n}^1) − l^{(k)}(Y_{−n}^0) and analyzing it in the same way as in Theorem 2.1. For part (ii) we use a further decomposition into so-called centered moments; see below and part (ii) of Theorem 2.2. We also relate L^{(m)} to the h^{(i)}, 1 ≤ i ≤ m, by another general formula.

For the general formula, we need additional notation. Let Z_+ be the set of positive integers. We let J = {(J_1, ..., J_k) : k > 0, J_j ∈ Z_+, j = 1, ..., k}. For J ∈ J, |J| denotes the dimension of the vector J and J_+ = Σ J_j. We define the following subsets of J: J(k) = {J ∈ J : |J| = k} and J_+(k) = {J ∈ J : J_+ = k}. Another useful set of integer vectors is J_m^n(k) = {(I_1, ..., I_k) : I_i ∈ Z_+, m ≤ I_i ≤ n, i = 1, ..., k}. For any integer vector I as above, min I = min_i I_i, max I = max_i I_i and Δ(I) = max I − min I. Furthermore, if a_1, a_2, ... is any sequence and |I| = k, then a_I = (a_{I_1}, ..., a_{I_k}). An operation between two sequences is done term-wise, so that a_I/b_I = (a_{I_1}/b_{I_1}, ..., a_{I_k}/b_{I_k}). In general any operation is meant to be term by term, so J! = (J_1!, J_2!, ...), ∏ a_I = ∏_{i=1}^k a_{I_i}, and a very typical expression in this paper is

f_I^{(J)}/J! = ( f_{I_1}^{(J_1)}/J_1!, ..., f_{I_k}^{(J_k)}/J_k! ).

We now let X and Y be any random vectors such that only Y is observed, with X being missing at random.

Proposition 3.1. Suppose the loglikelihood ratio of the full data model, l_ϑ(X, Y), is k times differentiable. Then the loglikelihood ratio of the observable model is k times differentiable and

l^{(k)}(Y) = Σ_{J∈J_+(k)} (k!/|J|!) Γ^Y ( l^{(J)}(X, Y)/J! ), (4)

L^{(k)}(Y) = Σ_{J∈J_+(k)} (k!/|J|!) ∏ ( l^{(J)}(Y)/J! ). (5)

The first of these formulae may be viewed as a generalization of results of Louis (1982) and Meilijson (1989), relating the score function and observed information of the observable vector Y to those of the full model. The above proposition provides results also for

higher order derivatives. Both of the statements of the proposition are closely related to the exlog relations in Barndorff-Nielsen and Cox (1989, pp. 14). For the proof we need the following form of Faà di Bruno's formula, whose proof is in the Appendix.

Lemma 3.1. (i) For any functions f : R → R and h : R → R with h(0) = 0 and f and h being k times differentiable,

(∂^k/∂ϑ^k) f(h(ϑ)) |_{ϑ=0} = (∂^k/∂ϑ^k) f ( Σ_{i=0}^k ϑ^i h^{(i)}(0)/i! ) |_{ϑ=0}.

(ii) If f : R^k → R, then

(∂^k/∂ϑ^k) f(ϑ, ϑ²/2, ..., ϑ^k/k!) |_{ϑ=0} = Σ_{J∈J_+(k)} ( k!/(|J|! ∏J!) ) ( ∂^{|J|} f/(∂u_{J_1} ⋯ ∂u_{J_{|J|}}) ) |_{u_1 = ... = u_k = 0}.

Proof of Proposition 3.1. We start with the representation l_ϑ(Y) = log E^Y e^{l_ϑ(X,Y)}. Note that for random variables whose joint moment generating function exists in a vicinity of 0, the joint characteristic function in the definition (3) of the cumulant can be replaced by the joint moment generating function, and the factor ι^k in the denominator then also disappears. Hence

( ∂^{|J|}/(∂u_{J_1} ⋯ ∂u_{J_{|J|}}) ) log E^Y e^{Σ_i u_i W_i} |_{u_1 = ... = u_k = 0} = Γ^Y(W_{J_1}, ..., W_{J_{|J|}}) = Γ^Y(W_J).

Now apply the first part of Lemma 3.1 to obtain

l^{(k)}(Y) = (∂^k/∂ϑ^k) log E^Y exp { Σ_{i=1}^k ϑ^i l^{(i)}(X, Y)/i! } |_{ϑ=0}.

Apply the second part of the lemma to this expression, with

f(u_1, ..., u_k) = log E^Y exp { Σ_{i=1}^k u_i l^{(i)}(X, Y) },

to see that

l^{(k)}(Y) = Σ_{J∈J_+(k)} ( k!/(|J|! ∏J!) ) Γ^Y ( l^{(J)}(X, Y) ).

The product ∏J! can be taken inside Γ because of the multilinearity of the cumulant function, and the proof of the first part of the proposition is complete. Similarly,

L^{(k)}(Y) = (∂^k/∂ϑ^k) e^{l_ϑ(Y)} |_{ϑ=0} = (∂^k/∂ϑ^k) exp { Σ_{j=1}^k ϑ^j l^{(j)}(Y)/j! } |_{ϑ=0} = Σ_{J∈J_+(k)} (k!/|J|!) ∏ ( l^{(J)}(Y)/J! )

by Lemma 3.1 with f(u_1, ..., u_k) = exp{ Σ_{j=1}^k u_j l^{(j)}(Y) }.

The next lemma requires the introduction of so-called centered moments and notation for mixing. For any random variables Z_1, Z_2, ..., let χ_0(Z_1) = Z_1 and χ(Z_1) = EZ_1, and define recursively

χ_0(Z_1, ..., Z_k) = Z_1 ( χ_0(Z_2, ..., Z_k) − χ(Z_2, ..., Z_k) ),
χ(Z_1, ..., Z_k) = E χ_0(Z_1, ..., Z_k).

χ is called the centered moment function (S&S, p. 12). For example,

χ(Z_1, Z_2, Z_3) = E(Z_1 Z_2 Z_3) − E(Z_1) E(Z_2 Z_3) − E(Z_1 Z_2) E(Z_3) + E(Z_1) E(Z_2) E(Z_3). (6)

Similarly to the notation for cumulants, if Z = (Z_1, ..., Z_k) then χ(Z) = χ(Z_1, ..., Z_k), and χ^W(Z) is the centered moment of the conditional distribution of Z given W. Let Z_t = g_t(T_t) for some measurable functions g_t, where {T_t}_{t=−∞}^∞ is Markovian and obeys the following mixing condition in terms of constants ϕ_t, −∞ < t < ∞. If F_{−∞}^m is the σ-field generated by Z_t, −∞ < t ≤ m, and F_n^∞ is the σ-field generated by Z_t, n ≤ t < ∞, then for all m < n,

sup { |P(B | A) − P(B)| : A ∈ F_{−∞}^m, B ∈ F_n^∞, P(A) > 0 } ≤ ∏_{t=m+1}^n ϕ_t. (7)

Lemma 3.2. With {Z_t} as above, assume |Z_t| ≤ C_t a.s., 1 ≤ t ≤ n, and let 1 ≤ t_1 ≤ t_2 ≤ ... ≤ t_k ≤ n. Then

|χ(Z_{t_1}, ..., Z_{t_k})| ≤ 2^{k−1} ∏_{j=1}^k C_{t_j} ∏_{j=t_1+1}^{t_k} ϕ_j.
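The recursion defining χ can be run literally on a finite joint law and compared with the k = 3 expansion (6). The joint distribution below is an arbitrary toy choice of ours, used only as a mechanical check:

```python
# Centered moments chi via their defining recursion, evaluated exactly under
# a small made-up joint law of (Z1, Z2, Z3); chi0 is the random (unaveraged)
# object and chi its expectation, exactly as in the text.

law = {(1, 0, 2): 0.4, (0, 1, 1): 0.1, (2, 2, 0): 0.5}   # outcome -> probability

def E(g):
    return sum(p * g(z) for z, p in law.items())

def chi0(fs, z):
    # chi0(Z_1) = Z_1; chi0(Z_1,...,Z_k) = Z_1 (chi0(Z_2..k) - chi(Z_2..k))
    if len(fs) == 1:
        return fs[0](z)
    return fs[0](z) * (chi0(fs[1:], z) - chi(fs[1:]))

def chi(fs):
    return E(lambda z: chi0(fs, z))

Z1 = lambda z: float(z[0])
Z2 = lambda z: float(z[1])
Z3 = lambda z: float(z[2])

lhs = chi([Z1, Z2, Z3])
# Right-hand side of the displayed expansion (6):
rhs = (E(lambda z: z[0] * z[1] * z[2])
       - E(Z1) * E(lambda z: z[1] * z[2])
       - E(lambda z: z[0] * z[1]) * E(Z3)
       + E(Z1) * E(Z2) * E(Z3))
```

Unwinding the recursion by hand reproduces exactly the four terms of (6), which is what the equality of `lhs` and `rhs` confirms numerically.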

Proof. This is essentially Theorem 4.4 of S&S. The only difference is that we allow different bounds C_t on the Z_t; the validity of this extension follows easily, as multiplicative constants can be moved in and out of centered moments.

We now express the cumulants of sums in terms of centered moments, a generalization of a formula of S&S. Let W_1, W_2, ... be random vectors, i.e. W_i = (W_{i,1}, W_{i,2}, ...) etc. If J ∈ J is a set of indices, W_{i,J} denotes the vector with elements W_{i,j}, j ∈ J.

Lemma 3.3. (i) The multivariate cumulant Γ(·) can be expanded and bounded as

Γ ( Σ_{i=1}^n W_{i,J} ) = Σ_{I∈J_1^n(|J|)} Γ(W_{I,J}),

|Γ ( Σ_{i=1}^n W_{i,J} )| ≤ Σ_{i=1}^n Σ_{I∈J_1^n(|J|): min I = i} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ∏_{q=1}^ν |χ(W_{I(K_q),J(K_q)})|,

where the inner sum is over all partitions K_1, ..., K_ν of the set {1, ..., |J|}, I(K_q) = (I_{K_{q,1}}, I_{K_{q,2}}, ...) and J(K_q) is defined similarly. The M_ν are non-negative combinatorial constants satisfying, in particular, that M_ν(K_1, ..., K_ν) > 0 implies Σ_{q=1}^ν Δ(I(K_q)) ≥ Δ(I).

(ii) For all i and ρ < 1,

Σ_{I∈J_1^n(|J|): min I = i} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ρ^{Σ_q Δ(I(K_q))} ≤ ( 4/(1−ρ) )^{|J|−1} |J|!.

To clarify the notation once more, we remark that Γ(W_{I,J}) = Γ(W_{i_1,J_1}, ..., W_{i_l,J_l}), where l = |I| = |J|.

Proof. The multilinearity of the cumulant function is one of its basic properties. The bound in (i) comes from S&S Lemma 1.1, where also the property of the M_ν's is found. For part (ii), note that the proof of S&S Lemma 4.6 starts with their (4.55), which is equivalent to the expression in part (i) of the lemma. Then, in S&S's notation, we use C_1 = C_2 = u = 1, f(s, t) = ρ^{t−s} and the bound ρ^{Δ(I_p)} on |χ(W_{I_p,J_p})|; the result now follows from (4.6) in S&S.

We remark that we tacitly assume that for any cumulant Γ(W_{I,J}), the vectors I (and J) are rearranged so that the elements of I become sorted in non-decreasing order

before the cumulant is expanded into centered moments as in the above lemma. Since cumulants are invariant with respect to permutations of the random variables involved, such a rearrangement does not change the value of the cumulant, but it is necessary as we want to apply results like Lemmas 3.2 and 3.4, which do require sorted time indices.

We shall now examine the mixing condition (7) for HMMs and identify the ϕ's in this particular case. Define ρ by

1 − ρ = inf_{ϑ∈V} ( min_{i,j} α*_ϑ(i, j) / max_{i,j} α*_ϑ(i, j) )

with α*_ϑ(i, j) = (π_ϑ(j)/π_ϑ(i)) α_ϑ(j, i); note that the α*_ϑ(i, j) are the transition probabilities of the time-reversed Markov chain. Under (A1), ρ < 1. Generally, if A_ϑ is ergodic there is an m, m ≤ R, such that all m-step transition probabilities are positive. Assuming m = 1, it holds that if H_t is a set defined in terms of X_u and Y_u, t ≤ u ≤ n only, then for s < t,

max_i P_ϑ(H_t | Y_1^n, X_s = i) − min_i P_ϑ(H_t | Y_1^n, X_s = i) ≤ ρ^{t−s}. (8)

This result is proved in Douc, Moulines and Rydén (2001), Corollary 1. A simple conditioning argument then yields

max_i |P_ϑ(H_t | Y_1^n, X_s = i) − P_ϑ(H_t | Y_1^n)| ≤ ρ^{t−s} for all ϑ ∈ V, (9)

which is our particular version of (7). Note that we work with the conditional Markov chain X | Y (it is straightforward to verify that the conditional process is still Markov, although non-homogeneous), because in view of Proposition 3.1 we want to examine conditional cumulants. When Y_1^n is fixed we can identify ϕ_t in (7) with ρ in (9). Following Bickel et al. (1998), Lemma 5, one can also prove that for 1 ≤ s ≤ t − 1,

sup_{A⊂{1,...,R}} |P_ϑ(X_s ∈ A | Y_1^t) − P_ϑ(X_s ∈ A | Y_1^{t−1})| ≤ ρ^{t−1−s} for all ϑ ∈ V. (10)

In the case m > 1, an inequality similar to (8) still holds true, but with the bound on the conditional mixing now depending on the Y_t. This causes an additional degree of difficulty in our subsequent arguments and we do not treat this case. The next result now follows from the above and Lemma 3.2.

Lemma 3.4. Let K_q = (K_{q,1}, ..., K_{q,l}) be an element of a partition as in Lemma 3.3. Then

|χ_ϑ^{Y_1^n} ( h^{(J(K_q))}_{I(K_q)}(ϑ) )| ≤ 2^{l−1} ∏_{j=1}^l C_{J_{K_{q,j}}}(Y_{I_{K_{q,j}}}) ρ^{Δ(I(K_q))} for all ϑ ∈ V,

where I(K_q) = (I_{K_{q,1}}, ..., I_{K_{q,l}}) etc. and C_k(y) is defined in (2).
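The time-reversal identity α*_ϑ(i, j) = π_ϑ(j) α_ϑ(j, i)/π_ϑ(i) entering the definition of ρ is easy to sanity-check on a toy two-state chain (numbers ours): the reversed matrix is again stochastic and leaves the same stationary law invariant.

```python
# Time reversal of a two-state chain: rev[i][j] = pi[j] * alpha[j][i] / pi[i].
# alpha and pi below are an invented example; pi is stationary for alpha,
# i.e. sum_i pi[i] * alpha[i][j] = pi[j].

alpha = [[0.7, 0.3],
         [0.4, 0.6]]
pi = [4.0 / 7.0, 3.0 / 7.0]

rev = [[pi[j] * alpha[j][i] / pi[i] for j in range(2)] for i in range(2)]

row_sums = [sum(row) for row in rev]   # both should equal 1
```

That the rows of `rev` sum to one is precisely the stationarity of π: Σ_j π(j)α(j, i)/π(i) = (πA)_i/π(i) = 1.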

Proof of Theorem 2.1. First note that we may replace ϑ_0 by any ϑ in Proposition 3.1 without changing the definition of h_t, as only derivatives of this function appear in the proposition. Hence

|l_ϑ^{(k)}(Y_1^n)| ≤ k! Σ_{J∈J_+(k)} (1/|J|!) |Γ_ϑ^{Y_1^n} ( Σ_{i=1}^n h_i^{(J)}(ϑ)/J! )| for all ϑ ∈ V.

Expand Γ_ϑ^{Y_1^n}(Σ_{i=1}^n h_i^{(J)}(ϑ)/J!) as in Lemma 3.3(i). We can employ Lemma 3.4 to obtain the bound

∏_{q=1}^ν |χ_ϑ^{Y_1^n} ( h^{(J(K_q))}_{I(K_q)}(ϑ) )| ≤ ∏_{q=1}^ν 2^{|K_q|−1} ∏_{j=1}^{|K_q|} C_{J_{K_{q,j}}}(Y_{I_{K_{q,j}}}) ρ^{Δ(I(K_q))} = 2^{|J|−ν} ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) ρ^{Σ_q Δ(I(K_q))} for all ϑ ∈ V. (11)

Taking the expectation of this bound and letting i_1 < i_2 < ... < i_r denote the distinct points of the vector I, the structure of an HMM yields

E sup_{ϑ∈V} ∏_{q=1}^ν |χ_ϑ^{Y_1^n} ( h^{(J(K_q))}_{I(K_q)}(ϑ) )|
≤ 2^{|J|−ν} E ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) ρ^{Σ_q Δ(I(K_q))}
= 2^{|J|−ν} E ∏_{l=1}^r E ( ∏_{j: I_j = i_l} C_{J_j}(Y_{i_l}) | X_1^n ) ρ^{Σ_q Δ(I(K_q))}
= 2^{|J|−ν} E ∏_{l=1}^r E ( ∏_{j: I_j = i_l} C_{J_j}(Y_{i_l}) | X_{i_l} ) ρ^{Σ_q Δ(I(K_q))}
≤ 2^{|J|−1} ∏_{l=1}^r max_x E ( ∏_{j: I_j = i_l} C_{J_j}(Y_{i_l}) | X_{i_l} = x ) ρ^{Σ_q Δ(I(K_q))}. (12)

Multiplying by 1/∏J! and using Lemma 3.3(ii), we obtain

E sup_{ϑ∈V} |Γ_ϑ^{Y_1^n} ( Σ_{i=1}^n h_i^{(J)}(ϑ)/J! )| ≤ n B_k ( 8/(1−ρ) )^{|J|−1} |J|!.

Hence

E sup_{ϑ∈V} |l_ϑ^{(k)}(Y_1^n)| ≤ n k! B_k Σ_{J∈J_+(k)} ( 8/(1−ρ) )^{|J|−1} ≤ n k! B_k ( 1 + 8/(1−ρ) )^{k−1},

where the last inequality is from Lemma 4.1(ii) in the Appendix.

The next lemma is needed for the proof of Theorem 2.2.

Lemma 3.5. Let a ≤ a′ ≤ b′ ≤ b and suppose that W is measurable w.r.t. the sigma-field generated by Y_{a′}^{b′}. Then

E { W L^{(m)}(Y_a^b) } = E { W L^{(m)}(Y_{a′}^{b′}) }, m = 0, 1, 2, ...

The proof is given in the Appendix.

Proof of Theorem 2.2. In this proof we again use the notation of Lemma 3.3, and drop the argument ϑ_0 of the function h, as this parameter stays fixed throughout the proof. Moreover, since this lemma and other ones are formulated in terms of positive time indices, we shift the indices of the statement of the theorem and set out to prove

E |l^{(k)}(Y_n | Y_1^{n−1})| ≤ C_1 B_k C_2^k k!.

By Proposition 3.1 and multilinearity of the cumulant function,

l^{(k)}(Y_n | Y_1^{n−1}) = l^{(k)}(Y_1^n) − l^{(k)}(Y_1^{n−1})
= Σ_{J∈J_+(k)} (k!/|J|!) { Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J)}/J! ) − Γ^{Y_1^{n−1}} ( Σ_{i=1}^{n−1} h_i^{(J)}/J! ) }
+ Σ_{J∈J_+(k)} (k!/|J|!) Σ_{(J′,J″)} Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J′)}/J′!, h_n^{(J″)}/J″! ), (13)

where the last sum is over all partitions (J′, J″) of the set J except (J′, J″) = (J, ∅). This partition is excluded since it corresponds to the first sum on the right hand side, which will be compared to the second one. Clearly, there are two types of cumulants here. The first two ones are similar and their difference will be shown to remain bounded in expectation as n → ∞.

The last sum involves cumulants that contain at least one h_n, and this is sufficient to keep them bounded as n → ∞. We start by considering

γ = ∏_{q=1}^ν χ^{Y_1^n} ( h^{(J(K_q))}_{I(K_q)} ) − ∏_{q=1}^ν χ^{Y_1^{n−1}} ( h^{(J(K_q))}_{I(K_q)} ),

where we assume that Σ_{q=1}^ν Δ(I(K_q)) ≥ Δ(I). This difference can be bounded in two ways. First, each term of the difference can be bounded separately. Arguing as for (11), we obtain

|γ| ≤ ∏_{q=1}^ν |χ^{Y_1^n} ( h^{(J(K_q))}_{I(K_q)} )| + ∏_{q=1}^ν |χ^{Y_1^{n−1}} ( h^{(J(K_q))}_{I(K_q)} )| ≤ 2 · 2^{|J|−ν} ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) ρ^{Σ_q Δ(I(K_q))}. (14)

Secondly, we can write

γ = ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) { ∏_{q=1}^ν χ^{Y_1^n} ( W_{I(K_q),J(K_q)} ) − ∏_{q=1}^ν χ^{Y_1^{n−1}} ( W_{I(K_q),J(K_q)} ) }, (15)

where W_{ij} = h_i^{(j)}/C_j(Y_i). Consider the scheme

∏_{i=1}^m B_i − ∏_{i=1}^m A_i = Σ_{j=1}^m ( ∏_{i=1}^{j−1} B_i ) (B_j − A_j) ( ∏_{i=j+1}^m A_i ). (16)

We expand χ(W_{I(K_q),J(K_q)}) into a sum of 2^{|K_q|−1} products of expected values of products of W_{ij}'s, cf. (6), with each random factor being bounded by one. Pick one of the terms in this sum. This term is thus a product of no more than |K_q| factors (with each factor being a conditional expectation of a product of W_{ij}'s). We call these factors A_i and B_i, respectively, when the expectation is conditional on Y_1^n and Y_1^{n−1}, respectively. Using (10) we find that for each factor, the difference between its conditional expectations under Y_1^n and Y_1^{n−1}, respectively, is bounded by ρ^{n−1−max I(K_q)} ≤ ρ^{n−1−max I}, that is, |A_i − B_i|

is bounded by this expression. Employing (16) for the product, we can bound the product difference by |K_q| ρ^{n−1−max I}. Using this bound for each term, we find

|χ^{Y_1^n} ( W_{I(K_q),J(K_q)} ) − χ^{Y_1^{n−1}} ( W_{I(K_q),J(K_q)} )| ≤ 2^{|K_q|−1} |K_q| ρ^{n−1−max I},

where 2^{|K_q|−1} is the total number of terms in the sum and |K_q| upper bounds the number of factors in each term. In addition, just as Lemma 3.4 follows from Lemma 3.2, we obtain

|χ^{Y_1^n} ( W_{I(K_q),J(K_q)} )| ≤ 2^{|K_q|−1} ρ^{Δ(I(K_q))}

and similarly for Y_1^{n−1}. Hence, by applying (16) to (15),

|γ| ≤ ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) Σ_{p=1}^ν ( ∏_{q≠p} 2^{|K_q|−1} ρ^{Δ(I(K_q))} ) 2^{|K_p|−1} |K_p| ρ^{n−1−max I} ≤ ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) |J| 2^{|J|−1} ρ^{n−1−max I}. (17)

We can combine these two bounds, (14) and (17), by taking a geometric mean:

|γ| ≤ 2^{|J|−1} · 2^{1/2} |J|^{1/2} ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) ρ^{Σ_q Δ(I(K_q))/2 + (n−1−max I)/2}. (18)

As in the proof of Theorem 2.1, it follows that

E |Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J)}/J! ) − Γ^{Y_1^{n−1}} ( Σ_{i=1}^{n−1} h_i^{(J)}/J! )| ≤ 2^{1/2} |J|^{1/2} B_k |J|! Σ_{i=1}^{n−1} ρ^{(n−1−i)/2} ( 8/(1−ρ^{1/2}) )^{|J|−1} ≤ 2^{1/2} (1−ρ^{1/2})^{−1} B_k |J|^{1/2} |J|! ( 8/(1−ρ^{1/2}) )^{|J|−1}. (19)

We now proceed to bounding the second type of cumulants appearing in (13). Let (J′, J″) be a partition of some J ∈ J_+(k) with J″ ≠ ∅. We can expand the cumulant similarly to Lemma 3.3(i) to obtain

|Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J′)}, h_n^{(J″)} )| ≤ Σ_{I∈J_1^{n−1}(|J′|)×{n}^{|J″|}} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ∏_{q=1}^ν |χ^{Y_1^n} ( h^{((J′,J″)(K_q))}_{I(K_q)} )|.

Taking expectations, applying the bound (12) and multiplying by 1/(∏J′! ∏J″!) yields

E |Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J′)}/J′!, h_n^{(J″)}/J″! )| ≤ 2^{|J|−1} B_k Σ_{I∈J_1^{n−1}(|J′|)×{n}^{|J″|}} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ρ^{Σ_q Δ(I(K_q))}.

Fix a partition (J′, J″) of J such that J″ ≠ ∅ and an I ∈ J_1^{n−1}(|J′|)×{n}^{|J″|}. We need to look closer at the combinatorial constants M_ν(K_1, ..., K_ν). If the partition (K_1, ..., K_ν) is such that there is no K_q with I(K_q) containing an element less than n as well as an element n, then M_ν(K_1, ..., K_ν) = 0. This is because in the graph for M_ν(K_1, ..., K_ν) (see S&S, pp. 8) there can be no edge over the vertex corresponding to the first occurrence of n in I. Hence, we may disregard partitions of this kind. Now consider a partition (K_1, ..., K_ν) with at least one I(K_q) containing an element less than n and an element n, and let max′ I denote the second largest element of the vector I, not counting multiple n's. We can form a new vector I′ from I by replacing all elements of I being equal to n by max′ I + 1. Then

Σ_{q=1}^ν Δ(I′(K_q)) ≥ Σ_{q=1}^ν Δ(I(K_q)) − (n − max′ I − 1).

This vector I′ is not a member of J_1^{n−1}(|J′|)×{n}^{|J″|} (unless max′ I = n − 1), but does belong to J_1^{max′ I}(|J′|)×{max′ I + 1}^{|J″|}; indeed, there is a one-to-one correspondence between vectors I and I′ with these characteristics. Therefore

Σ_{(J′,J″): J″≠∅} Σ_{I∈J_1^{n−1}(|J′|)×{n}^{|J″|}} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ρ^{Σ_q Δ(I(K_q))} ≤ Σ_{i=1}^{n−1} ρ^{n−i−1} Σ_{(J′,J″): J″≠∅} Σ_{I∈J_1^i(|J′|)×{i+1}^{|J″|}} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ρ^{Σ_q Δ(I(K_q))}.

For a fixed dimension |J| the summation above is done over I ∈ J_1^i(|J′|)×{i+1}^{|J″|}, a subset of J_1^n(|J|) characterized as vectors having exactly |J″| elements of maximal size i + 1, all being located at the end of the vector. In J_1^n(|J|) there are more elements having exactly |J″| elements of maximal size i + 1, disregarding their location. This number is in exact correspondence with the number of partitions (J′, J″) of J with |J″| as prescribed; their common value is the combinatorial constant C(|J|, |J″|). Hence the

above expression is bounded by

Σ_{i=1}^{n−1} ρ^{n−i−1} Σ_{I∈J_1^n(|J|): max I = i+1} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ρ^{Σ_q Δ(I(K_q))} ≤ Σ_{i=1}^{n−1} ρ^{n−i−1} ( 4/(1−ρ) )^{|J|−1} |J|! ≤ (1/(1−ρ)) ( 4/(1−ρ) )^{|J|−1} |J|!,

where the second last inequality is Lemma 3.3(ii); by symmetry, the bound is still valid when min I = i is replaced by max I = i. We note that this bound does in fact also take the partition (J′, J″) = (∅, J), which was not considered above, into account; it corresponds to I = {n}^{|J|}. Thus

E |Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J′)}/J′!, h_n^{(J″)}/J″! )| ≤ (1/(1−ρ)) ( 8/(1−ρ) )^{|J|−1} B_k |J|!. (20)

Adding (19) and (20) as in (13) we find, using Lemma 4.1(ii) in the Appendix, that

E |l^{(k)}(Y_n | Y_1^{n−1})| ≤ C′ B_k k! Σ_{J∈J_+(k)} |J|^{1/2} C^{|J|−1} ≤ C′ B_k (1 + C)^{k−1} k!

for some C > 8/(1−ρ^{1/2}) and C′ = 2^{1/2}/(1−ρ^{1/2}) + 1/(1−ρ).

Here is the proof of part (ii). Since D^k P(X_1 = x | Y_{−j}^0), j = t, ..., n, can be expressed as polynomials in P(X_1 = x | Y_{−j}^0) and D^r log P(X_1 = x | Y_{−j}^0), 1 ≤ r ≤ k, it is enough to establish bounds for these quantities. The claim for P(X_1 = x | Y_{−j}^0) is essentially part 3 of Lemma 5 of Bickel et al. (1998). In general, note that, since

P_ϑ(X_1 = x | Y_{−j}^0) = E { I(X_1 = x) L_ϑ(X_{−j}^1, Y_{−j}^0)/L_ϑ(Y_{−j}^0) | Y_{−j}^0 } = P_0(X_1 = x | Y_{−j}^0) E { L_ϑ(X_{−j}^1, Y_{−j}^0)/L_ϑ(Y_{−j}^0) | Y_{−j}^0, X_1 = x },

it holds that

D^r log P(X_1 = x | Y_{−j}^0) = Σ_{J∈J_+(r)} (r!/|J|!) Γ^{Y_{−j}^0, X_1 = x} ( ( l^{(J)}(X_{−j}^1, Y_{−j}^0) − l^{(J)}(Y_{−j}^0) )/J! ).

One can argue as for part (i) of the theorem, with the critical step being a bound on

γ = ∏_{q=1}^ν χ^{Y_{−j}^0, X_1 = x} ( W_{I(K_q),J(K_q)} ) − ∏_{q=1}^ν χ^{Y_{−j+1}^0, X_1 = x} ( W_{I(K_q),J(K_q)} ).

The argument follows the route in going from (15) to (18).

We now prove part (iii) of Theorem 2.2. We start by establishing a bound on derivatives of the likelihood. Note that by Proposition 3.1, Theorem 2.1 and Lemma 4.1,

E |L^{(k)}(Y_1^n)| ≤ C_1 n B_k k! Σ_{J∈J_+(k)} C_2^{|J|} ≤ C_1 n k! B_k C_2^k 2^k (21)

for some constants C_1 and C_2. Let

Λ_i^j = ∏_{q=1}^ν χ^{Y_i^j} ( h^{(J(K_q))}_{I(K_q)}/J(K_q)! ).

Then

E { γ L^{(m)}(Y_{−n}^1) } ≤ E { |Λ^1_{min I} − Λ^0_{min I}| |L^{(m)}(Y_{−n}^1)| } + Σ_{i=0}^{n+min I} E { |Λ^1_{min I − i − 1} − Λ^0_{min I − i − 1} − Λ^1_{min I − i} + Λ^0_{min I − i}| |L^{(m)}(Y_{−n}^1)| } = γ_1 + γ_2, say.

We now bound each of the terms. First,

γ_1 ≤ C_1^m C_2^{|J|} ( |min I| m )^m ρ^{Σ_q Δ(I(K_q))/2 + |min I|/2}

by Lemma 3.5, (21), and (18). But ( |min I| m )^m ρ^{|min I|/4} ≤ C_3^m m! for some C_3 > 0, whence

γ_1 ≤ C_1^m C_2^{|J|} m! ρ^{|min I|/4 + Σ_q Δ(I(K_q))/2}.

Similarly, by considering the appropriate differences in the expression for γ_2 depending on whether i > Δ(I) or vice versa,

γ_2 ≤ C_1^m C_2^{|J|} ( ( |min I| + i ) m )^m ρ^{Σ_q Δ(I(K_q))/2 + (i ∨ |min I|)/2} ≤ C_3^m C_2^{|J|} m! ρ^{Σ_q Δ(I(K_q))/2 + (i + |min I|)/4} ≤ C_4^m C_2^{|J|} m! ρ^{Σ_q Δ(I(K_q))/3 + |min I|/12}.

Having the bound on E{γ L^{(m)}} = γ_1 + γ_2, similar to the bound (18) on γ (except for the factor C^m m!), we continue as in the first part of the proof to prove the theorem. This completes the proof of Theorem 2.2.

Proof of Theorem 2.3. We proceed under the given assumptions. The first part of Theorem 2.3 follows readily from part (ii) of Theorem 2.2, since

D^a l_0(Y_1 | Y_{−n}^0) = D^a log ( Σ_x P_ϑ(X_1 = x | Y_{−n}^0) g_ϑ(Y_1 | x) ) |_{ϑ=ϑ_0}. (22)

For the second part note that

D^a l_0(Y_1^n) = Σ_{i=1}^n D^a l_0(Y_i | Y_1^{i−1}) = Σ_{i=(log n)^2+1}^n D^a l_0(Y_i | Y_{i−(log n)^2}^{i−1}) + o_p(n^{1/2}), (23)

by (22) and part (ii) of Theorem 2.2. Now, under our assumptions the variables in the sum in (23) are uniformly bounded and geometrically mixing, since the ith one is a function of U_{i,n} = (Y_{i−(log n)^2}, ..., Y_i), and the {U_{i,n}} are, uniformly in n, geometrically ϕ-mixing. Then asymptotic normality, with natural centering by means and scaling by standard deviations, follows by the obvious extension to triangular arrays of the classical theorem of Ibragimov; see Doukhan (1994, p. 47) for instance. That the means and variances converge to the limit postulated is again an exercise in applying part (ii) of Theorem 2.2.

4 Applications

4.1 A start at higher order asymptotics

It is well known, see for example Barndorff-Nielsen and Cox (1989), that in the i.i.d. case it is possible, under suitable smoothness and moment conditions, to debias the MLE ϑ̂ to first order, that is, to construct b̂(·) such that E_ϑ(ϑ̂ + n^{−1} b̂(ϑ̂)) = ϑ + O(n^{−3/2}) and b̂ → b in probability, uniformly in ϑ, for a fixed continuous b. With further conditions, O(n^{−3/2}) can be turned into O(n^{−2}).
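The first-order mechanics behind such corrections are easiest to see in a model where everything is closed-form. The Bernoulli(ϑ) i.i.d. sketch below is entirely our own toy example (not the HMM case): it shows the Newton step underlying the Taylor argument, and that the leading term Dl/(nI) already recovers the MLE exactly in this particular model.

```python
# Toy i.i.d. Bernoulli(theta) model with fixed data: the MLE is the sample
# mean, and we compare it with (i) one Newton step from a nearby theta0 and
# (ii) the leading term theta0 + Dl(theta0)/(n * I(theta0)) of the expansion.

ys = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # made-up data; MLE is the mean
n = len(ys)
s = sum(ys)

def Dl(t):   # first derivative of the loglikelihood
    return s / t - (n - s) / (1 - t)

def D2l(t):  # second derivative of the loglikelihood
    return -s / t**2 - (n - s) / (1 - t)**2

theta0 = 0.65                                   # expansion point
mle = s / n                                     # exact MLE (= 0.7 here)
one_step = theta0 - Dl(theta0) / D2l(theta0)    # Newton step from theta0
fisher = 1.0 / (theta0 * (1 - theta0))          # per-observation information
leading = theta0 + Dl(theta0) / (n * fisher)    # leading term of the expansion
```

Algebraically, Dl(ϑ)/(nI(ϑ)) = (s − nϑ)/n in this model, so `leading` equals the sample mean for any expansion point; the Newton step agrees only up to the higher-order terms the expansion keeps track of.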

Other second order asymptotic results of interest are Pfanzagl's second order optimality of functions of the MLE within classes of estimates with the same bias function (see for example Bickel, Götze and van Zwet, 1985), the validity of Bartlett's correction to the likelihood ratio test (see for example Bickel and Ghosh, 1990), the second order validity of bootstrap t-tests (see for example Hall, 1988), etc. The basic ingredients of debiasing are:

(a) A stochastic expansion for the MLE in terms of polynomials of the derivatives of l_ϑ(Y_1^n).

(b) Bounds on probabilities of intermediate and large deviations of the derivatives of l_ϑ(Y_1^n).

For the other types of results one further needs:

(c) Edgeworth expansions for the joint distribution of the first few derivatives of l_ϑ(Y_1^n) at ϑ_0.

As we shall see, our bounds give (a) directly. We conjecture that (b) can be established using results for sums of functions of Markov variables as in S&S. Results of type (c) under simple assumptions, although plausible, appear difficult to attain.

Here is the argument for (a) under (A1), (A2), and (A3) and real ϑ. Write ϑ̂ for the MLE. Then, by a Taylor expansion,

n^{−1/2} Dl_0(Y_1^n) = −n^{1/2} (ϑ̂ − ϑ_0) n^{−1} D²l_0(Y_1^n) − (1/2) n^{−1/2} n (ϑ̂ − ϑ_0)² n^{−1} D³l_{ϑ̄}(Y_1^n),

where ϑ̄ lies between ϑ̂ and ϑ_0. Suppose for simplicity that all entries of A are positive and that the derivatives of log g_ϑ are uniformly bounded in a neighborhood of ϑ_0. Then, by Theorem 2.1, under ϑ_0,

n^{1/2} (ϑ̂ − ϑ_0) = − ( n^{−1/2} Dl_0(Y_1^n) ) / ( n^{−1} D²l_0(Y_1^n) ) − (1/2) ( n^{−1/2} Dl_0(Y_1^n) / ( n^{−1} D²l_0(Y_1^n) ) )² n^{−1/2} ( n^{−1} D³l_0(Y_1^n) ) / ( n^{−1} D²l_0(Y_1^n) ) + O_p(n^{−1}).

But

n^{−1} D²l_0(Y_1^n) = −I(ϑ_0) + O_p(n^{−1/2})

by Theorem 2.3 (which can be viewed as a refinement of Lemma 2 of Bickel et al., 1998). Here I(ϑ) = −E_ϑ ( D² log p_ϑ(Y_1 | Y_{−∞}^0) ).

Finally we get

    n^{1/2}(ϑ̂ - ϑ_0) = n^{-1/2} Dl_{ϑ_0}(Y_1^n) I(ϑ_0)^{-1} + n^{-1/2} Dl_{ϑ_0}(Y_1^n) I(ϑ_0)^{-2} (n^{-1} D^2 l_{ϑ_0}(Y_1^n) + I(ϑ_0)) + 2^{-1} n^{-5/2} (Dl_{ϑ_0}(Y_1^n))^2 I(ϑ_0)^{-3} D^3 l_{ϑ_0}(Y_1^n) + O_p(n^{-1}),

the desired stochastic expansion.

4.2 Asymptotic expansions

The following result generalizes Theorem 3.18 of Petrie (1969). For simplicity we assume that the parameter is real.

Theorem 4.1. Assume that (A1)-(A3) hold. Then I(ϑ) and K(ϑ) are analytic functions. For instance,

    K(ϑ) = Σ_{j=0}^∞ D^j K(ϑ_0) (ϑ - ϑ_0)^j / j!

in some neighborhood of every ϑ_0 ∈ Θ. The series converges absolutely in some neighborhood of ϑ_0, since for every ϑ_0, |D^j K(ϑ_0)| ≤ C(ϑ_0)^j j!. Moreover,

    D^1 K(ϑ_0) = E Dl_{ϑ_0}(Y_1 | Y_{-∞}^0) = 0,
    D^2 K(ϑ_0) = -I(ϑ_0),
    D^{j+2} K(ϑ_0) = lim_n n^{-1} I_{n2}^{(j)}(ϑ_0),

where

    I_{nd}^{(j)}(ϑ_0) = (∂^j/∂ϑ^j) E_{ϑ_0}[l_ϑ^{(d)}(Y_1^n) L_ϑ(Y_1^n)] |_{ϑ=ϑ_0}.

The limit I_d^{(j)}(ϑ_0) exists under our assumptions and can be represented by

    Σ_{k=0}^j (j choose k) Σ_{J ∈ J^+(k+d)} ((k+d)!/J!) Σ_{I ∈ J_1(|J|), min(I)=1} E{Γ^{Y_1}(L_I^{(J)}/J!, L^{(j-k)}(Y_{max(I)-1}))},    (24)

where L_I = L(Y_I).

Proof. Note that I(ϑ) = -D^2 K(ϑ), so that it is enough to establish the claim for K(ϑ). Since

    I_{nd} = Σ_{i=1}^n E{l_ϑ^{(d)}(Y_i | Y_1^{i-1}) L_ϑ(Y_1^n)},

we obtain

    n^{-1} I_{nd}^{(j)}(ϑ_0) = (1/n) Σ_{i=1}^n Σ_{k=0}^j (j choose k) E{l^{(k+d)}(Y_i | Y_1^{i-1}) L^{(j-k)}(Y_1^n)},

and the bound follows from Theorem 2.2. The limit is clearly given by the similarly bounded derivative of the expression

    lim_n E{l^{(d)}(Y_1 | Y_{-n}^0) L(Y_{-n}^0)}.

We need the limit in this expression since L(Y_{-∞}^0) is not defined. The other representation follows by expanding the derivatives as in Proposition 3.1, expanding the cumulant function as in Lemma 3.3, and using this lemma and Lemma 3.5 to argue that the limit exists:

    lim_n n^{-1} I_{nd}^{(j)}(ϑ_0) = Σ_{k=0}^j (j choose k) Σ_{J ∈ J^+(k+d)} ((k+d)!/J!) Σ_{I ∈ J_1(|J|), min(I)=1} E{Γ^{Y_1}(L_I^{(J)}/J!, L^{(j-k)}(Y_{max(I)-1}))}.

Corollary 4.1. If under ϑ_0 the Y_i are i.i.d., then the sum in (24) becomes in principle computable, as

    E Γ^{Y_1}(L_I^{(J)}/J!, L^{(j-k)}(Y_{max(I)-1})) = 0

unless I_1 = 1 and I_{i+1} - I_i = 0 or 1, i = 1, ..., |J| - 1.

Proof. Unless the conditions above are satisfied, Γ^{Y_1}(L_I^{(J)}/J!, ·) = 0, since the indices involved could be split into two blocks of the form {i_1, ..., i_k} and {i_{k+1}, ..., i_{max(I)}} with i_{k+1} - i_k > 1.

Since all variables in L_I^{(J)} are at most 2-dependent, the conditional cumulant would have to vanish, because the variables involved could be split into independent blocks. We give some explicit computations for a special case below. We note that, unfortunately, the number of non-zero terms in Γ^{Y_1}(L_I^{(J)}/J!) grows exponentially as a function of |J|.

Corollary 4.2. If E|log p_{ϑ_0}(Y_1)| < ∞ and the conditions of Theorem 2.2 hold, then H(ϑ) is analytic.

Proof. H(ϑ) = K(ϑ) - E_{ϑ_0} log p_{ϑ_0}(Y_1 | Y_{-∞}^0) in this case.

4.3 Example: Information under independence and a two-state Markov chain with Gaussian observations

We consider now a reversible two-state Markov chain with normal observations. Let X_i ∈ {-1, 1}, P(X_{i+1} ≠ X_i | X_i) = p and Y_i = X_i + ε_i, where ..., ε_0, ε_1, ... are i.i.d. N(0, σ^2) random variables, σ^2 known, independent of the X process. We identify the parameter p with the ϑ of the general discussion and take ϑ_0 = 1/2.

One can derive the information from Proposition 3.1. Alternatively, one can compute directly, cf. Louis (1982) and Meilijson (1989):

    l^{(1)}_ϑ(Y_1^n) = E(l^{(1)}_ϑ(X_1^n, Y_1^n) | Y_1^n) - E(l^{(1)}_ϑ(X_1^n | Y_1^n) | Y_1^n),
    l^{(2)}_ϑ(Y_1^n) = E(l^{(2)}_ϑ(X_1^n, Y_1^n) | Y_1^n) - E(l^{(2)}_ϑ(X_1^n | Y_1^n) | Y_1^n)
                     = E(l^{(2)}_ϑ(X_1^n, Y_1^n) | Y_1^n) + var(l^{(1)}_ϑ(X_1^n | Y_1^n) | Y_1^n),
    E l^{(2)}_ϑ(Y_1^n) = E l^{(2)}_ϑ(X_1^n, Y_1^n) + E var(l^{(1)}_ϑ(X_1^n | Y_1^n) | Y_1^n).

However, if X_1, X_2, ... are i.i.d. under ϑ_0, then we can simplify this expression. Write l(X_1^n, Y_1^n) = Σ_t h_{ϑ,t}, where h_{ϑ,t} = h_{ϑ,t}(X_{t-1}, X_t, Y_t), and note that, under independence, ḣ_{ϑ,i} and ḣ_{ϑ,j}, |j - i| > 1, are independent given Y_1^n. Moreover,

    E cov(ḣ_{ϑ,2}, ḣ_{ϑ,3} | Y_1^n) = E E(ḣ_{ϑ,2} ḣ_{ϑ,3} | Y_1, Y_2, Y_3) - E(E(ḣ_{ϑ,2} | Y_1, Y_2) E(ḣ_{ϑ,3} | Y_2, Y_3)) = 0.

Hence

    I(ϑ_0) = -E E(ḧ_{ϑ,2} | Y_1, Y_2) - E var(ḣ_{ϑ,2} | Y_1, Y_2).

We specialize to our model, where

    h_{p,2} = ((1 - X_1 X_2)/2) log p + ((1 + X_1 X_2)/2) log(1 - p) + c

for some c. Here

    I(1/2) = 4(1 - E var(X_1 X_2 | Y_1, Y_2)) = 4 (E[(E(X_1 | Y_1))^2])^2,    (d/dp) I(p) |_{p=1/2} = 0.
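The value of I(1/2) can be checked by simulation. The sketch below is our illustration, not the paper's: it estimates the information at p = 1/2 from the variance of a finite-difference score computed with the forward algorithm, and compares it with the reading I(1/2) = 4(E[(E(X_1|Y_1))^2])^2 of the display above, using the posterior mean E(X_1|Y_1) = tanh(Y_1/σ^2). The parameter values, σ = 1 and the finite-difference step are assumptions of the sketch.

```python
import numpy as np

def loglik(Y, p, sigma=1.0):
    """Forward-algorithm log-likelihood, vectorized over the rows of Y.

    Symmetric two-state chain X_i in {-1, +1} with switching probability p
    and Y_i = X_i + N(0, sigma^2); constants of the Gaussian density are
    dropped since they do not depend on p."""
    states = np.array([-1.0, 1.0])
    T = np.array([[1 - p, p], [p, 1 - p]])
    alpha = np.full((Y.shape[0], 2), 0.5)      # stationary initial law
    ll = np.zeros(Y.shape[0])
    for t in range(Y.shape[1]):
        dens = np.exp(-0.5 * ((Y[:, t:t + 1] - states) / sigma) ** 2)
        alpha = alpha * dens
        c = alpha.sum(axis=1)
        ll += np.log(c)
        alpha = (alpha / c[:, None]) @ T       # one-step prediction
    return ll

rng = np.random.default_rng(1)
sigma, n, reps, eps = 1.0, 200, 4000, 1e-4
X = rng.choice([-1.0, 1.0], size=(reps, n))    # at p = 1/2 the X_i are i.i.d.
Y = X + rng.normal(0.0, sigma, size=(reps, n))
score = (loglik(Y, 0.5 + eps) - loglik(Y, 0.5 - eps)) / (2 * eps)
info_mc = score.var() / n

# closed-form reading: I(1/2) = 4 (E[(E(X1|Y1))^2])^2, E(X1|Y1) = tanh(Y1/sigma^2)
y1 = rng.choice([-1.0, 1.0], size=10**6) + rng.normal(0.0, sigma, size=10**6)
info_formula = 4.0 * np.mean(np.tanh(y1 / sigma**2) ** 2) ** 2
print(info_mc, info_formula)
```

The two estimates agree to within Monte Carlo error, which is consistent with the independence computation above.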

It is reasonable to conjecture the following result.

Theorem 4.2. Consider the two-state symmetric Markov chain with normal observations as above. Then I(p) has the following properties: (i) Symmetry: I(p) = I(1 - p); (ii) Unimodality, with minimum at 1/2; (iii) Unboundedness: 0 < lim inf_{p→0} p I(p) ≤ lim sup_{p→0} p I(p) ≤ 1.

The proof of this result cannot depend on the expansion. Here is an argument.

Proof. The first two properties will be proved by showing that for any p* between p and 1 - p there is a Markov kernel that does not depend on the unknown parameter p and transforms the observations Y_1, Y_2, ... into another sequence of variables Y*_1, Y*_2, ..., such that the latter follows the same model as the original observations but with parameter p*. This shows that I(p*) ≤ I(p) for any such p*, and in particular I(p) = I(1 - p). Let S_1, S_2, ... be i.i.d. Bernoulli random variables with mean α, independent of the Y process, and define

    Y*_i = (-1)^{Σ_{j=1}^i S_j} Y_i = (-1)^{Σ_{j=1}^i S_j} (X_i + ε_i) = X*_i + ε*_i.

Now, the ε*_i are still i.i.d. Gaussian, and X*_1, X*_2, ... is still Markovian, with values in {-1, 1}, but with probability of switching given by p* = (1 - α)p + α(1 - p).

We now prove the third property. We will argue that for any p there is an estimator of p, valid for values of the parameter in a small neighborhood of p, whose asymptotic variance, suitably normalized, remains controlled as p → 0. Since the information at p is larger than the inverse of the variance of any regular estimator, this yields the stated lower bound on p I(p).

Here are the details. We consider a net of models indexed by p ∈ (0, 1). The parameter space of the p-model is (p - p^2, p + p^2). We consider the limit as p → 0. Let m = m(p) be such that pm → 1/2. For a given p: it is easy to see, for example by induction, that

    P(X_i = 1 | X_1 = 1) = (1 + (1 - 2p)^{i-1})/2,

so that E(X_i | X_1 = 1) = (1 - 2p)^{i-1}. Hence

    μ_m(p) = E(m^{-1} Σ_{i=1}^m X_i | X_1 = 1) = m^{-1} Σ_{i=1}^m (1 - 2p)^{i-1} = (1 - (1 - 2p)^m)/(2pm) → 1 - e^{-1}.

Note that

    v_m(p) = E(m^{-1} Σ_{i=1}^m X_i)^2
           = 1/m + (2/m^2) Σ_{i=1}^{m-1} Σ_{j=i+1}^m E(X_i X_j)
           = 1/m + (2/m^2) Σ_{i=1}^{m-1} Σ_{j=i+1}^m (1 - 2p)^{j-i}
           = 1/m + ((m - 1)(1 - 2p))/(pm^2) - ((1 - 2p)^2 (1 - (1 - 2p)^{m-1}))/(2p^2 m^2)
           → 2e^{-1}.

Let 1 < T_1 < ... ≤ T_M ≤ m be the times of switches, that is, X_{T_i - 1} ≠ X_{T_i} (with M random). Then the process T_1/m, T_2/m, ..., T_M/m converges weakly to a Poisson process on (0, 1), and hence the limiting distribution of (m^{-1} Σ_{i=1}^m X_i)^2 is non-degenerate. Since the limiting distribution of (m^{-1} Σ_{i=1}^m ε_i)^2 is degenerate, the limiting distribution of (m^{-1} Σ_{i=1}^m Y_i)^2 is non-degenerate and equals the limiting distribution of (m^{-1} Σ_{i=1}^m X_i)^2.

Consider now the statistic

    v̂_m = (m/n) Σ_{j=0}^{[n/m]-1} (m^{-1} Σ_{i=1}^m Y_{jm+i})^2.

Then

    lim_{p→0} lim_{n→∞} (n/m) var(v̂_m) = c_v < ∞.

Suppose it is known that the parameter lies in (p - p^2, p + p^2), and let the estimator p̂ be defined by

    v_m(p̂) = v̂_m.

Differentiating the expression for v_m(p) above, it follows that

    p (d/dp) v_m(p) → -6 + 6e^{-1},

where the limit is taken as above. Hence

    lim inf_{p→0} p I(p) ≥ lim inf_{p→0} p (n var(p̂))^{-1} = 72 c_v^{-1} (1 - e^{-1})^2 > 0.

On the other hand, the information cannot be larger than the information when the X process is observed directly. The latter is of order p^{-1} in the limit.

Acknowledgement

We thank the anonymous referee for an exceptionally careful reading of the first version of the manuscript, leading to a substantially improved paper.

Appendix

Proof of Lemma 3.1. The first part is clear, since the kth derivative depends on h only through its first k derivatives at 0, and hence it will not be changed if h is replaced by any other function with the same first k derivatives. For the second part, we first observe that

    (∂^k/∂ϑ^k) f(ϑ, ϑ^2/2, ..., ϑ^k/k!) |_{ϑ=0} = Σ_J C(J, k) (∂^{|J|} f / ∂u_{J_1} ... ∂u_{J_{|J|}}) |_{u_1 = ... = u_k = 0},

where the constants C(J, k) do not depend on f, and, without loss of generality, C(J, k) = C(J', k) if J' is a permutation of J. We will find these constants by considering a convenient family of functions f. Let

    f(u_1, ..., u_k) = Π_{j=1}^k u_j^{m_j}/m_j!.

Then

    (∂^{|J|} f / ∂u_{J_1} ... ∂u_{J_{|J|}}) |_{u_1 = ... = u_k = 0} = 1 if J ∈ A(m_1, ..., m_k), 0 otherwise,

where A(m_1, ..., m_k) = {J ∈ J^+(Σ_i m_i) : Σ_j 1(J_j = i) = m_i, i = 1, ..., k}. Hence

    (∂^k/∂ϑ^k) f(ϑ, ϑ^2/2, ..., ϑ^k/k!) |_{ϑ=0} = C(J, k) |A(m_1, ..., m_k)| = C(J, k) (Σ_{i=1}^k m_i)! / Π_{i=1}^k m_i!,    (25)

where J is any member of A(m_1, ..., m_k). On the other hand, we can compute directly that

    f(ϑ, ϑ^2/2, ..., ϑ^k/k!) = ϑ^{Σ_i i m_i} / (Π_{i=1}^k m_i! Π_{i=1}^k (i!)^{m_i})

and

    (∂^k/∂ϑ^k) f(ϑ, ϑ^2/2, ..., ϑ^k/k!) |_{ϑ=0} = k! / (Π_{i=1}^k m_i! Π_{i=1}^k (i!)^{m_i}) if Σ_i i m_i = k, 0 otherwise.    (26)

Comparing (25) to (26), and noting that |J| = Σ_i m_i and J^+ = Σ_i i m_i, we obtain

    C(J, k) = k!/(|J|! J!) if J^+ = k, 0 otherwise,

and the proof is complete.

Proof of Lemma 3.5. The lemma follows since, for any two random variables U_1 and U_2 with joint density h_{U_1 U_2}(·,·) with respect to some measure µ_1 × µ_2,

    h^{(m)}_{U_1 U_2}(u_1, u_2) = Σ_{j=0}^m (m choose j) h^{(j)}_{U_1}(u_1) h^{(m-j)}_{U_2|U_1}(u_2 | u_1).

Hence, for any function f(·),

    E{f(U_1) h^{(m)}_{U_1 U_2}(U_1, U_2) / h_{U_1 U_2}(U_1, U_2)}
      = Σ_{j=0}^m (m choose j) ∫∫ f(u_1) h^{(j)}_{U_1}(u_1) h^{(m-j)}_{U_2|U_1}(u_2 | u_1) dµ_1(u_1) dµ_2(u_2)
      = ∫ f(u_1) h^{(m)}_{U_1}(u_1) dµ_1(u_1)
      = E{f(U_1) h^{(m)}_{U_1}(U_1) / h_{U_1}(U_1)},

since

    ∫ h^{(m-j)}_{U_2|U_1}(u_2 | u_1) dµ_2(u_2) = 1 if j = m, 0 if j ≠ m.

Now, let U_1 represent Y_a^b and U_2 all the Y's outside this range.

Lemma 4.1. (i) For any real-valued sequence a_1, ..., a_n,

    (Σ_{i=1}^n a_i)^k = Σ_{J ∈ J^+(k)} (k!/J!) Σ_{I ∈ J_1^n(|J|)} a_I^J,

with a_i = 0 for i > n. (ii) For any c > 0 and k > 0,

    Σ_{J ∈ J^+(k)} c^{|J|} = c(c + 1)^{k-1},    Σ_{J ∈ J^+(k)} c^{|J|}/|J|! ≤ (c ∨ k)^k 2^{k-1}/k!.

Proof. Part (i) follows by expanding the expression on the left-hand side. For part (ii), let a_k = Σ_{J ∈ J^+(k)} c^{|J|}. Then, for x small enough,

    Σ_{k=1}^∞ a_k x^k = Σ_{k=1}^∞ Σ_{J ∈ J^+(k)} c^{|J|} x^{J^+}
                      = Σ_{ℓ=1}^∞ (Σ_{i=1}^∞ c x^i)^ℓ
                      = Σ_{ℓ=1}^∞ (cx/(1 - x))^ℓ
                      = cx/(1 - (c + 1)x)
                      = (c/(c + 1)) Σ_{k=1}^∞ ((c + 1)x)^k.
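The closed form in part (ii) can be checked by brute force for small k. The sketch below is our illustration (the enumeration helper is not from the paper): it enumerates J⁺(k), the compositions of k, and verifies Σ c^{|J|} = c(c + 1)^{k−1} together with |J⁺(k)| = 2^{k−1}.

```python
from itertools import combinations

def compositions(k):
    """Enumerate J^+(k): ordered tuples of positive integers summing to k."""
    out = []
    for r in range(k):                       # r = number of cut points
        for cuts in combinations(range(1, k), r):
            bounds = (0,) + cuts + (k,)
            out.append(tuple(bounds[i + 1] - bounds[i]
                             for i in range(len(bounds) - 1)))
    return out

c = 3.0
for k in range(1, 9):
    J = compositions(k)
    assert len(J) == 2 ** (k - 1)            # |J^+(k)| = 2^{k-1}
    total = sum(c ** len(j) for j in J)
    assert abs(total - c * (c + 1) ** (k - 1)) < 1e-9
print("identity verified for k = 1..8")
```

Each composition corresponds to a choice of cut points among the k − 1 gaps of 1 + 1 + ... + 1, which is where the count 2^{k−1} comes from.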

The first claim follows. To prove the second claim, note that if c ≥ k then c^{|J|}/|J|! ≤ c^k/k! (since |J| ≤ k) and Σ_{J ∈ J^+(k)} 1 = 2^{k-1} (e.g., by the first part). On the other hand, if c ≤ k then c^{|J|}/|J|! ≤ k^k/k!.

References

Barndorff-Nielsen, O. and Cox, D.R. (1989). Asymptotic Techniques for Use in Statistics. Chapman and Hall, London.

Baum, L.E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist. 37.

Bickel, P.J., Götze, F. and van Zwet, W.R. (1985). A simple analysis of third-order efficiency of estimates. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Vol. II. Wadsworth, Belmont, CA.

Bickel, P.J. and Ghosh, J.K. (1990). A decomposition for the likelihood ratio statistic and the Bartlett correction: a Bayesian argument. Ann. Statist. 18.

Bickel, P.J. and Ritov, Y. (1996). Inference in hidden Markov models I: Local asymptotic normality in the stationary case. Bernoulli 2.

Bickel, P.J., Ritov, Y. and Rydén, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. Ann. Statist. 26.

Douc, R., Moulines, E. and Rydén, T. (2001). Asymptotic properties of the maximum likelihood estimator in autoregressive models with Markov regime. Preprint.

Doukhan, P. (1994). Mixing: Properties and Examples. Springer Lecture Notes in Statistics 85. Springer-Verlag, New York.

Fredkin, D.R. and Rice, J.A. (1992). Maximum likelihood estimation and identification directly from single-channel recordings. Proc. Roy. Soc. Lond. B 249.

Hall, P. (1988). Rate of convergence in bootstrap approximations. Ann. Probab. 16.

Jensen, J.L. and Petersen, N.V. (1999). Asymptotic normality of the maximum likelihood estimator in state space models. Ann. Statist. 27.

Kalman, R.E. (1977). A new approach to linear filtering and prediction problems. In Linear Least-Squares Estimation. Dowden, Hutchinson & Ross, Stroudsburg, PA.

Leroux, B.G. (1992). Maximum-likelihood estimation for hidden Markov models. Stoch. Proc. Appl. 40.

Leroux, B.G. and Puterman, M.L. (1992). Maximum-penalized-likelihood estimation for independent and Markov-dependent mixture models. Biometrics 48.

Louis, T.A. (1982). Finding the observed information matrix when using the EM algorithm. J. Roy. Statist. Soc. B 44.

MacDonald, I.L. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-valued Time Series. Chapman & Hall, London.

Meilijson, I. (1989). A fast improvement to the EM algorithm on its own terms. J. Roy. Statist. Soc. B 51.

Petrie, T. (1969). Probabilistic functions of finite state Markov chains. Ann. Math. Statist. 40.

Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77.

Saulis, L. and Statulevičius, V.A. (1991). Limit Theorems for Large Deviations. Kluwer, Dordrecht.


More information

This article was published in an Elsevier journal. The attached copy is furnished to the author for non-commercial research and education use, including for instruction at the author s institution, sharing

More information

A Generalization of Wigner s Law

A Generalization of Wigner s Law A Generalization of Wigner s Law Inna Zakharevich June 2, 2005 Abstract We present a generalization of Wigner s semicircle law: we consider a sequence of probability distributions (p, p 2,... ), with mean

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

Course 495: Advanced Statistical Machine Learning/Pattern Recognition

Course 495: Advanced Statistical Machine Learning/Pattern Recognition Course 495: Advanced Statistical Machine Learning/Pattern Recognition Lecturer: Stefanos Zafeiriou Goal (Lectures): To present discrete and continuous valued probabilistic linear dynamical systems (HMMs

More information

A combinatorial problem related to Mahler s measure

A combinatorial problem related to Mahler s measure A combinatorial problem related to Mahler s measure W. Duke ABSTRACT. We give a generalization of a result of Myerson on the asymptotic behavior of norms of certain Gaussian periods. The proof exploits

More information

Example: physical systems. If the state space. Example: speech recognition. Context can be. Example: epidemics. Suppose each infected

Example: physical systems. If the state space. Example: speech recognition. Context can be. Example: epidemics. Suppose each infected 4. Markov Chains A discrete time process {X n,n = 0,1,2,...} with discrete state space X n {0,1,2,...} is a Markov chain if it has the Markov property: P[X n+1 =j X n =i,x n 1 =i n 1,...,X 0 =i 0 ] = P[X

More information

Notes on Poisson Approximation

Notes on Poisson Approximation Notes on Poisson Approximation A. D. Barbour* Universität Zürich Progress in Stein s Method, Singapore, January 2009 These notes are a supplement to the article Topics in Poisson Approximation, which appeared

More information

ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS

ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS C. BESSENRODT AND S. VAN WILLIGENBURG Abstract. Confirming a conjecture made by Bessenrodt and Kleshchev in 1999, we classify

More information

Estimation of parametric functions in Downton s bivariate exponential distribution

Estimation of parametric functions in Downton s bivariate exponential distribution Estimation of parametric functions in Downton s bivariate exponential distribution George Iliopoulos Department of Mathematics University of the Aegean 83200 Karlovasi, Samos, Greece e-mail: geh@aegean.gr

More information

Lines With Many Points On Both Sides

Lines With Many Points On Both Sides Lines With Many Points On Both Sides Rom Pinchasi Hebrew University of Jerusalem and Massachusetts Institute of Technology September 13, 2002 Abstract Let G be a finite set of points in the plane. A line

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

List coloring hypergraphs

List coloring hypergraphs List coloring hypergraphs Penny Haxell Jacques Verstraete Department of Combinatorics and Optimization University of Waterloo Waterloo, Ontario, Canada pehaxell@uwaterloo.ca Department of Mathematics University

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

Expectation Maximization (EM) Algorithm. Each has it s own probability of seeing H on any one flip. Let. p 1 = P ( H on Coin 1 )

Expectation Maximization (EM) Algorithm. Each has it s own probability of seeing H on any one flip. Let. p 1 = P ( H on Coin 1 ) Expectation Maximization (EM Algorithm Motivating Example: Have two coins: Coin 1 and Coin 2 Each has it s own probability of seeing H on any one flip. Let p 1 = P ( H on Coin 1 p 2 = P ( H on Coin 2 Select

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data Song Xi CHEN Guanghua School of Management and Center for Statistical Science, Peking University Department

More information

Feedback Capacity of a Class of Symmetric Finite-State Markov Channels

Feedback Capacity of a Class of Symmetric Finite-State Markov Channels Feedback Capacity of a Class of Symmetric Finite-State Markov Channels Nevroz Şen, Fady Alajaji and Serdar Yüksel Department of Mathematics and Statistics Queen s University Kingston, ON K7L 3N6, Canada

More information

arxiv: v2 [math.pr] 4 Sep 2017

arxiv: v2 [math.pr] 4 Sep 2017 arxiv:1708.08576v2 [math.pr] 4 Sep 2017 On the Speed of an Excited Asymmetric Random Walk Mike Cinkoske, Joe Jackson, Claire Plunkett September 5, 2017 Abstract An excited random walk is a non-markovian

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 12 Dynamical Models CS/CNS/EE 155 Andreas Krause Homework 3 out tonight Start early!! Announcements Project milestones due today Please email to TAs 2 Parameter learning

More information

KALMAN-TYPE RECURSIONS FOR TIME-VARYING ARMA MODELS AND THEIR IMPLICATION FOR LEAST SQUARES PROCEDURE ANTONY G AU T I E R (LILLE)

KALMAN-TYPE RECURSIONS FOR TIME-VARYING ARMA MODELS AND THEIR IMPLICATION FOR LEAST SQUARES PROCEDURE ANTONY G AU T I E R (LILLE) PROBABILITY AND MATHEMATICAL STATISTICS Vol 29, Fasc 1 (29), pp 169 18 KALMAN-TYPE RECURSIONS FOR TIME-VARYING ARMA MODELS AND THEIR IMPLICATION FOR LEAST SQUARES PROCEDURE BY ANTONY G AU T I E R (LILLE)

More information

arxiv: v1 [math.co] 3 Nov 2014

arxiv: v1 [math.co] 3 Nov 2014 SPARSE MATRICES DESCRIBING ITERATIONS OF INTEGER-VALUED FUNCTIONS BERND C. KELLNER arxiv:1411.0590v1 [math.co] 3 Nov 014 Abstract. We consider iterations of integer-valued functions φ, which have no fixed

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

arxiv: v3 [math.ac] 29 Aug 2018

arxiv: v3 [math.ac] 29 Aug 2018 ON THE LOCAL K-ELASTICITIES OF PUISEUX MONOIDS MARLY GOTTI arxiv:1712.00837v3 [math.ac] 29 Aug 2018 Abstract. If M is an atomic monoid and x is a nonzero non-unit element of M, then the set of lengths

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

HIDDEN MARKOV CHANGE POINT ESTIMATION

HIDDEN MARKOV CHANGE POINT ESTIMATION Communications on Stochastic Analysis Vol. 9, No. 3 (2015 367-374 Serials Publications www.serialspublications.com HIDDEN MARKOV CHANGE POINT ESTIMATION ROBERT J. ELLIOTT* AND SEBASTIAN ELLIOTT Abstract.

More information

Using Markov Chains To Model Human Migration in a Network Equilibrium Framework

Using Markov Chains To Model Human Migration in a Network Equilibrium Framework Using Markov Chains To Model Human Migration in a Network Equilibrium Framework Jie Pan Department of Mathematics and Computer Science Saint Joseph s University Philadelphia, PA 19131 Anna Nagurney School

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Multivariate Analysis and Likelihood Inference

Multivariate Analysis and Likelihood Inference Multivariate Analysis and Likelihood Inference Outline 1 Joint Distribution of Random Variables 2 Principal Component Analysis (PCA) 3 Multivariate Normal Distribution 4 Likelihood Inference Joint density

More information

Z-estimators (generalized method of moments)

Z-estimators (generalized method of moments) Z-estimators (generalized method of moments) Consider the estimation of an unknown parameter θ in a set, based on data x = (x,...,x n ) R n. Each function h(x, ) on defines a Z-estimator θ n = θ n (x,...,x

More information

HIGHER ORDER CUMULANTS OF RANDOM VECTORS, DIFFERENTIAL OPERATORS, AND APPLICATIONS TO STATISTICAL INFERENCE AND TIME SERIES

HIGHER ORDER CUMULANTS OF RANDOM VECTORS, DIFFERENTIAL OPERATORS, AND APPLICATIONS TO STATISTICAL INFERENCE AND TIME SERIES HIGHER ORDER CUMULANTS OF RANDOM VECTORS DIFFERENTIAL OPERATORS AND APPLICATIONS TO STATISTICAL INFERENCE AND TIME SERIES S RAO JAMMALAMADAKA T SUBBA RAO AND GYÖRGY TERDIK Abstract This paper provides

More information

Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses

Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses Ann Inst Stat Math (2009) 61:773 787 DOI 10.1007/s10463-008-0172-6 Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses Taisuke Otsu Received: 1 June 2007 / Revised:

More information

A CLT FOR MULTI-DIMENSIONAL MARTINGALE DIFFERENCES IN A LEXICOGRAPHIC ORDER GUY COHEN. Dedicated to the memory of Mikhail Gordin

A CLT FOR MULTI-DIMENSIONAL MARTINGALE DIFFERENCES IN A LEXICOGRAPHIC ORDER GUY COHEN. Dedicated to the memory of Mikhail Gordin A CLT FOR MULTI-DIMENSIONAL MARTINGALE DIFFERENCES IN A LEXICOGRAPHIC ORDER GUY COHEN Dedicated to the memory of Mikhail Gordin Abstract. We prove a central limit theorem for a square-integrable ergodic

More information