Hidden Markov model likelihoods and their derivatives behave like i.i.d. ones: Details.


Hidden Markov model likelihoods and their derivatives behave like i.i.d. ones: Details.

Peter J. Bickel, Ya'acov Ritov, Tobias Rydén

Abstract. We consider the log-likelihood function of hidden Markov models, its derivatives and expectations of these (such as different information functions). We give explicit expressions for these functions and bound them as the size of the chain increases. We apply our bounds to obtain partial second order asymptotics and some qualitative properties of a special model, as well as to extend some results of Petrie (1969).

Key words and phrases: Hidden Markov model, incomplete data, missing data, asymptotic normality. AMS 1991 classification: 62M09.

1 Introduction

In two previous papers, Bickel and Ritov (1996) and Bickel, Ritov and Rydén (1998), we developed the theory of inference for hidden Markov models (HMMs) when the state space of the chain is finite but the hiding mechanism is not finitely-valued. We recall the definition of these models. An HMM is a discrete-time stochastic process {(X_t, Y_t)} such that (i) {X_t} is a Markov chain, and (ii) given {X_t}, {Y_t} is a sequence of conditionally independent random variables, with the conditional distribution of Y_n depending on {X_t} only through X_n. The distribution of {Y_t} is assumed to depend smoothly on a Euclidean parameter θ. Equivalently, an HMM can be thought of as Y_n = h(X_n, ε_n), where {ε_t} is an i.i.d. sequence independent of {X_t}; that is, Y_n is a stochastic function of X_n. The parameter mentioned before labels both the transition probability matrix of the chain, the function h and the distribution of {ε_t}, although in principle the latter can be taken as fixed,

The research of the first two authors was partially supported by NSF grant DMS and by US/Israel Bi-National Science Foundation grant 9-31/2. The third author was supported by a Swedish Research Council grant.
Department of Statistics, University of California, Evans Hall, Berkeley, CA 94720, USA; Department of Statistics, The Hebrew University, Jerusalem 91905, Israel; Centre for Mathematical Sciences, Lund University, Box 118, S-221 00 Lund, Sweden

for example uniform on (0, 1). In this generality, HMMs include state space models, cf. Kalman (1977). We, in this paper as before, restrict ourselves to the case where the state space of {X_t} is finite. HMMs have during the last decade become widespread for modeling sequences of weakly dependent random variables, with applications in areas like speech processing (Rabiner, 1989), neurophysiology (Fredkin and Rice, 1992), and biology (Leroux and Puterman). See also the monograph by MacDonald and Zucchini (1997). Inference for HMMs was first considered by Baum and Petrie, who treated the case when {Y_t} takes values in a finite set. In Baum and Petrie (1966), results on consistency and asymptotic normality of the maximum-likelihood estimator (MLE) are given, and the conditions for consistency are weakened in Petrie (1969). Baum and Petrie, and particularly Petrie, also studied the structure of the map ϑ ↦ E_ϑ log p_ϑ(Y_1 | Y_0, Y_{−1}, ...), the conditional limiting entropy per observation. The results of Bickel et al. (1998) depend critically on bounds on derivatives of the loglikelihood of the observations. Specifically, we showed that under Cramér-type conditions at ϑ_0,

sup { |(∂^k/∂ϑ^k) log p_ϑ(Y_1, ..., Y_n)| : |ϑ − ϑ_0| ≤ ε } ≤ M_k(Y_1, ..., Y_n), (1)

where E_{ϑ_0} M_k(Y_1, ..., Y_n) ≤ C < ∞ for k = 1, 2, as well as quasi-continuity of the map ϑ ↦ (∂²/∂ϑ²) log p_ϑ(Y_1, ..., Y_n) at ϑ_0. The same type of bounds were applied to state space models by Jensen and Petersen (1999). Similar bounds were obtained and used in the case where Y is finitely supported by Baum and Petrie (1966) and Petrie (1969). They were actually able to show that if ϑ is the transition probability matrix of the Markov chain (and when Y is finitely supported this is the most general model) and Θ = {ϑ : ϑ_ij ≥ δ > 0 for all i, j}, then the map ϑ ↦ E_ϑ log p_ϑ(Y_1 | Y_0, Y_{−1}, ...) has a convergent series expansion everywhere on Θ.
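The representation Y_n = h(X_n, ε_n) above can be made concrete in a few lines. Everything in the sketch below (the two-state chain, the transition matrix, the emission function h) is an invented toy choice, used only to illustrate the definition:

```python
import random

# Toy illustration of Y_n = h(X_n, eps_n): {X_t} is a two-state Markov chain
# and each observation is a fixed function h of the current state and an
# independent uniform noise eps_n on (0, 1). All numbers here are made up.

A = [[0.9, 0.1],
     [0.2, 0.8]]           # transition probability matrix of {X_t}
means = [0.0, 3.0]          # emission parameter attached to each state

def h(x, eps):
    # A stochastic function of the state: shift uniform noise by a
    # state-dependent location.
    return means[x] + (eps - 0.5)

def simulate(n, seed=0):
    rng = random.Random(seed)
    x = 0
    xs, ys = [], []
    for _ in range(n):
        xs.append(x)
        ys.append(h(x, rng.random()))    # eps_n i.i.d. uniform on (0, 1)
        x = 0 if rng.random() < A[x][0] else 1
    return xs, ys

xs, ys = simulate(5)
```

Because h only shifts the noise, each observation sits within 1/2 of the mean attached to its hidden state, which is what makes the state "hidden but identifiable" in this toy case.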
Our technical goals are threefold:

(i) To exhibit bounds on the derivatives of the form (1) and to show that, under some conditions, M_k grows no faster than nC^k k!. We shall derive these bounds by a unified argument relying on results of Saulis and Statulevičius (1991) (henceforth referred to as S&S).

(ii) To obtain bounds on

sup_ϑ | (∂^k/∂ϑ^k) E_ϑ ( (∂^m/∂ϑ^m) log p_ϑ(Y_1 | Y_0, Y_{−1}, ...) ) |.

We give conditions under which this expression is bounded by C^{k+m} (k + m)!.

(iii) To establish bounds on

| (∂^k/∂ϑ^k) ( log p_ϑ(Y_1 | Y_0, Y_{−1}, ..., Y_{−t}) − log p_ϑ(Y_1 | Y_0, Y_{−1}, ..., Y_{−s}) ) |

of the form Cρ^{t∧s} for ρ < 1.

Using these results, we shall, under suitable conditions,

(a) Show how to establish stochastic asymptotic expansions for the MLE in terms of derivatives of the loglikelihood at ϑ_0. We sketch how, in conjunction with Edgeworth type expansions for sums of functions of Markov chains, these establish the validity of procedures such as debiasing the MLE and other second order methods as available in the i.i.d. case.

(b) Show that the following functions are analytic: the Fisher information

I(ϑ) = −E_ϑ { (∂²/∂ϑ²) log p_ϑ(Y_1 | Y_0, Y_{−1}, ...) },

the Kullback-Leibler distance

K(ϑ) = E_{ϑ_0} { log ( p_{ϑ_0}(Y_1 | Y_0, Y_{−1}, ...) / p_ϑ(Y_1 | Y_0, Y_{−1}, ...) ) }

and the entropy

H(ϑ) = E_ϑ { log p_ϑ(Y_1 | Y_0, Y_{−1}, ...) }.

(c) Study the behavior of these functions and their derivatives at points ϑ under which the X's are i.i.d. (the transition probability matrix is degenerate). We show that at such points, in principle, these quantities can be computed explicitly.

(d) Show how to use these expansions qualitatively to guess properties of I(ϑ) which can be established in other ways.

2 Assumptions and main results

We observe Y-valued random variables Y_1, ..., Y_n, where Y is a general space, distributed as follows. We let {X_t}_{t=1}^n be a stationary Markov chain with state space {1, 2, ..., R} and transition probability matrix A_ϑ = {α_ϑ(·, ·)}. Then, Y_1, ..., Y_n are conditionally independent given X_1, ..., X_n and the conditional distribution of Y_t depends on X_t only. We

assume that these distributions have densities g_ϑ(y | x), where y ∈ Y and x ∈ {1, ..., R}, with respect to some common σ-finite measure on Y. Thus, given X_t = i, Y_t has conditional density g_ϑ(· | i). We assume that {X_t} is ergodic, so that the stationary distribution exists and is unique. We denote it by π_ϑ(·). The parameter ϑ lies in Θ ⊂ R^d, where Θ is open. All computations will be done under a particular value of ϑ, denoted by ϑ_0. We write Y_s^t for (Y_s, ..., Y_t) and define X_s^t similarly. We can and will embed Y_1, ..., Y_n into {Y_t}_{t=−∞}^∞, related to {X_t}_{t=−∞}^∞ by the mechanism described above. When this is done, we write Y^t for (..., Y_{t−1}, Y_t) and define X^t similarly. Probabilities and expectations will be denoted by P and E, respectively, and the conditional expectation given some random variable Y by E^Y. Likelihood ratios (with respect to ϑ_0) will be denoted by L and loglikelihood ratios by l. Note that we use the same characters to denote different functions; the specific function will be clear from the argument. If the value of the parameter is ϑ_0, we often replace it by 0. Thus L_ϑ(Y_1^n) is the likelihood ratio of Y_1^n calculated at ϑ, whereas l_0(X, Y) is the loglikelihood ratio of X and Y (being some general random variables) calculated at ϑ_0. To further shorten the notation, we let

h_t(ϑ) = l_ϑ(X_t, Y_t | X_1^{t−1}, Y_1^{t−1}) for t ≥ 1.

Thus, by the very definition of an HMM, h_t(ϑ) = log α_ϑ(X_{t−1}, X_t) + log g_ϑ(Y_t | X_t) for t > 1 and h_1(ϑ) = log π_ϑ(X_1) + log g_ϑ(Y_1 | X_1). If a = (a_1, ..., a_d) is a d-dimensional vector with non-negative integer entries, then D^a denotes the corresponding partial derivative ∂^{a_+}/(∂ϑ_1^{a_1} ⋯ ∂ϑ_d^{a_d}), where a_+ = Σ_1^d a_i. We call such an a a multi-index. Furthermore we define

C_k(y) = sup_{ϑ∈V} max_{a_+=k} max_{i,j} { |D^a log α_ϑ(i, j)| + |D^a log g_ϑ(y | i)| + |D^a log π_ϑ(i)| }, (2)

with V being a neighborhood of ϑ_0 and the outer maximum being taken over all multi-indices a with a_+ = k, and

B_k = max { E ( ∏_{t=1}^r ∏_{i=p_{t−1}+1}^{p_t} C_{j_i}(Y_t)/j_i! | X_t = x_t, t = 1, ..., r ) : 1 ≤ m ≤ k, 1 ≤ r ≤ m, j_1, ..., j_m ≥ 1, Σ_i j_i = k, 0 = p_0 < p_1 < ... < p_r = m, x_1, ..., x_r ∈ {1, ..., R} }.

The last quantity measures how big, in expectation, we can make a product of partial derivatives of the h_t by distributing a total of k derivatives and possible time indices over different components of the parameter and time, respectively. In particular, if C̄_k(y) = max{C_m(y) : 1 ≤ m ≤ k}, then

B_k ≤ max { E ( C̄_k(Y_1)^m | X_1 = x ) : x ∈ {1, ..., R}, 1 ≤ m ≤ k } ≤ C^k,

if all C_k(y) ≤ C < ∞. On the other hand, if the Y_t are mutually independent, then

B_k ≤ ∏_{m=1}^k ( 1 ∨ max { E ( C_m(Y_1) | X_1 = x ) : x ∈ {1, ..., R} } ).

We now state the main assumptions used in this paper.

(A1) The entries of the transition probability matrix A_ϑ are bounded away from zero on V.

(A1′) Condition (A1) holds for all ϑ_0.

(A2k) For all i, j, the maps ϑ ↦ log α_ϑ(i, j) and ϑ ↦ log π_ϑ(i) have k continuous derivatives in the neighborhood V of ϑ_0, and for all i and y ∈ Y, ϑ ↦ log g_ϑ(y | i) has k continuous derivatives in the same neighborhood.

(A2) All log α_ϑ(i, j) and log π_ϑ(i) and all their derivatives are uniformly bounded.

(A3k) B_k < ∞.

(A3) All derivatives of log g_ϑ(y | i) are uniformly bounded in y.

We now state our main results.

Theorem 2.1. Assume (A1), (A2k) and (A3k) hold. Then for all multi-indices a with a_+ = k,

E sup_{ϑ∈V} |D^a l_ϑ(Y_1^n)| ≤ C_1 n B_k C_2^k k!

for some C_1 and C_2 that depend on the transition probability matrices A_ϑ, ϑ ∈ V, only.

Theorem 2.2. Let a and b be multi-indices with a_+ = k and b_+ = m, respectively. Then the following assertions hold true.

(i) Under (A1), (A2k) and (A3k),

E |D^a l_0(Y_1 | Y_{−n}^0)| ≤ C_1 B_k C_2^k k!,

where C_1 and C_2 depend on the transition probability matrix A only.

(ii) Under (A1), (A2) and (A3), if n ≥ t then

|D^a P(X_1 = x | Y_{−n}^0) − D^a P(X_1 = x | Y_{−t}^0)| ≤ C_3 ρ^t, ρ < 1,

where C_3 depends on the uniform bound on all derivatives and on A.

(iii) Under (A1), (A2k) and (A3k),

E |D^a l_0(Y_1 | Y_{−n}^0) D^b L_0(Y_{−n}^1)| ≤ C_3 k! m!.

Remarks.

1. The bounds in (i) and (iii) are the same as in the case {Y_i} i.i.d.

2. The bound of (ii) expresses a strong mixing property. We could prove versions of (ii) and (iii) under (A1), (A2k) and (A3k), but the results are technically complicated and we leave them to the reader.

Theorem 2.3. Under (A1), (A2) and (A3) the following assertions hold true.

(i) D^a l_0(Y_1 | Y_{−n}^0) converges P_0-a.s. to D^a l_0(Y_1 | Y_{−∞}^0).

(ii) E_0 ( D^a l_0(Y_1 | Y_{−n}^0) D^b L_0(Y_{−n}^1) ) converges to an appropriate limit as n → ∞.

(iii) n^{−1/2} ( D^a l_0(Y_1^n) − E_0 D^a l_0(Y_1^n) ) converges weakly to a N(0, Var_0(D^a l_0(Y_1 | Y_{−∞}^0))) distribution under P_0.

3 Proofs of main results

Throughout the remainder of the paper we shall assume that the parameter space Θ is one-dimensional, that is, d = 1. This causes no loss of generality, but simplifies the notation as we do not need to work with mixed partial derivatives. Derivatives with respect to ϑ of order k will be denoted with superindex (k), for example L_ϑ^{(k)}. In many of the proofs, cumulants play a major role. Let Z_1, ..., Z_k be k random variables. We denote their cumulant by

Γ(Z_1, ..., Z_k) = (1/ι^k) ( ∂^k/(∂u_1 ⋯ ∂u_k) ) log E e^{ι(u_1 Z_1 + ⋯ + u_k Z_k)} |_{u_1 = ... = u_k = 0}, (3)

where ι = √−1. The cumulant is a multilinear function (that is, it is linear in any of the random variables if all other variables are kept fixed) and, in particular, if Z_1 = ... = Z_k then Γ(Z_1, ..., Z_k) is the standard kth cumulant of Z_1. Finally, if Z = (Z_1, ..., Z_k) we write Γ(Z) = Γ(Z_1, ..., Z_k), and Γ^W(Z) for the cumulant of the conditional distribution of Z given W.

For the proof of Theorem 2.1 we proceed as follows. (i) We write l_0^{(k)}(Y_1^n) as a linear combination of conditional cumulants of Σ_{t=1}^n h_t^{(j)}(ϑ_0), 1 ≤ j ≤ k, by means of a general formula valid for any latent variable model, see (4). (ii) By a generalization of results by S&S we give a bound on the individual conditional cumulants of the form (1) to yield the result.
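Before the cumulant machinery, it may help to recall how the quantity being differentiated, log p_ϑ(Y_1^n), is actually evaluated: the standard forward recursion accumulates the conditional terms log p_ϑ(Y_t | Y_1^{t−1}). The two-state chain with binary emissions below is an invented toy example, not anything from the paper:

```python
import math

# Scaled forward recursion for log p(Y_1,...,Y_n) in a toy HMM: stationary
# initial law pi, transition matrix alpha, and a small discrete emission
# table g[x][y] standing in for the densities g(y | x). Numbers are made up.

alpha = [[0.7, 0.3],
         [0.4, 0.6]]
pi = [4.0 / 7.0, 3.0 / 7.0]        # stationary law of alpha (pi A = pi)
g = [{0: 0.8, 1: 0.2},             # g(y | x = 0)
     {0: 0.3, 1: 0.7}]             # g(y | x = 1)

def loglik(ys):
    # p holds the filter P(X_t = x | Y_1^t) up to normalization; the running
    # sum of log normalizers is log p(Y_1^n) = sum_t log p(Y_t | Y_1^{t-1}).
    p = [pi[x] * g[x][ys[0]] for x in range(2)]
    ll = math.log(sum(p))
    p = [v / sum(p) for v in p]
    for y in ys[1:]:
        q = [sum(p[i] * alpha[i][j] for i in range(2)) * g[j][y]
             for j in range(2)]
        ll += math.log(sum(q))
        p = [v / sum(q) for v in q]
    return ll

ll = loglik([0, 1, 1, 0])
```

The recursion is exactly the marginalization over the hidden path done incrementally, which is why it can be checked against brute-force summation over all state sequences on short data.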

For the proof of Theorem 2.2(i), we use the same formula, expressing l^{(k)}(Y_1 | Y_{−n}^0) as l^{(k)}(Y_{−n}^1) − l^{(k)}(Y_{−n}^0) and analyzing it in the same way as in Theorem 2.1. For part (ii) we use a further decomposition into so-called centered moments; see below and part (ii) of Theorem 2.2. We also relate L^{(m)} to the h^{(i)}, 1 ≤ i ≤ m, by another general formula.

For the general formula, we need additional notation. Let Z_+ be the set of positive integers. We let J = {(J_1, ..., J_k) : k > 0, J_j ∈ Z_+, j = 1, ..., k}. For J ∈ J, |J| denotes the dimension of the vector J and J_+ = Σ J_j. We define the following subsets of J: J(k) = {J ∈ J : |J| = k} and J_+(k) = {J ∈ J : J_+ = k}. Another useful set of integer vectors is J_m^n(k) = {(I_1, ..., I_k) : I_i ∈ Z_+, m ≤ I_i ≤ n, i = 1, ..., k}. For any integer vector I as above, min I = min_i I_i, max I = max_i I_i and Δ(I) = max I − min I. Furthermore, if a_1, a_2, ... is any sequence and |I| = k, then a_I = (a_{I_1}, ..., a_{I_k}). An operation between two sequences is done term-wise, so that a_I/b_I = (a_{I_1}/b_{I_1}, ..., a_{I_k}/b_{I_k}). In general any operation is meant to be term by term, so J! = (J_1!, J_2!, ...), ∏ a_I = ∏_{i=1}^k a_{I_i}, and a very typical expression in this paper is

f_I^{(J)}/J! = ( f_{I_1}^{(J_1)}/J_1!, ..., f_{I_k}^{(J_k)}/J_k! ).

We now let X and Y be any random vectors such that only Y is observed, with X being missing at random.

Proposition 3.1. Suppose the loglikelihood ratio of the full data model, l_ϑ(X, Y), is k times differentiable. Then the loglikelihood ratio of the observable model is k times differentiable and

l^{(k)}(Y) = Σ_{J∈J_+(k)} (k!/|J|!) Γ^Y ( l^{(J)}(X, Y)/J! ), (4)

L^{(k)}(Y) = Σ_{J∈J_+(k)} (k!/|J|!) ∏ ( l^{(J)}(Y)/J! ). (5)

The first of these formulae may be viewed as a generalization of results of Louis (1982) and Meilijson (1989), relating the score function and observed information of the observable vector Y to those of the full model. The above proposition provides results also for

higher order derivatives. Both of the statements of the proposition are closely related to the exlog relations in Barndorff-Nielsen and Cox (1989, pp. 14). For the proof we need the following form of Faà di Bruno's formula, whose proof is in the Appendix.

Lemma 3.1. (i) For any functions f : R → R and h : R → R with h(0) = 0 and f and h being k times differentiable,

(∂^k/∂ϑ^k) f(h(ϑ)) |_{ϑ=0} = (∂^k/∂ϑ^k) f ( Σ_{i=0}^k ϑ^i h^{(i)}(0)/i! ) |_{ϑ=0}.

(ii) If f : R^k → R, then

(∂^k/∂ϑ^k) f(ϑ, ϑ²/2, ..., ϑ^k/k!) |_{ϑ=0} = Σ_{J∈J_+(k)} ( k!/(|J|! ∏J!) ) ( ∂^{|J|} f/(∂u_{J_1} ⋯ ∂u_{J_{|J|}}) ) |_{u_1 = ... = u_k = 0}.

Proof of Proposition 3.1. We start with the representation l_ϑ(Y) = log E^Y e^{l_ϑ(X,Y)}. Note that for random variables whose joint moment generating function exists in a vicinity of 0, the joint characteristic function in the definition (3) of the cumulant can be replaced by the joint moment generating function, and the factor ι^k in the denominator then also disappears. Hence

( ∂^{|J|}/(∂u_{J_1} ⋯ ∂u_{J_{|J|}}) ) log E^Y e^{Σ_i u_i W_i} |_{u_1 = ... = u_k = 0} = Γ^Y(W_{J_1}, ..., W_{J_{|J|}}) = Γ^Y(W_J).

Now apply the first part of Lemma 3.1 to obtain

l^{(k)}(Y) = (∂^k/∂ϑ^k) log E^Y exp { Σ_{i=1}^k ϑ^i l^{(i)}(X, Y)/i! } |_{ϑ=0}.

Apply the second part of the lemma to this expression, with

f(u_1, ..., u_k) = log E^Y exp { Σ_{i=1}^k u_i l^{(i)}(X, Y) },

to see that

l^{(k)}(Y) = Σ_{J∈J_+(k)} ( k!/(|J|! ∏J!) ) Γ^Y ( l^{(J)}(X, Y) ).

The product ∏J! can be taken inside Γ because of the multilinearity of the cumulant function, and the proof of the first part of the proposition is complete. Similarly,

L^{(k)}(Y) = (∂^k/∂ϑ^k) e^{l_ϑ(Y)} |_{ϑ=0} = (∂^k/∂ϑ^k) exp { Σ_{j=1}^k ϑ^j l^{(j)}(Y)/j! } |_{ϑ=0} = Σ_{J∈J_+(k)} (k!/|J|!) ∏ ( l^{(J)}(Y)/J! )

by Lemma 3.1 with f(u_1, ..., u_k) = exp{ Σ_{j=1}^k u_j l^{(j)}(Y) }.

The next lemma requires the introduction of so-called centered moments and notation for mixing. For any random variables Z_1, Z_2, ..., let χ_0(Z_1) = Z_1 and χ(Z_1) = EZ_1, and define recursively

χ_0(Z_1, ..., Z_k) = Z_1 ( χ_0(Z_2, ..., Z_k) − χ(Z_2, ..., Z_k) ),
χ(Z_1, ..., Z_k) = E χ_0(Z_1, ..., Z_k).

χ is called the centered moment function (S&S, p. 12). For example,

χ(Z_1, Z_2, Z_3) = E(Z_1 Z_2 Z_3) − E(Z_1) E(Z_2 Z_3) − E(Z_1 Z_2) E(Z_3) + E(Z_1) E(Z_2) E(Z_3). (6)

Similarly to the notation for cumulants, if Z = (Z_1, ..., Z_k) then χ(Z) = χ(Z_1, ..., Z_k), and χ^W(Z) is the centered moment of the conditional distribution of Z given W. Let Z_t = g_t(T_t) for some measurable functions g_t, where {T_t}_{t=−∞}^∞ is Markovian and obeys the following mixing condition in terms of constants ϕ_t, −∞ < t < ∞. If F_{−∞}^m is the σ-field generated by Z_t, −∞ < t ≤ m, and F_n^∞ is the σ-field generated by Z_t, n ≤ t < ∞, then for all m < n,

sup { |P(B | A) − P(B)| : A ∈ F_{−∞}^m, B ∈ F_n^∞, P(A) > 0 } ≤ ∏_{t=m+1}^n ϕ_t. (7)

Lemma 3.2. With {Z_t} as above, assume |Z_t| ≤ C_t a.s., 1 ≤ t ≤ n, and let 1 ≤ t_1 ≤ t_2 ≤ ... ≤ t_k ≤ n. Then

|χ(Z_{t_1}, ..., Z_{t_k})| ≤ 2^{k−1} ∏_{j=1}^k C_{t_j} ∏_{j=t_1+1}^{t_k} ϕ_j.
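The recursion defining χ can be run literally on a finite joint law and compared with the k = 3 expansion (6). The joint distribution below is an arbitrary toy choice of ours, used only as a mechanical check:

```python
# Centered moments chi via their defining recursion, evaluated exactly under
# a small made-up joint law of (Z1, Z2, Z3); chi0 is the random (unaveraged)
# object and chi its expectation, exactly as in the text.

law = {(1, 0, 2): 0.4, (0, 1, 1): 0.1, (2, 2, 0): 0.5}   # outcome -> probability

def E(g):
    return sum(p * g(z) for z, p in law.items())

def chi0(fs, z):
    # chi0(Z_1) = Z_1; chi0(Z_1,...,Z_k) = Z_1 (chi0(Z_2..k) - chi(Z_2..k))
    if len(fs) == 1:
        return fs[0](z)
    return fs[0](z) * (chi0(fs[1:], z) - chi(fs[1:]))

def chi(fs):
    return E(lambda z: chi0(fs, z))

Z1 = lambda z: float(z[0])
Z2 = lambda z: float(z[1])
Z3 = lambda z: float(z[2])

lhs = chi([Z1, Z2, Z3])
# Right-hand side of the displayed expansion (6):
rhs = (E(lambda z: z[0] * z[1] * z[2])
       - E(Z1) * E(lambda z: z[1] * z[2])
       - E(lambda z: z[0] * z[1]) * E(Z3)
       + E(Z1) * E(Z2) * E(Z3))
```

Unwinding the recursion by hand reproduces exactly the four terms of (6), which is what the equality of `lhs` and `rhs` confirms numerically.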

Proof. This is essentially Theorem 4.4 of S&S. The only difference is that we allow different bounds C_t on the Z_t; the validity of this extension follows easily, as multiplicative constants can be moved in and out of centered moments.

We now express the cumulants of sums in terms of centered moments, a generalization of a formula of S&S. Let W_1, W_2, ... be random vectors, i.e. W_i = (W_{i,1}, W_{i,2}, ...) etc. If J ∈ J is a set of indices, W_{i,J} denotes the vector with elements W_{i,j}, j ∈ J.

Lemma 3.3. (i) The multivariate cumulant Γ(·) can be expanded and bounded as

Γ ( Σ_{i=1}^n W_{i,J} ) = Σ_{I∈J_1^n(|J|)} Γ(W_{I,J}),

|Γ ( Σ_{i=1}^n W_{i,J} )| ≤ Σ_{i=1}^n Σ_{I∈J_1^n(|J|): min I = i} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ∏_{q=1}^ν |χ(W_{I(K_q),J(K_q)})|,

where the inner sum is over all partitions K_1, ..., K_ν of the set {1, ..., |J|}, I(K_q) = (I_{K_{q,1}}, I_{K_{q,2}}, ...) and J(K_q) is defined similarly. The M_ν are non-negative combinatorial constants satisfying, in particular, that M_ν(K_1, ..., K_ν) > 0 implies Σ_{q=1}^ν Δ(I(K_q)) ≥ Δ(I).

(ii) For all i and ρ < 1,

Σ_{I∈J_1^n(|J|): min I = i} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ρ^{Σ_q Δ(I(K_q))} ≤ ( 4/(1−ρ) )^{|J|−1} |J|!.

To clarify the notation once more, we remark that Γ(W_{I,J}) = Γ(W_{i_1,J_1}, ..., W_{i_l,J_l}), where l = |I| = |J|.

Proof. The multilinearity of the cumulant function is one of its basic properties. The bound in (i) comes from S&S Lemma 1.1, where also the property of the M_ν's is found. For part (ii), note that the proof of S&S Lemma 4.6 starts with their (4.55), which is equivalent to the expression in part (i) of the lemma. Then, in S&S's notation, we use C_1 = C_2 = u = 1, f(s, t) = ρ^{t−s} and the bound ρ^{Δ(I_p)} on |χ(W_{I_p,J_p})|; the result now follows from (4.6) in S&S.

We remark that we tacitly assume that for any cumulant Γ(W_{I,J}), the vectors I (and J) are rearranged so that the elements of I become sorted in non-decreasing order

before the cumulant is expanded into centered moments as in the above lemma. Since cumulants are invariant with respect to permutations of the random variables involved, such a rearrangement does not change the value of the cumulant, but it is necessary as we want to apply results like Lemmas 3.2 and 3.4, which do require sorted time indices.

We shall now examine the mixing condition (7) for HMMs and identify the ϕ's in this particular case. Define ρ by

1 − ρ = inf_{ϑ∈V} ( min_{i,j} α*_ϑ(i, j) / max_{i,j} α*_ϑ(i, j) )

with α*_ϑ(i, j) = (π_ϑ(j)/π_ϑ(i)) α_ϑ(j, i); note that the α*_ϑ(i, j) are the transition probabilities of the time-reversed Markov chain. Under (A1), ρ < 1. Generally, if A_ϑ is ergodic there is an m, m ≤ R, such that all m-step transition probabilities are positive. Assuming m = 1, it holds that if H_t is a set defined in terms of X_u and Y_u, t ≤ u ≤ n only, then for s < t,

max_i P_ϑ(H_t | Y_1^n, X_s = i) − min_i P_ϑ(H_t | Y_1^n, X_s = i) ≤ ρ^{t−s}. (8)

This result is proved in Douc, Moulines and Rydén (2001), Corollary 1. A simple conditioning argument then yields

max_i |P_ϑ(H_t | Y_1^n, X_s = i) − P_ϑ(H_t | Y_1^n)| ≤ ρ^{t−s} for all ϑ ∈ V, (9)

which is our particular version of (7). Note that we work with the conditional Markov chain X | Y (it is straightforward to verify that the conditional process is still Markov, although non-homogeneous), because in view of Proposition 3.1 we want to examine conditional cumulants. When Y_1^n is fixed we can identify ϕ_t in (7) with ρ in (9). Following Bickel et al. (1998), Lemma 5, one can also prove that for 1 ≤ s ≤ t − 1,

sup_{A⊂{1,...,R}} |P_ϑ(X_s ∈ A | Y_1^t) − P_ϑ(X_s ∈ A | Y_1^{t−1})| ≤ ρ^{t−1−s} for all ϑ ∈ V. (10)

In the case m > 1, an inequality similar to (8) still holds true, but with the bound on the conditional mixing now depending on the Y_t. This causes an additional degree of difficulty in our subsequent arguments and we do not treat this case. The next result now follows from the above and Lemma 3.2.

Lemma 3.4. Let K_q = (K_{q,1}, ..., K_{q,l}) be an element of a partition as in Lemma 3.3. Then

|χ_ϑ^{Y_1^n} ( h^{(J(K_q))}_{I(K_q)}(ϑ) )| ≤ 2^{l−1} ∏_{j=1}^l C_{J_{K_{q,j}}}(Y_{I_{K_{q,j}}}) ρ^{Δ(I(K_q))} for all ϑ ∈ V,

where I(K_q) = (I_{K_{q,1}}, ..., I_{K_{q,l}}) etc. and C_k(y) is defined in (2).
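The time-reversal identity α*_ϑ(i, j) = π_ϑ(j) α_ϑ(j, i)/π_ϑ(i) entering the definition of ρ is easy to sanity-check on a toy two-state chain (numbers ours): the reversed matrix is again stochastic and leaves the same stationary law invariant.

```python
# Time reversal of a two-state chain: rev[i][j] = pi[j] * alpha[j][i] / pi[i].
# alpha and pi below are an invented example; pi is stationary for alpha,
# i.e. sum_i pi[i] * alpha[i][j] = pi[j].

alpha = [[0.7, 0.3],
         [0.4, 0.6]]
pi = [4.0 / 7.0, 3.0 / 7.0]

rev = [[pi[j] * alpha[j][i] / pi[i] for j in range(2)] for i in range(2)]

row_sums = [sum(row) for row in rev]   # both should equal 1
```

That the rows of `rev` sum to one is precisely the stationarity of π: Σ_j π(j)α(j, i)/π(i) = (πA)_i/π(i) = 1.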

Proof of Theorem 2.1. First note that we may replace ϑ_0 by any ϑ in Proposition 3.1 without changing the definition of h_t, as only derivatives of this function appear in the proposition. Hence

|l_ϑ^{(k)}(Y_1^n)| ≤ k! Σ_{J∈J_+(k)} (1/|J|!) |Γ_ϑ^{Y_1^n} ( Σ_{i=1}^n h_i^{(J)}(ϑ)/J! )| for all ϑ ∈ V.

Expand Γ_ϑ^{Y_1^n}(Σ_{i=1}^n h_i^{(J)}(ϑ)/J!) as in Lemma 3.3(i). We can employ Lemma 3.4 to obtain the bound

∏_{q=1}^ν |χ_ϑ^{Y_1^n} ( h^{(J(K_q))}_{I(K_q)}(ϑ) )| ≤ ∏_{q=1}^ν 2^{|K_q|−1} ∏_{j=1}^{|K_q|} C_{J_{K_{q,j}}}(Y_{I_{K_{q,j}}}) ρ^{Δ(I(K_q))} = 2^{|J|−ν} ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) ρ^{Σ_q Δ(I(K_q))} for all ϑ ∈ V. (11)

Taking the expectation of this bound and letting i_1 < i_2 < ... < i_r denote the distinct points of the vector I, the structure of an HMM yields

E sup_{ϑ∈V} ∏_{q=1}^ν |χ_ϑ^{Y_1^n} ( h^{(J(K_q))}_{I(K_q)}(ϑ) )|
≤ 2^{|J|−ν} E ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) ρ^{Σ_q Δ(I(K_q))}
= 2^{|J|−ν} E ∏_{l=1}^r E ( ∏_{j: I_j = i_l} C_{J_j}(Y_{i_l}) | X_1^n ) ρ^{Σ_q Δ(I(K_q))}
= 2^{|J|−ν} E ∏_{l=1}^r E ( ∏_{j: I_j = i_l} C_{J_j}(Y_{i_l}) | X_{i_l} ) ρ^{Σ_q Δ(I(K_q))}
≤ 2^{|J|−1} ∏_{l=1}^r max_x E ( ∏_{j: I_j = i_l} C_{J_j}(Y_{i_l}) | X_{i_l} = x ) ρ^{Σ_q Δ(I(K_q))}. (12)

Multiplying by 1/∏J! and using Lemma 3.3(ii), we obtain

E sup_{ϑ∈V} |Γ_ϑ^{Y_1^n} ( Σ_{i=1}^n h_i^{(J)}(ϑ)/J! )| ≤ n B_k ( 8/(1−ρ) )^{|J|−1} |J|!.

Hence

E sup_{ϑ∈V} |l_ϑ^{(k)}(Y_1^n)| ≤ n k! B_k Σ_{J∈J_+(k)} ( 8/(1−ρ) )^{|J|−1} ≤ n k! B_k ( 1 + 8/(1−ρ) )^{k−1},

where the last inequality is from Lemma 4.1(ii) in the Appendix.

The next lemma is needed for the proof of Theorem 2.2.

Lemma 3.5. Let a ≤ a′ ≤ b′ ≤ b and suppose that W is measurable w.r.t. the sigma-field generated by Y_{a′}^{b′}. Then

E { W L^{(m)}(Y_a^b) } = E { W L^{(m)}(Y_{a′}^{b′}) }, m = 0, 1, 2, ...

The proof is given in the Appendix.

Proof of Theorem 2.2. In this proof we again use the notation of Lemma 3.3, and drop the argument ϑ_0 of the function h, as this parameter stays fixed throughout the proof. Moreover, since this lemma and other ones are formulated in terms of positive time indices, we shift the indices of the statement of the theorem and set out to prove

E |l^{(k)}(Y_n | Y_1^{n−1})| ≤ C_1 B_k C_2^k k!.

By Proposition 3.1 and multilinearity of the cumulant function,

l^{(k)}(Y_n | Y_1^{n−1}) = l^{(k)}(Y_1^n) − l^{(k)}(Y_1^{n−1})
= Σ_{J∈J_+(k)} (k!/|J|!) { Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J)}/J! ) − Γ^{Y_1^{n−1}} ( Σ_{i=1}^{n−1} h_i^{(J)}/J! ) }
+ Σ_{J∈J_+(k)} (k!/|J|!) Σ_{(J′,J″)} Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J′)}/J′!, h_n^{(J″)}/J″! ), (13)

where the last sum is over all partitions (J′, J″) of the set J except (J′, J″) = (J, ∅). This partition is excluded since it corresponds to the first sum on the right hand side, which will be compared to the second one. Clearly, there are two types of cumulants here. The first two ones are similar and their difference will be shown to remain bounded in expectation as n → ∞.

The last sum involves cumulants that contain at least one h_n, and this is sufficient to keep them bounded as n → ∞. We start by considering

γ = ∏_{q=1}^ν χ^{Y_1^n} ( h^{(J(K_q))}_{I(K_q)} ) − ∏_{q=1}^ν χ^{Y_1^{n−1}} ( h^{(J(K_q))}_{I(K_q)} ),

where we assume that Σ_{q=1}^ν Δ(I(K_q)) ≥ Δ(I). This difference can be bounded in two ways. First, each term of the difference can be bounded separately. Arguing as for (11), we obtain

|γ| ≤ ∏_{q=1}^ν |χ^{Y_1^n} ( h^{(J(K_q))}_{I(K_q)} )| + ∏_{q=1}^ν |χ^{Y_1^{n−1}} ( h^{(J(K_q))}_{I(K_q)} )| ≤ 2 · 2^{|J|−ν} ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) ρ^{Σ_q Δ(I(K_q))}. (14)

Secondly, we can write

γ = ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) { ∏_{q=1}^ν χ^{Y_1^n} ( W_{I(K_q),J(K_q)} ) − ∏_{q=1}^ν χ^{Y_1^{n−1}} ( W_{I(K_q),J(K_q)} ) }, (15)

where W_{ij} = h_i^{(j)}/C_j(Y_i). Consider the scheme

∏_{i=1}^m B_i − ∏_{i=1}^m A_i = Σ_{j=1}^m ( ∏_{i=1}^{j−1} B_i ) (B_j − A_j) ( ∏_{i=j+1}^m A_i ). (16)

We expand χ(W_{I(K_q),J(K_q)}) into a sum of 2^{|K_q|−1} products of expected values of products of W_{ij}'s, cf. (6), with each random factor being bounded by one. Pick one of the terms in this sum. This term is thus a product of no more than |K_q| factors (with each factor being a conditional expectation of a product of W_{ij}'s). We call these factors A_i and B_i, respectively, when the expectation is conditional on Y_1^n and Y_1^{n−1}, respectively. Using (10) we find that for each factor, the difference between its conditional expectations under Y_1^n and Y_1^{n−1}, respectively, is bounded by ρ^{n−1−max I(K_q)} ≤ ρ^{n−1−max I}, that is, |A_i − B_i|

is bounded by this expression. Employing (16) for the product, we can bound the product difference by |K_q| ρ^{n−1−max I}. Using this bound for each term, we find

|χ^{Y_1^n} ( W_{I(K_q),J(K_q)} ) − χ^{Y_1^{n−1}} ( W_{I(K_q),J(K_q)} )| ≤ 2^{|K_q|−1} |K_q| ρ^{n−1−max I},

where 2^{|K_q|−1} is the total number of terms in the sum and |K_q| upper bounds the number of factors in each term. In addition, just as Lemma 3.4 follows from Lemma 3.2, we obtain

|χ^{Y_1^n} ( W_{I(K_q),J(K_q)} )| ≤ 2^{|K_q|−1} ρ^{Δ(I(K_q))}

and similarly for Y_1^{n−1}. Hence, by applying (16) to (15),

|γ| ≤ ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) Σ_{p=1}^ν ( ∏_{q≠p} 2^{|K_q|−1} ρ^{Δ(I(K_q))} ) 2^{|K_p|−1} |K_p| ρ^{n−1−max I} ≤ ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) |J| 2^{|J|−1} ρ^{n−1−max I}. (17)

We can combine these two bounds, (14) and (17), by taking a geometric mean:

|γ| ≤ 2^{|J|−1} · 2^{1/2} |J|^{1/2} ∏_{j=1}^{|J|} C_{J_j}(Y_{I_j}) ρ^{Σ_q Δ(I(K_q))/2 + (n−1−max I)/2}. (18)

As in the proof of Theorem 2.1, it follows that

E |Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J)}/J! ) − Γ^{Y_1^{n−1}} ( Σ_{i=1}^{n−1} h_i^{(J)}/J! )| ≤ 2^{1/2} |J|^{1/2} B_k |J|! Σ_{i=1}^{n−1} ρ^{(n−1−i)/2} ( 8/(1−ρ^{1/2}) )^{|J|−1} ≤ 2^{1/2} (1−ρ^{1/2})^{−1} B_k |J|^{1/2} |J|! ( 8/(1−ρ^{1/2}) )^{|J|−1}. (19)

We now proceed to bounding the second type of cumulants appearing in (13). Let (J′, J″) be a partition of some J ∈ J_+(k) with J″ ≠ ∅. We can expand the cumulant similarly to Lemma 3.3(i) to obtain

|Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J′)}, h_n^{(J″)} )| ≤ Σ_{I∈J_1^{n−1}(|J′|)×{n}^{|J″|}} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ∏_{q=1}^ν |χ^{Y_1^n} ( h^{((J′,J″)(K_q))}_{I(K_q)} )|.

Taking expectations, applying the bound (12) and multiplying by 1/(∏J′! ∏J″!) yields

E |Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J′)}/J′!, h_n^{(J″)}/J″! )| ≤ 2^{|J|−1} B_k Σ_{I∈J_1^{n−1}(|J′|)×{n}^{|J″|}} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ρ^{Σ_q Δ(I(K_q))}.

Fix a partition (J′, J″) of J such that J″ ≠ ∅ and an I ∈ J_1^{n−1}(|J′|)×{n}^{|J″|}. We need to look closer at the combinatorial constants M_ν(K_1, ..., K_ν). If the partition (K_1, ..., K_ν) is such that there is no K_q with I(K_q) containing an element less than n as well as an element n, then M_ν(K_1, ..., K_ν) = 0. This is because in the graph for M_ν(K_1, ..., K_ν) (see S&S, pp. 8) there can be no edge over the vertex corresponding to the first occurrence of n in I. Hence, we may disregard partitions of this kind. Now consider a partition (K_1, ..., K_ν) with at least one I(K_q) containing an element less than n and an element n, and let max′ I denote the second largest element of the vector I, not counting multiple n's. We can form a new vector I′ from I by replacing all elements of I being equal to n by max′ I + 1. Then

Σ_{q=1}^ν Δ(I′(K_q)) ≥ Σ_{q=1}^ν Δ(I(K_q)) − (n − max′ I − 1).

This vector I′ is not a member of J_1^{n−1}(|J′|)×{n}^{|J″|} (unless max′ I = n − 1), but does belong to J_1^{max′ I}(|J′|)×{max′ I + 1}^{|J″|}; indeed, there is a one-to-one correspondence between vectors I and I′ with these characteristics. Therefore

Σ_{(J′,J″): J″≠∅} Σ_{I∈J_1^{n−1}(|J′|)×{n}^{|J″|}} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ρ^{Σ_q Δ(I(K_q))} ≤ Σ_{i=1}^{n−1} ρ^{n−i−1} Σ_{(J′,J″): J″≠∅} Σ_{I∈J_1^i(|J′|)×{i+1}^{|J″|}} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ρ^{Σ_q Δ(I(K_q))}.

For a fixed dimension |J| the summation above is done over I ∈ J_1^i(|J′|)×{i+1}^{|J″|}, a subset of J_1^n(|J|) characterized as vectors having exactly |J″| elements of maximal size i + 1, all being located at the end of the vector. In J_1^n(|J|) there are more elements having exactly |J″| elements of maximal size i + 1, disregarding their location. This number is in exact correspondence with the number of partitions (J′, J″) of J with |J″| as prescribed; their common value is the combinatorial constant C(|J|, |J″|). Hence the

above expression is bounded by

Σ_{i=1}^{n−1} ρ^{n−i−1} Σ_{I∈J_1^n(|J|): max I = i+1} Σ_{ν=1}^{|J|} Σ_{∪K_q={1,...,|J|}} M_ν(K_1, ..., K_ν) ρ^{Σ_q Δ(I(K_q))} ≤ Σ_{i=1}^{n−1} ρ^{n−i−1} ( 4/(1−ρ) )^{|J|−1} |J|! ≤ (1/(1−ρ)) ( 4/(1−ρ) )^{|J|−1} |J|!,

where the second last inequality is Lemma 3.3(ii); by symmetry, the bound is still valid when min I = i is replaced by max I = i. We note that this bound does in fact also take the partition (J′, J″) = (∅, J), which was not considered above, into account; it corresponds to I = {n}^{|J|}. Thus

E |Γ^{Y_1^n} ( Σ_{i=1}^{n−1} h_i^{(J′)}/J′!, h_n^{(J″)}/J″! )| ≤ (1/(1−ρ)) ( 8/(1−ρ) )^{|J|−1} B_k |J|!. (20)

Adding (19) and (20) as in (13) we find, using Lemma 4.1(ii) in the Appendix, that

E |l^{(k)}(Y_n | Y_1^{n−1})| ≤ C′ B_k k! Σ_{J∈J_+(k)} |J|^{1/2} C^{|J|−1} ≤ C′ B_k (1 + C)^{k−1} k!

for some C > 8/(1−ρ^{1/2}) and C′ = 2^{1/2}/(1−ρ^{1/2}) + 1/(1−ρ).

Here is the proof of part (ii). Since D^k P(X_1 = x | Y_{−j}^0), j = t, ..., n, can be expressed as polynomials in P(X_1 = x | Y_{−j}^0) and D^r log P(X_1 = x | Y_{−j}^0), 1 ≤ r ≤ k, it is enough to establish bounds for these quantities. The claim for P(X_1 = x | Y_{−j}^0) is essentially part 3 of Lemma 5 of Bickel et al. (1998). In general, note that, since

P_ϑ(X_1 = x | Y_{−j}^0) = E { I(X_1 = x) L_ϑ(X_{−j}^1, Y_{−j}^0)/L_ϑ(Y_{−j}^0) | Y_{−j}^0 } = P_0(X_1 = x | Y_{−j}^0) E { L_ϑ(X_{−j}^1, Y_{−j}^0)/L_ϑ(Y_{−j}^0) | Y_{−j}^0, X_1 = x },

it holds that

D^r log P(X_1 = x | Y_{−j}^0) = Σ_{J∈J_+(r)} (r!/|J|!) Γ^{Y_{−j}^0, X_1 = x} ( ( l^{(J)}(X_{−j}^1, Y_{−j}^0) − l^{(J)}(Y_{−j}^0) )/J! ).

One can argue as for part (i) of the theorem, with the critical step being a bound on

γ = ∏_{q=1}^ν χ^{Y_{−j}^0, X_1 = x} ( W_{I(K_q),J(K_q)} ) − ∏_{q=1}^ν χ^{Y_{−j+1}^0, X_1 = x} ( W_{I(K_q),J(K_q)} ).

The argument follows the route in going from (15) to (18).

We now prove part (iii) of Theorem 2.2. We start by establishing a bound on derivatives of the likelihood. Note that by Proposition 3.1, Theorem 2.1 and Lemma 4.1,

E |L^{(k)}(Y_1^n)| ≤ C_1 n B_k k! Σ_{J∈J_+(k)} C_2^{|J|} ≤ C_1 n k! B_k C_2^k 2^k (21)

for some constants C_1 and C_2. Let

Λ_i^j = ∏_{q=1}^ν χ^{Y_i^j} ( h^{(J(K_q))}_{I(K_q)}/J(K_q)! ).

Then

E { γ L^{(m)}(Y_{−n}^1) } ≤ E { |Λ^1_{min I} − Λ^0_{min I}| |L^{(m)}(Y_{−n}^1)| } + Σ_{i=0}^{n+min I} E { |Λ^1_{min I − i − 1} − Λ^0_{min I − i − 1} − Λ^1_{min I − i} + Λ^0_{min I − i}| |L^{(m)}(Y_{−n}^1)| } = γ_1 + γ_2, say.

We now bound each of the terms. First,

γ_1 ≤ C_1^m C_2^{|J|} ( |min I| m )^m ρ^{Σ_q Δ(I(K_q))/2 + |min I|/2}

by Lemma 3.5, (21), and (18). But ( |min I| m )^m ρ^{|min I|/4} ≤ C_3^m m! for some C_3 > 0, whence

γ_1 ≤ C_1^m C_2^{|J|} m! ρ^{|min I|/4 + Σ_q Δ(I(K_q))/2}.

Similarly, by considering the appropriate differences in the expression for γ_2 depending on whether i > Δ(I) or vice versa,

γ_2 ≤ C_1^m C_2^{|J|} ( ( |min I| + i ) m )^m ρ^{Σ_q Δ(I(K_q))/2 + (i ∨ |min I|)/2} ≤ C_3^m C_2^{|J|} m! ρ^{Σ_q Δ(I(K_q))/2 + (i + |min I|)/4} ≤ C_4^m C_2^{|J|} m! ρ^{Σ_q Δ(I(K_q))/3 + |min I|/12}.

Having the bound on E{γ L^{(m)}} = γ_1 + γ_2, similar to the bound (18) on γ (except for the factor C^m m!), we continue as in the first part of the proof to prove the theorem. This completes the proof of Theorem 2.2.

Proof of Theorem 2.3. We proceed under the given assumptions. The first part of Theorem 2.3 follows readily from part (ii) of Theorem 2.2, since

D^a l_0(Y_1 | Y_{−n}^0) = D^a log ( Σ_x P_ϑ(X_1 = x | Y_{−n}^0) g_ϑ(Y_1 | x) ) |_{ϑ=ϑ_0}. (22)

For the second part note that

D^a l_0(Y_1^n) = Σ_{i=1}^n D^a l_0(Y_i | Y_1^{i−1}) = Σ_{i=(log n)^2+1}^n D^a l_0(Y_i | Y_{i−(log n)^2}^{i−1}) + o_p(n^{1/2}), (23)

by (22) and part (ii) of Theorem 2.2. Now, under our assumptions the variables in the sum in (23) are uniformly bounded and geometrically mixing, since the ith one is a function of U_{i,n} = (Y_{i−(log n)^2}, ..., Y_i), and the {U_{i,n}} are, uniformly in n, geometrically ϕ-mixing. Then asymptotic normality, with natural centering by means and scaling by standard deviations, follows by the obvious extension to triangular arrays of the classical theorem of Ibragimov; see Doukhan (1994, p. 47) for instance. That the means and variances converge to the limit postulated is again an exercise in applying part (ii) of Theorem 2.2.

4 Applications

4.1 A start at higher order asymptotics

It is well known, see for example Barndorff-Nielsen and Cox (1989), that in the i.i.d. case it is possible, under suitable smoothness and moment conditions, to debias the MLE ϑ̂ to first order, that is, to construct b̂(·) such that E_ϑ(ϑ̂ + n^{−1} b̂(ϑ̂)) = ϑ + O(n^{−3/2}) and b̂ → b in probability, uniformly in ϑ, for a fixed continuous b. With further conditions, O(n^{−3/2}) can be turned into O(n^{−2}).
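The first-order mechanics behind such corrections are easiest to see in a model where everything is closed-form. The Bernoulli(ϑ) i.i.d. sketch below is entirely our own toy example (not the HMM case): it shows the Newton step underlying the Taylor argument, and that the leading term Dl/(nI) already recovers the MLE exactly in this particular model.

```python
# Toy i.i.d. Bernoulli(theta) model with fixed data: the MLE is the sample
# mean, and we compare it with (i) one Newton step from a nearby theta0 and
# (ii) the leading term theta0 + Dl(theta0)/(n * I(theta0)) of the expansion.

ys = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # made-up data; MLE is the mean
n = len(ys)
s = sum(ys)

def Dl(t):   # first derivative of the loglikelihood
    return s / t - (n - s) / (1 - t)

def D2l(t):  # second derivative of the loglikelihood
    return -s / t**2 - (n - s) / (1 - t)**2

theta0 = 0.65                                   # expansion point
mle = s / n                                     # exact MLE (= 0.7 here)
one_step = theta0 - Dl(theta0) / D2l(theta0)    # Newton step from theta0
fisher = 1.0 / (theta0 * (1 - theta0))          # per-observation information
leading = theta0 + Dl(theta0) / (n * fisher)    # leading term of the expansion
```

Algebraically, Dl(ϑ)/(nI(ϑ)) = (s − nϑ)/n in this model, so `leading` equals the sample mean for any expansion point; the Newton step agrees only up to the higher-order terms the expansion keeps track of.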

Other second order asymptotic results of interest are Pfanzagl's second order optimality of functions of the MLE within classes of estimates with the same bias function (see for example Bickel, Götze and van Zwet, 1985), the validity of Bartlett's correction to the likelihood ratio test (see for example Bickel and Ghosh, 1990), the second order validity of bootstrap t-tests (see for example Hall, 1988), etc. The basic ingredients of debiasing are:

(a) A stochastic expansion for the MLE in terms of polynomials of the derivatives of l_ϑ(Y_1^n).

(b) Bounds on probabilities of intermediate and large deviations of the derivatives of l_ϑ(Y_1^n).

For the other types of results one further needs:

(c) Edgeworth expansions for the joint distribution of the first few derivatives of l_ϑ(Y_1^n) at ϑ_0.

As we shall see, our bounds give (a) directly. We conjecture that (b) can be established using results for sums of functions of Markov variables as in S&S. Results of type (c) under simple assumptions, although plausible, appear difficult to attain.

Here is the argument for (a) under (A1), (A2), and (A3) and real ϑ. Write ϑ̂ for the MLE. Then, by a Taylor expansion,

n^{−1/2} Dl_0(Y_1^n) = −n^{1/2} (ϑ̂ − ϑ_0) n^{−1} D²l_0(Y_1^n) − (1/2) n^{−1/2} n (ϑ̂ − ϑ_0)² n^{−1} D³l_{ϑ̄}(Y_1^n),

where ϑ̄ lies between ϑ̂ and ϑ_0. Suppose for simplicity that all entries of A are positive and that the derivatives of log g_ϑ are uniformly bounded in a neighborhood of ϑ_0. Then, by Theorem 2.1, under ϑ_0,

n^{1/2} (ϑ̂ − ϑ_0) = − ( n^{−1/2} Dl_0(Y_1^n) ) / ( n^{−1} D²l_0(Y_1^n) ) − (1/2) ( n^{−1/2} Dl_0(Y_1^n) / ( n^{−1} D²l_0(Y_1^n) ) )² n^{−1/2} ( n^{−1} D³l_0(Y_1^n) ) / ( n^{−1} D²l_0(Y_1^n) ) + O_p(n^{−1}).

But

n^{−1} D²l_0(Y_1^n) = −I(ϑ_0) + O_p(n^{−1/2})

by Theorem 2.3 (which can be viewed as a refinement of Lemma 2 of Bickel et al., 1998). Here I(ϑ) = −E_ϑ ( D² log p_ϑ(Y_1 | Y_{−∞}^0) ).

Finally we get

    n^{1/2}(ϑ̂ - ϑ_0) = n^{-1/2} Dl_{ϑ_0}(Y_1^n) I(ϑ_0)^{-1} + n^{-1/2} Dl_{ϑ_0}(Y_1^n) I(ϑ_0)^{-2} (n^{-1} D^2 l_{ϑ_0}(Y_1^n) + I(ϑ_0)) + 2^{-1} n^{-5/2} (Dl_{ϑ_0}(Y_1^n))^2 I(ϑ_0)^{-3} D^3 l_{ϑ_0}(Y_1^n) + O_p(n^{-1}),

the desired stochastic expansion.

4.2 Asymptotic expansions

The following result generalizes Theorem 3.18 of Petrie (1969). For simplicity we assume that the parameter is real.

Theorem 4.1. Assume that (A1)-(A3) hold. Then I(ϑ) and K(ϑ) are analytic functions. For instance,

    K(ϑ) = Σ_{j=0}^∞ D^j K(ϑ_0) (ϑ - ϑ_0)^j / j!

in some neighborhood of every ϑ_0 ∈ Θ. The series converges absolutely in some neighborhood of ϑ_0, since for every ϑ_0, |D^j K(ϑ_0)| ≤ C(ϑ_0)^j j!. Moreover,

    D^1 K(ϑ_0) = E Dl_{ϑ_0}(Y_1 | Y_{-∞}^0) = 0,
    D^2 K(ϑ_0) = -I(ϑ_0),
    D^{j+2} K(ϑ_0) = lim_n n^{-1} I_{n2}^{(j)}(ϑ_0),

where

    I_{nd}^{(j)}(ϑ_0) = (∂^j/∂ϑ^j) E_{ϑ_0}[l_ϑ^{(d)}(Y_1^n) L_ϑ(Y_1^n)] |_{ϑ=ϑ_0}.

The limit I_d^{(j)}(ϑ_0) exists under our assumptions and can be represented by

    Σ_{k=0}^j (j choose k) Σ_{J ∈ J^+(k+d)} ((k+d)!/J!) Σ_{I ∈ J_1(|J|), min(I)=1} E{Γ^{Y_1}(L_I^{(J)}/J!, L^{(j-k)}(Y_{max(I)-1}))},    (24)

where L_I = L(Y_I).

Proof. Note that I(ϑ) = -D^2 K(ϑ), so that it is enough to establish the claim for K(ϑ). Since

    I_{nd} = Σ_{i=1}^n E{l_ϑ^{(d)}(Y_i | Y_1^{i-1}) L_ϑ(Y_1^n)},

we obtain

    n^{-1} I_{nd}^{(j)}(ϑ_0) = (1/n) Σ_{i=1}^n Σ_{k=0}^j (j choose k) E{l^{(k+d)}(Y_i | Y_1^{i-1}) L^{(j-k)}(Y_1^n)},

and the bound follows from Theorem 2.2. The limit is clearly given by the similarly bounded derivative of the expression

    lim_n E{l^{(d)}(Y_1 | Y_{-n}^0) L(Y_{-n}^0)}.

We need the limit in this expression since L(Y_{-∞}^0) is not defined. The other representation follows by expanding the derivatives as in Proposition 3.1, expanding the cumulant function as in Lemma 3.3, and using this lemma and Lemma 3.5 to argue that the limit exists:

    lim_n n^{-1} I_{nd}^{(j)}(ϑ_0) = Σ_{k=0}^j (j choose k) Σ_{J ∈ J^+(k+d)} ((k+d)!/J!) Σ_{I ∈ J_1(|J|), min(I)=1} E{Γ^{Y_1}(L_I^{(J)}/J!, L^{(j-k)}(Y_{max(I)-1}))}.

Corollary 4.1. If under ϑ_0 the Y_i are i.i.d., then the sum in (24) becomes in principle computable, as

    E Γ^{Y_1}(L_I^{(J)}/J!, L^{(j-k)}(Y_{max(I)-1})) = 0

unless I_1 = 1 and I_{i+1} - I_i = 0 or 1, i = 1, ..., |J| - 1.

Proof. Unless the conditions above are satisfied, Γ^{Y_1}(L_I^{(J)}/J!, ·) = 0, since the indices involved could be split into two blocks of the form {i_1, ..., i_k} and {i_{k+1}, ..., i_{max(I)}} with i_{k+1} - i_k > 1.

Since all variables in L_I^{(J)} are at most 2-dependent, the conditional cumulant would have to vanish, because the variables involved could be split into independent blocks. We give some explicit computations for a special case below. We note that, unfortunately, the number of non-zero terms in Γ^{Y_1}(L_I^{(J)}/J!) grows exponentially as a function of |J|.

Corollary 4.2. If E|log p_{ϑ_0}(Y_1)| < ∞ and the conditions of Theorem 2.2 hold, then H(ϑ) is analytic.

Proof. H(ϑ) = K(ϑ) - E_{ϑ_0} log p_{ϑ_0}(Y_1 | Y_{-∞}^0) in this case.

4.3 Example: Information under independence and a two-state Markov chain with Gaussian observations

We consider now a reversible two-state Markov chain with normal observations. Let X_i ∈ {-1, 1}, P(X_{i+1} ≠ X_i | X_i) = p and Y_i = X_i + ε_i, where ..., ε_0, ε_1, ... are i.i.d. N(0, σ^2) random variables, σ^2 known, independent of the X process. We identify the parameter p with the ϑ of the general discussion and take ϑ_0 = 1/2.

One can derive the information from Proposition 3.1. Alternatively, one can compute directly, cf. Louis (1982) and Meilijson (1989):

    l^{(1)}_ϑ(Y_1^n) = E(l^{(1)}_ϑ(X_1^n, Y_1^n) | Y_1^n) - E(l^{(1)}_ϑ(X_1^n | Y_1^n) | Y_1^n),
    l^{(2)}_ϑ(Y_1^n) = E(l^{(2)}_ϑ(X_1^n, Y_1^n) | Y_1^n) - E(l^{(2)}_ϑ(X_1^n | Y_1^n) | Y_1^n)
                     = E(l^{(2)}_ϑ(X_1^n, Y_1^n) | Y_1^n) + var(l^{(1)}_ϑ(X_1^n | Y_1^n) | Y_1^n),
    E l^{(2)}_ϑ(Y_1^n) = E l^{(2)}_ϑ(X_1^n, Y_1^n) + E var(l^{(1)}_ϑ(X_1^n | Y_1^n) | Y_1^n).

However, if X_1, X_2, ... are i.i.d. under ϑ_0, then we can simplify this expression. Write l(X_1^n, Y_1^n) = Σ_t h_{ϑ,t}, where h_{ϑ,t} = h_{ϑ,t}(X_{t-1}, X_t, Y_t), and note that, under independence, ḣ_{ϑ,i} and ḣ_{ϑ,j}, |j - i| > 1, are independent given Y_1^n. Moreover,

    E cov(ḣ_{ϑ,2}, ḣ_{ϑ,3} | Y_1^n) = E E(ḣ_{ϑ,2} ḣ_{ϑ,3} | Y_1, Y_2, Y_3) - E(E(ḣ_{ϑ,2} | Y_1, Y_2) E(ḣ_{ϑ,3} | Y_2, Y_3)) = 0.

Hence

    I(ϑ_0) = -E E(ḧ_{ϑ,2} | Y_1, Y_2) - E var(ḣ_{ϑ,2} | Y_1, Y_2).

We specialize to our model, where

    h_{p,2} = ((1 - X_1 X_2)/2) log p + ((1 + X_1 X_2)/2) log(1 - p) + c

for some c. Here

    I(1/2) = 4(1 - E var(X_1 X_2 | Y_1, Y_2)) = 4 (E[(E(X_1 | Y_1))^2])^2,    (d/dp) I(p) |_{p=1/2} = 0.
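The value of I(1/2) can be checked by simulation. The sketch below is our illustration, not the paper's: it estimates the information at p = 1/2 from the variance of a finite-difference score computed with the forward algorithm, and compares it with the reading I(1/2) = 4(E[(E(X_1|Y_1))^2])^2 of the display above, using the posterior mean E(X_1|Y_1) = tanh(Y_1/σ^2). The parameter values, σ = 1 and the finite-difference step are assumptions of the sketch.

```python
import numpy as np

def loglik(Y, p, sigma=1.0):
    """Forward-algorithm log-likelihood, vectorized over the rows of Y.

    Symmetric two-state chain X_i in {-1, +1} with switching probability p
    and Y_i = X_i + N(0, sigma^2); constants of the Gaussian density are
    dropped since they do not depend on p."""
    states = np.array([-1.0, 1.0])
    T = np.array([[1 - p, p], [p, 1 - p]])
    alpha = np.full((Y.shape[0], 2), 0.5)      # stationary initial law
    ll = np.zeros(Y.shape[0])
    for t in range(Y.shape[1]):
        dens = np.exp(-0.5 * ((Y[:, t:t + 1] - states) / sigma) ** 2)
        alpha = alpha * dens
        c = alpha.sum(axis=1)
        ll += np.log(c)
        alpha = (alpha / c[:, None]) @ T       # one-step prediction
    return ll

rng = np.random.default_rng(1)
sigma, n, reps, eps = 1.0, 200, 4000, 1e-4
X = rng.choice([-1.0, 1.0], size=(reps, n))    # at p = 1/2 the X_i are i.i.d.
Y = X + rng.normal(0.0, sigma, size=(reps, n))
score = (loglik(Y, 0.5 + eps) - loglik(Y, 0.5 - eps)) / (2 * eps)
info_mc = score.var() / n

# closed-form reading: I(1/2) = 4 (E[(E(X1|Y1))^2])^2, E(X1|Y1) = tanh(Y1/sigma^2)
y1 = rng.choice([-1.0, 1.0], size=10**6) + rng.normal(0.0, sigma, size=10**6)
info_formula = 4.0 * np.mean(np.tanh(y1 / sigma**2) ** 2) ** 2
print(info_mc, info_formula)
```

The two estimates agree to within Monte Carlo error, which is consistent with the independence computation above.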

It is reasonable to conjecture the following result.

Theorem 4.2. Consider the two-state symmetric Markov chain with normal observations as above. Then I(p) has the following properties: (i) Symmetry: I(p) = I(1 - p); (ii) Unimodality, with minimum at 1/2; (iii) Unboundedness: 0 < lim inf_{p→0} p I(p) ≤ lim sup_{p→0} p I(p) ≤ 1.

The proof of this result cannot depend on the expansion. Here is an argument.

Proof. The first two properties will be proved by showing that for any p* between p and 1 - p there is a Markov kernel that does not depend on the unknown parameter p and transforms the observations Y_1, Y_2, ... into another sequence of variables Y*_1, Y*_2, ..., such that the latter follows the same model as the original observations but with parameter p*. This shows that I(p*) ≤ I(p) for any such p*, and in particular I(p) = I(1 - p). Let S_1, S_2, ... be i.i.d. Bernoulli random variables with mean α, independent of the Y process, and define

    Y*_i = (-1)^{Σ_{j=1}^i S_j} Y_i = (-1)^{Σ_{j=1}^i S_j} (X_i + ε_i) = X*_i + ε*_i.

Now, the ε*_i are still i.i.d. Gaussian, and X*_1, X*_2, ... is still Markovian, with values in {-1, 1}, but with probability of switching given by p* = (1 - α)p + α(1 - p).

We now prove the third property. We will argue that for any p there is an estimator of p, valid for values of the parameter in a small neighborhood of p, whose asymptotic variance, suitably normalized, remains controlled as p → 0. Since the information at p is larger than the inverse of the variance of any regular estimator, this yields the stated lower bound on p I(p).

Here are the details. We consider a net of models indexed by p ∈ (0, 1). The parameter space of the p-model is (p - p^2, p + p^2). We consider the limit as p → 0. Let m = m(p) be such that pm → 1/2. For a given p: it is easy to see, for example by induction, that

    P(X_i = 1 | X_1 = 1) = (1 + (1 - 2p)^{i-1})/2,

so that E(X_i | X_1 = 1) = (1 - 2p)^{i-1}. Hence

    μ_m(p) = E(m^{-1} Σ_{i=1}^m X_i | X_1 = 1) = m^{-1} Σ_{i=1}^m (1 - 2p)^{i-1} = (1 - (1 - 2p)^m)/(2pm) → 1 - e^{-1}.

Note that

    v_m(p) = E(m^{-1} Σ_{i=1}^m X_i)^2
           = 1/m + (2/m^2) Σ_{i=1}^{m-1} Σ_{j=i+1}^m E(X_i X_j)
           = 1/m + (2/m^2) Σ_{i=1}^{m-1} Σ_{j=i+1}^m (1 - 2p)^{j-i}
           = 1/m + ((m - 1)(1 - 2p))/(pm^2) - ((1 - 2p)^2 (1 - (1 - 2p)^{m-1}))/(2p^2 m^2)
           → 2e^{-1}.

Let 1 < T_1 < ... ≤ T_M ≤ m be the times of switches, that is, X_{T_i - 1} ≠ X_{T_i} (with M random). Then the process T_1/m, T_2/m, ..., T_M/m converges weakly to a Poisson process on (0, 1), and hence the limiting distribution of (m^{-1} Σ_{i=1}^m X_i)^2 is non-degenerate. Since the limiting distribution of (m^{-1} Σ_{i=1}^m ε_i)^2 is degenerate, the limiting distribution of (m^{-1} Σ_{i=1}^m Y_i)^2 is non-degenerate and equals the limiting distribution of (m^{-1} Σ_{i=1}^m X_i)^2.

Consider now the statistic

    v̂_m = (m/n) Σ_{j=0}^{[n/m]-1} (m^{-1} Σ_{i=1}^m Y_{jm+i})^2.

Then

    lim_{p→0} lim_{n→∞} (n/m) var(v̂_m) = c_v < ∞.

Suppose it is known that the parameter lies in (p - p^2, p + p^2), and let the estimator p̂ be defined by

    v_m(p̂) = v̂_m.

Differentiating the expression for v_m(p) above, it follows that

    p (d/dp) v_m(p) → -6 + 6e^{-1},

where the limit is taken as above. Hence

    lim inf_{p→0} p I(p) ≥ lim inf_{p→0} p (n var(p̂))^{-1} = 72 c_v^{-1} (1 - e^{-1})^2 > 0.

On the other hand, the information cannot be larger than the information when the X process is observed directly. The latter is of order p^{-1} in the limit.

Acknowledgement

We thank the anonymous referee for an exceptionally careful reading of the first version of the manuscript, leading to a substantially improved paper.

Appendix

Proof of Lemma 3.1. The first part is clear, since the kth derivative depends on h only through its first k derivatives at 0, and hence it will not be changed if h is replaced by any other function with the same first k derivatives. For the second part, we first observe that

    (∂^k/∂ϑ^k) f(ϑ, ϑ^2/2, ..., ϑ^k/k!) |_{ϑ=0} = Σ_J C(J, k) (∂^{|J|} f / ∂u_{J_1} ... ∂u_{J_{|J|}}) |_{u_1 = ... = u_k = 0},

where the constants C(J, k) do not depend on f, and, without loss of generality, C(J, k) = C(J', k) if J' is a permutation of J. We will find these constants by considering a convenient family of functions f. Let

    f(u_1, ..., u_k) = Π_{j=1}^k u_j^{m_j}/m_j!.

Then

    (∂^{|J|} f / ∂u_{J_1} ... ∂u_{J_{|J|}}) |_{u_1 = ... = u_k = 0} = 1 if J ∈ A(m_1, ..., m_k), 0 otherwise,

where A(m_1, ..., m_k) = {J ∈ J^+(Σ_i m_i) : Σ_j 1(J_j = i) = m_i, i = 1, ..., k}. Hence

    (∂^k/∂ϑ^k) f(ϑ, ϑ^2/2, ..., ϑ^k/k!) |_{ϑ=0} = C(J, k) |A(m_1, ..., m_k)| = C(J, k) (Σ_{i=1}^k m_i)! / Π_{i=1}^k m_i!,    (25)

where J is any member of A(m_1, ..., m_k). On the other hand, we can compute directly that

    f(ϑ, ϑ^2/2, ..., ϑ^k/k!) = ϑ^{Σ_i i m_i} / (Π_{i=1}^k m_i! Π_{i=1}^k (i!)^{m_i})

and

    (∂^k/∂ϑ^k) f(ϑ, ϑ^2/2, ..., ϑ^k/k!) |_{ϑ=0} = k! / (Π_{i=1}^k m_i! Π_{i=1}^k (i!)^{m_i}) if Σ_i i m_i = k, 0 otherwise.    (26)

Comparing (25) to (26), and noting that |J| = Σ_i m_i and J^+ = Σ_i i m_i, we obtain

    C(J, k) = k!/(|J|! J!) if J^+ = k, 0 otherwise,

and the proof is complete.

Proof of Lemma 3.5. The lemma follows since, for any two random variables U_1 and U_2 with joint density h_{U_1 U_2}(·,·) with respect to some measure µ_1 × µ_2,

    h^{(m)}_{U_1 U_2}(u_1, u_2) = Σ_{j=0}^m (m choose j) h^{(j)}_{U_1}(u_1) h^{(m-j)}_{U_2|U_1}(u_2 | u_1).

Hence, for any function f(·),

    E{f(U_1) h^{(m)}_{U_1 U_2}(U_1, U_2) / h_{U_1 U_2}(U_1, U_2)}
      = Σ_{j=0}^m (m choose j) ∫∫ f(u_1) h^{(j)}_{U_1}(u_1) h^{(m-j)}_{U_2|U_1}(u_2 | u_1) dµ_1(u_1) dµ_2(u_2)
      = ∫ f(u_1) h^{(m)}_{U_1}(u_1) dµ_1(u_1)
      = E{f(U_1) h^{(m)}_{U_1}(U_1) / h_{U_1}(U_1)},

since

    ∫ h^{(m-j)}_{U_2|U_1}(u_2 | u_1) dµ_2(u_2) = 1 if j = m, 0 if j ≠ m.

Now, let U_1 represent Y_a^b and U_2 all the Y's outside this range.

Lemma 4.1. (i) For any real-valued sequence a_1, ..., a_n,

    (Σ_{i=1}^n a_i)^k = Σ_{J ∈ J^+(k)} (k!/J!) Σ_{I ∈ J_1^n(|J|)} a_I^J,

with a_i = 0 for i > n. (ii) For any c > 0 and k > 0,

    Σ_{J ∈ J^+(k)} c^{|J|} = c(c + 1)^{k-1},    Σ_{J ∈ J^+(k)} c^{|J|}/|J|! ≤ (c ∨ k)^k 2^{k-1}/k!.

Proof. Part (i) follows by expanding the expression on the left-hand side. For part (ii), let a_k = Σ_{J ∈ J^+(k)} c^{|J|}. Then, for x small enough,

    Σ_{k=1}^∞ a_k x^k = Σ_{k=1}^∞ Σ_{J ∈ J^+(k)} c^{|J|} x^{J^+}
                      = Σ_{ℓ=1}^∞ (Σ_{i=1}^∞ c x^i)^ℓ
                      = Σ_{ℓ=1}^∞ (cx/(1 - x))^ℓ
                      = cx/(1 - (c + 1)x)
                      = (c/(c + 1)) Σ_{k=1}^∞ ((c + 1)x)^k.
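The closed form in part (ii) can be checked by brute force for small k. The sketch below is our illustration (the enumeration helper is not from the paper): it enumerates J⁺(k), the compositions of k, and verifies Σ c^{|J|} = c(c + 1)^{k−1} together with |J⁺(k)| = 2^{k−1}.

```python
from itertools import combinations

def compositions(k):
    """Enumerate J^+(k): ordered tuples of positive integers summing to k."""
    out = []
    for r in range(k):                       # r = number of cut points
        for cuts in combinations(range(1, k), r):
            bounds = (0,) + cuts + (k,)
            out.append(tuple(bounds[i + 1] - bounds[i]
                             for i in range(len(bounds) - 1)))
    return out

c = 3.0
for k in range(1, 9):
    J = compositions(k)
    assert len(J) == 2 ** (k - 1)            # |J^+(k)| = 2^{k-1}
    total = sum(c ** len(j) for j in J)
    assert abs(total - c * (c + 1) ** (k - 1)) < 1e-9
print("identity verified for k = 1..8")
```

Each composition corresponds to a choice of cut points among the k − 1 gaps of 1 + 1 + ... + 1, which is where the count 2^{k−1} comes from.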

The first claim follows. To prove the second claim, note that if c ≥ k then c^{|J|}/|J|! ≤ c^k/k! (since |J| ≤ k) and Σ_{J ∈ J^+(k)} 1 = 2^{k-1} (e.g., by the first part). On the other hand, if c ≤ k then c^{|J|}/|J|! ≤ k^k/k!.

References

Barndorff-Nielsen, O. and Cox, D.R. (1989). Asymptotic Techniques for Use in Statistics. Chapman and Hall, London.

Baum, L.E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist. 37.

Bickel, P.J., Götze, F. and van Zwet, W.R. (1985). A simple analysis of third-order efficiency of estimates. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer, Vol. II. Wadsworth, Belmont, CA.

Bickel, P.J. and Ghosh, J.K. (1990). A decomposition for the likelihood ratio statistic and the Bartlett correction: a Bayesian argument. Ann. Statist. 18.

Bickel, P.J. and Ritov, Y. (1996). Inference in hidden Markov models I: Local asymptotic normality in the stationary case. Bernoulli 2.

Bickel, P.J., Ritov, Y. and Rydén, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. Ann. Statist. 26.

Douc, R., Moulines, E. and Rydén, T. (2001). Asymptotic properties of the maximum likelihood estimator in autoregressive models with Markov regime. Preprint.

Doukhan, P. (1994). Mixing: Properties and Examples. Springer Lecture Notes in Statistics 85. Springer-Verlag, New York.

Fredkin, D.R. and Rice, J.A. (1992). Maximum likelihood estimation and identification directly from single-channel recordings. Proc. Roy. Soc. Lond. B 249.

Hall, P. (1988). Rate of convergence in bootstrap approximations. Ann. Probab. 16.

Jensen, J.L. and Petersen, N.V. (1999). Asymptotic normality of the maximum likelihood estimator in state space models. Ann. Statist. 27.

Kalman, R.E. (1977). A new approach to linear filtering and prediction problems. In Linear Least-Squares Estimation. Dowden, Hutchinson & Ross, Stroudsburg, PA.

Leroux, B.G. (1992). Maximum-likelihood estimation for hidden Markov models. Stoch. Proc. Appl. 40.

Leroux, B.G. and Puterman, M.L. (1992). Maximum-penalized-likelihood estimation for independent and Markov-dependent mixture models. Biometrics 48.

Louis, T.A. (1982). Finding the observed information matrix when using the EM algorithm. J. Roy. Statist. Soc. B 44.

MacDonald, I.L. and Zucchini, W. (1997). Hidden Markov and Other Models for Discrete-valued Time Series. Chapman & Hall, London.

Meilijson, I. (1989). A fast improvement to the EM algorithm on its own terms. J. Roy. Statist. Soc. B 51.

Petrie, T. (1969). Probabilistic functions of finite state Markov chains. Ann. Math. Statist. 40.

Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77.

Saulis, L. and Statulevičius, V.A. (1991). Limit Theorems for Large Deviations. Kluwer, Dordrecht.


More information

This article was published in an Elsevier journal. The attached copy is furnished to the author for non-commercial research and education use, including for instruction at the author s institution, sharing

More information

A Generalization of Wigner s Law

A Generalization of Wigner s Law A Generalization of Wigner s Law Inna Zakharevich June 2, 2005 Abstract We present a generalization of Wigner s semicircle law: we consider a sequence of probability distributions (p, p 2,... ), with mean

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process

Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Applied Mathematical Sciences, Vol. 4, 2010, no. 62, 3083-3093 Sequential Procedure for Testing Hypothesis about Mean of Latent Gaussian Process Julia Bondarenko Helmut-Schmidt University Hamburg University

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

Course 495: Advanced Statistical Machine Learning/Pattern Recognition

Course 495: Advanced Statistical Machine Learning/Pattern Recognition Course 495: Advanced Statistical Machine Learning/Pattern Recognition Lecturer: Stefanos Zafeiriou Goal (Lectures): To present discrete and continuous valued probabilistic linear dynamical systems (HMMs

More information

A combinatorial problem related to Mahler s measure

A combinatorial problem related to Mahler s measure A combinatorial problem related to Mahler s measure W. Duke ABSTRACT. We give a generalization of a result of Myerson on the asymptotic behavior of norms of certain Gaussian periods. The proof exploits

More information

Example: physical systems. If the state space. Example: speech recognition. Context can be. Example: epidemics. Suppose each infected

Example: physical systems. If the state space. Example: speech recognition. Context can be. Example: epidemics. Suppose each infected 4. Markov Chains A discrete time process {X n,n = 0,1,2,...} with discrete state space X n {0,1,2,...} is a Markov chain if it has the Markov property: P[X n+1 =j X n =i,x n 1 =i n 1,...,X 0 =i 0 ] = P[X

More information

Notes on Poisson Approximation

Notes on Poisson Approximation Notes on Poisson Approximation A. D. Barbour* Universität Zürich Progress in Stein s Method, Singapore, January 2009 These notes are a supplement to the article Topics in Poisson Approximation, which appeared

More information

ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS

ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS ON KRONECKER PRODUCTS OF CHARACTERS OF THE SYMMETRIC GROUPS WITH FEW COMPONENTS C. BESSENRODT AND S. VAN WILLIGENBURG Abstract. Confirming a conjecture made by Bessenrodt and Kleshchev in 1999, we classify

More information

Estimation of parametric functions in Downton s bivariate exponential distribution

Estimation of parametric functions in Downton s bivariate exponential distribution Estimation of parametric functions in Downton s bivariate exponential distribution George Iliopoulos Department of Mathematics University of the Aegean 83200 Karlovasi, Samos, Greece e-mail: geh@aegean.gr

More information

Lines With Many Points On Both Sides

Lines With Many Points On Both Sides Lines With Many Points On Both Sides Rom Pinchasi Hebrew University of Jerusalem and Massachusetts Institute of Technology September 13, 2002 Abstract Let G be a finite set of points in the plane. A line

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

List coloring hypergraphs

List coloring hypergraphs List coloring hypergraphs Penny Haxell Jacques Verstraete Department of Combinatorics and Optimization University of Waterloo Waterloo, Ontario, Canada pehaxell@uwaterloo.ca Department of Mathematics University

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

Expectation Maximization (EM) Algorithm. Each has it s own probability of seeing H on any one flip. Let. p 1 = P ( H on Coin 1 )

Expectation Maximization (EM) Algorithm. Each has it s own probability of seeing H on any one flip. Let. p 1 = P ( H on Coin 1 ) Expectation Maximization (EM Algorithm Motivating Example: Have two coins: Coin 1 and Coin 2 Each has it s own probability of seeing H on any one flip. Let p 1 = P ( H on Coin 1 p 2 = P ( H on Coin 2 Select

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data

High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data High Dimensional Empirical Likelihood for Generalized Estimating Equations with Dependent Data Song Xi CHEN Guanghua School of Management and Center for Statistical Science, Peking University Department

More information

Feedback Capacity of a Class of Symmetric Finite-State Markov Channels

Feedback Capacity of a Class of Symmetric Finite-State Markov Channels Feedback Capacity of a Class of Symmetric Finite-State Markov Channels Nevroz Şen, Fady Alajaji and Serdar Yüksel Department of Mathematics and Statistics Queen s University Kingston, ON K7L 3N6, Canada

More information

arxiv: v2 [math.pr] 4 Sep 2017

arxiv: v2 [math.pr] 4 Sep 2017 arxiv:1708.08576v2 [math.pr] 4 Sep 2017 On the Speed of an Excited Asymmetric Random Walk Mike Cinkoske, Joe Jackson, Claire Plunkett September 5, 2017 Abstract An excited random walk is a non-markovian

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 12 Dynamical Models CS/CNS/EE 155 Andreas Krause Homework 3 out tonight Start early!! Announcements Project milestones due today Please email to TAs 2 Parameter learning

More information

KALMAN-TYPE RECURSIONS FOR TIME-VARYING ARMA MODELS AND THEIR IMPLICATION FOR LEAST SQUARES PROCEDURE ANTONY G AU T I E R (LILLE)

KALMAN-TYPE RECURSIONS FOR TIME-VARYING ARMA MODELS AND THEIR IMPLICATION FOR LEAST SQUARES PROCEDURE ANTONY G AU T I E R (LILLE) PROBABILITY AND MATHEMATICAL STATISTICS Vol 29, Fasc 1 (29), pp 169 18 KALMAN-TYPE RECURSIONS FOR TIME-VARYING ARMA MODELS AND THEIR IMPLICATION FOR LEAST SQUARES PROCEDURE BY ANTONY G AU T I E R (LILLE)

More information

arxiv: v1 [math.co] 3 Nov 2014

arxiv: v1 [math.co] 3 Nov 2014 SPARSE MATRICES DESCRIBING ITERATIONS OF INTEGER-VALUED FUNCTIONS BERND C. KELLNER arxiv:1411.0590v1 [math.co] 3 Nov 014 Abstract. We consider iterations of integer-valued functions φ, which have no fixed

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

MATH 205C: STATIONARY PHASE LEMMA

MATH 205C: STATIONARY PHASE LEMMA MATH 205C: STATIONARY PHASE LEMMA For ω, consider an integral of the form I(ω) = e iωf(x) u(x) dx, where u Cc (R n ) complex valued, with support in a compact set K, and f C (R n ) real valued. Thus, I(ω)

More information

arxiv: v3 [math.ac] 29 Aug 2018

arxiv: v3 [math.ac] 29 Aug 2018 ON THE LOCAL K-ELASTICITIES OF PUISEUX MONOIDS MARLY GOTTI arxiv:1712.00837v3 [math.ac] 29 Aug 2018 Abstract. If M is an atomic monoid and x is a nonzero non-unit element of M, then the set of lengths

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

HIDDEN MARKOV CHANGE POINT ESTIMATION

HIDDEN MARKOV CHANGE POINT ESTIMATION Communications on Stochastic Analysis Vol. 9, No. 3 (2015 367-374 Serials Publications www.serialspublications.com HIDDEN MARKOV CHANGE POINT ESTIMATION ROBERT J. ELLIOTT* AND SEBASTIAN ELLIOTT Abstract.

More information

Using Markov Chains To Model Human Migration in a Network Equilibrium Framework

Using Markov Chains To Model Human Migration in a Network Equilibrium Framework Using Markov Chains To Model Human Migration in a Network Equilibrium Framework Jie Pan Department of Mathematics and Computer Science Saint Joseph s University Philadelphia, PA 19131 Anna Nagurney School

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

Multivariate Analysis and Likelihood Inference

Multivariate Analysis and Likelihood Inference Multivariate Analysis and Likelihood Inference Outline 1 Joint Distribution of Random Variables 2 Principal Component Analysis (PCA) 3 Multivariate Normal Distribution 4 Likelihood Inference Joint density

More information

Z-estimators (generalized method of moments)

Z-estimators (generalized method of moments) Z-estimators (generalized method of moments) Consider the estimation of an unknown parameter θ in a set, based on data x = (x,...,x n ) R n. Each function h(x, ) on defines a Z-estimator θ n = θ n (x,...,x

More information

HIGHER ORDER CUMULANTS OF RANDOM VECTORS, DIFFERENTIAL OPERATORS, AND APPLICATIONS TO STATISTICAL INFERENCE AND TIME SERIES

HIGHER ORDER CUMULANTS OF RANDOM VECTORS, DIFFERENTIAL OPERATORS, AND APPLICATIONS TO STATISTICAL INFERENCE AND TIME SERIES HIGHER ORDER CUMULANTS OF RANDOM VECTORS DIFFERENTIAL OPERATORS AND APPLICATIONS TO STATISTICAL INFERENCE AND TIME SERIES S RAO JAMMALAMADAKA T SUBBA RAO AND GYÖRGY TERDIK Abstract This paper provides

More information

Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses

Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses Ann Inst Stat Math (2009) 61:773 787 DOI 10.1007/s10463-008-0172-6 Generalized Neyman Pearson optimality of empirical likelihood for testing parameter hypotheses Taisuke Otsu Received: 1 June 2007 / Revised:

More information

A CLT FOR MULTI-DIMENSIONAL MARTINGALE DIFFERENCES IN A LEXICOGRAPHIC ORDER GUY COHEN. Dedicated to the memory of Mikhail Gordin

A CLT FOR MULTI-DIMENSIONAL MARTINGALE DIFFERENCES IN A LEXICOGRAPHIC ORDER GUY COHEN. Dedicated to the memory of Mikhail Gordin A CLT FOR MULTI-DIMENSIONAL MARTINGALE DIFFERENCES IN A LEXICOGRAPHIC ORDER GUY COHEN Dedicated to the memory of Mikhail Gordin Abstract. We prove a central limit theorem for a square-integrable ergodic

More information