Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models

Size: px

Start display at page:

Download "Dynamic models. Dependent data The AR(p) model The MA(q) model Hidden Markov models. 6 Dynamic models"

Matthew Williamson
5 years ago
Views:

1 6 Dependent data The AR(p) model The MA(q) model Hidden Markov models

2 Dependent data Dependent data Huge portion of real-life data involving dependent datapoints Example (Capture-recapture) capture histories capture sizes

3 Dependent data Eurostoxx 50 First four stock indices of of the financial index Eurostoxx ABN AMRO HOLDING AEGON t AHOLD KON t AIR LIQUIDE t t

4 Dependent data Markov chain Stochastic process (x t ) t T where distribution of x t given the past values x 0:(t 1) only depends on x t 1. Homogeneity: distribution of x t given the past constant in t T. Corresponding likelihood l(θ x 0:T ) = f 0 (x 0 θ) T f(x t x t 1,θ) t=1 [Homogeneity means f independent of t]

5 Dependent data Stationarity constraints Difference with the independent case: stationarity and causality constraints put restrictions on the parameter space

6 Dependent data Stationarity processes Definition (Stationary stochastic process) (x t ) t T is stationary if the joint distributions of (x 1,...,x k ) and (x 1+h,...,x k+h ) are the same for all h,k s. It is second-order stationary if, given the autocovariance function then γ x (r,s) = E[{x r E(x r )}{x s E(x s )}], r,s T, E(x t ) = µ and γ x (r,s) = γ x (r + t,s + t) γ x (r s) for all r,s,t T.

7 Dependent data Imposing or not imposing stationarity Bayesian inference on a non-stationary process can be [formaly] conducted Debate From a Bayesian point of view, to impose the stationarity condition is objectionable:stationarity requirement on finite datasets artificial and/or datasets themselves should indicate whether the model is stationary Reasons for imposing stationarity:asymptotics (Bayes estimators are not necessarily convergent in non-stationary settings) causality, identifiability and... common practice.

8 Dependent data Unknown stationarity constraints Practical difficulty: for complex models, stationarity constraints get quite involved to the point of being unknown in some cases

9 The AR(p) model The AR(1) model Case of linear Markovian dependence on the last value x t = µ + (x t 1 µ) + ǫ t,ǫ t i.i.d. N (0,σ 2 ) If < 1, (x t ) t Z can be written as x t = µ + j=0 and this is a stationary representation. jǫ t j

10 The AR(p) model Stationary but... If > 1, alternative stationary representation x t = µ j ǫ t+j. j=1 This stationary solution is criticized as artificial because x t is correlated with future white noises (ǫ t ) s>t, unlike the case when < 1. Non-causal representation...

11 The AR(p) model Standard constraint c Customary to restrict AR(1) processes to the case < 1 Thus use of a uniform prior on [ 1,1] for Exclusion of the case = 1 that leads to a random walk because the process is then a random walk [no stationary solution]

12 The AR(p) model The AR(p) model Conditional model x t x t 1,... N Generalisation of AR(1) ( µ + ) p i(x t i µ),σ 2 i=1 Among the most commonly used models in dynamic settings More challenging than the static models (stationarity constraints) Different models depending on the processing of the starting value x 0

13 The AR(p) model Stationarity+causality Stationarity constraints in the prior as a restriction on the values of θ. Theorem AR(p) model second-order stationary and causal iff the roots of the polynomial p P(x) = 1 ix i are all outside the unit circle i=1

14 The AR(p) model Initial conditions Unobserved initial values can be processed in various ways 1 All x i s (i > 0) set equal to µ, for computational convenience 2 Under stationarity and causality constraints, (x t ) t Z has a stationary distribution: Assume x p: 1 distributed from stationary N p (µ1 p,a) distribution Corresponding marginal likelihood Z T (! 2 ) Y σ T 1 px exp x 2σ 2 t µ i(x t i µ) t=0 i=1 f(x p: 1 µ,a)dx p: 1,

15 The AR(p) model Initial conditions (cont d) 3 Condition instead on the initial observed values x 0:(p 1) l c (µ, 1,..., p, σ x p:t,x 0:(p 1) ) ( ) T 2 p /2σ σ T exp x t µ i(x t i µ) 2. t=p i=1

16 The AR(p) model Prior selection For AR(1) model, Jeffreys prior associated with the stationary representation is π J 1 (µ,σ 2, ) 1 σ Extension to higher orders quite complicated ( part)! Natural conjugate prior for θ = (µ, 1,..., p,σ 2 ) : normal distribution on (µ, 1,..., p) and inverse gamma distribution on σ 2... and for constrained s?

17 The AR(p) model Stationarity constraints Under stationarity constraints, complex parameter space: each value of needs to be checked for roots of corresponding polynomial with modulus less than 1 E.g., for an AR(2) process with autoregressive polynomial P(u) = 1 1u 2u 2, constraint is < 1, 1 2 < 1 and 2 < 1. Skip Durbin forward

18 The AR(p) model A first useful reparameterisation Durbin Levinson recursion proposes a reparametrisation from the parameters i to the partial autocorrelations ψ i [ 1,1] which allow for a uniform prior on the hypercube. Partial autocorrelation defined as ψ i = corr (x t E[x t x t+1,...,x t+i 1 ], [see also Yule-Walker equations] x t+i E[x t+1 x t+1,...,x t+i 1 ])

19 The AR(p) model Durbin Levinson recursion Transform 1 Define ϕ ii = ψ i and ϕ ij = ϕ (i 1)j ψ i ϕ (i 1)(i j), for i > 1 and j = 1,,i 1. 2 Take i = ϕ pi for i = 1,,p.

20 The AR(p) model Stationarity & priors For AR(1) model, Jeffreys prior associated with the stationary representation is π J 1 (µ,σ 2, ) 1 σ Within the non-stationary region > 1, Jeffreys prior is π J 2 (µ,σ2, ) 1 σ T T(1 2). The dominant part of the prior is the non-stationary region!

21 The AR(p) model Alternative prior The reference prior π1 J is only defined when the stationary constraint holds. Idea Symmetrise to the region > 1 { π B (µ,σ 2, ) 1 1/ 1 2 if < 1, σ 2 1/ 2 1 if > 1,, pi x

22 The AR(p) model MCMC consequences When devising an MCMC algorithm, use the Durbin-Levinson recursion to end up with single normal simulations of the ψ i s since the j s are linear functions of the ψ i s

23 The AR(p) model Root parameterisation Skip Durbin back Lag polynomial representation ( ) p Id ib i x t = ǫ t i=1 with (inverse) roots p (Id λ i B) x t = ǫ t i=1 Closed form expression of the likelihood as a function of the (inverse) roots

24 The AR(p) model Uniform prior under stationarity Stationarity The λ i s are within the unit circle if in C [complex numbers] and within [ 1,1] if in R [real numbers] Naturally associated with a flat prior on either the unit circle or [ 1, 1] 1 1 k/ I 1 λ i <1 π I λ i <1 λ i R λ i R where k/2 + 1 number of possible cases Term k/2 + 1 is important for reversible jump applications

25 The AR(p) model MCMC consequences In a Gibbs sampler, each λ i can be simulated conditionaly on the others since p (Id λ i B) x t = y t λ i y t 1 = ǫ t i=1 where Y t = i i (Id λ i B) x t

26 The AR(p) model Metropolis-Hastings implementation 1 use the prior π itself as a proposal on the (inverse) roots of P, selecting one or several roots of P to be simulated from π; 2 acceptance ratio is likelihood ratio 3 need to watch out for real/complex dichotomy

27 The AR(p) model A [paradoxical] reversible jump implementation Define model M 2k (0 k p/2 ) as corresponding to a number 2k of complex roots o k p/2 ) Moving from model M 2k to model M 2k+2 means that two real roots have been replaced by two conjugate complex roots. Propose jump from M 2k to M 2k+2 with probability 1/2 and from M 2k to M 2k 2 with probability 1/2 [boundary exceptions] accept move from M 2k to M 2k+ or 2 with probability l c (µ, 1,..., p, σ x p:t,x 0:(p 1) ) l c (µ, 1,..., p, σ x p:t,x 0:(p 1) ) 1,

28 The AR(p) model Checking your code Try with no data and recover the prior k

29 The AR(p) model Checking your code Try with no data and recover the prior λ λ1

30 The AR(p) model Order estimation Typical setting for model choice: determine order p of AR(p) model Roots [may] change drastically from one p to the other. No difficulty from the previous perspective: recycle above reversible jump algorithm

31 The AR(p) model AR(?) reversible jump algorithm Use (purely birth-and-death) proposals based on the uniform prior k k+1 [Creation of real root] k k+2 [Creation of complex root] k k-1 [Deletion of real root] k k-2 [Deletion of complex root]

32 The AR(p) model Reversible jump output xt t p αi p=3 p=4 p= αi αi αi p=3 means mean Im(λi) α AR(3) simulated dataset of 530 points (upper left) with true parameters α i ( 0.1, 0.3, 0.4) and σ = 1. First histogram associated with p, following histograms with the α i s, for different values of p, and of σ 2. Final graph: scatterplot of the complex roots. One before last: evolution of α 1, α 2, α 3. σ 2 Iterations Re(λi)

33 The MA(q) model The MA(q) model Alternative type of time series q x t = µ + ǫ t ϑ j ǫ t j, ǫ t N(0,σ 2 ) j=1 Stationary but, for identifiability considerations, the polynomial q Q(x) = 1 ϑ j x j j=1 must have all its roots outside the unit circle.

34 The MA(q) model Identifiability Example For the MA(1) model, x t = µ + ǫ t ϑ 1 ǫ t 1, var(x t ) = (1 + ϑ 2 1 )σ2 can also be written x t = µ + ǫ t 1 1 t, ǫ N(0,ϑ ϑ 1 ǫ 2 1 σ2 ), Both pairs (ϑ 1,σ) & (1/ϑ 1,ϑ 1 σ) lead to alternative representations of the same model.

35 The MA(q) model Properties of MA models Non-Markovian model (but special case of hidden Markov) Autocovariance γ x (s) is null for s > q

36 The MA(q) model Representations x 1:T is a normal random variable with constant mean µ and covariance matrix σ 2 γ 1 γ 2... γ q γ 1 σ 2 γ 1... γ q 1 γ q Σ =..., γ 1 σ 2 with ( s q) q s γ s = σ 2 i=0 Not manageable in practice [large T s] ϑ i ϑ i+ s

37 The MA(q) model Representations (contd.) Conditional on past (ǫ 0,...,ǫ q+1 ), where (t > 0) L(µ, ϑ 1,..., ϑ q, σ x 1:T, ǫ 0,..., ǫ q+1 ) T q σ T exp x t µ + ϑ jˆǫ t j t=1 ˆǫ t = x t µ + j=1 2 /2σ 2, q ϑ jˆǫ t j, ˆǫ 0 = ǫ 0,..., ˆǫ 1 q = ǫ 1 q j=1 Recursive definition of the likelihood, still costly O(T q)

38 The MA(q) model Recycling the AR algorithm Same algorithm as in the AR(p) case when modifying the likelihood Simulation of the past noises ǫ i (i = 1,...,q) done via a Metropolis-Hastings step with target 0 f(ǫ 0,...,ǫ q+1 x 1:T,µ,σ,ϑ) i= q+1 e ǫ2 i /2σ2 T t=1 e bǫ2 t /2σ2,

39 The MA(q) model Representations (contd.) Encompassing approach for general time series models State-space representation x t = Gy t + ε t, (1) y t+1 = Fy t + ξ t, (2) (1) is the observation equation and (2) is the state equation Note As seen below, this is a special case of hidden Markov model

40 The MA(q) model MA(q) state-space representation For the MA(q) model, take y t = (ǫ t q,...,ǫ t 1,ǫ t ) and then y t+1 = y t + ǫ t x t = µ ( ϑ q ϑ q 1... ϑ 1 1 ) y t.

41 The MA(q) model MA(q) state-space representation (cont d) Example For the MA(1) model, observation equation with x t = (1 0)y t y t = (y 1t y 2t ) directed by the state equation ( ( ) 0 1 y t+1 = )y 0 0 t + ǫ t+1. 1ϑ1

42 The MA(q) model ARMA extension ARMA(p,q) model x t p ix t 1 = µ + ǫ t i=1 q ϑ j ǫ t j, ǫ t N(0,σ 2 ) Identical stationarity and identifiability conditions for both groups ( 1,..., p) and (ϑ 1,...,ϑ q ) j=1

43 The MA(q) model Reparameterisation Identical root representations p (Id λ i B)x t = i=1 State-space representation and q (Id η i B)ǫ t i=1 x t = x t = µ ( ϑ r 1 ϑ r 2... ϑ 1 1 ) y t y t+1 = B A r r 1 r yt + ǫt B., 0A 1 under the convention that m = 0 if m > p and ϑ m = 0 if m > q.

44 The MA(q) model Bayesian approximation Quasi-identical MCMC implementation: 1 Simulate ( 1,..., p) conditional on (ϑ 1,...,ϑ q ) and µ 2 Simulate (ϑ 1,...,ϑ q ) conditional on ( 1,..., p) and µ 3 Simulate (µ,σ) conditional on ( 1,..., p) and (ϑ 1,...,ϑ q ) c Code can be recycled almost as is!

45 Hidden Markov models Hidden Markov models Generalisation both of a mixture and of a state space model. Example Extension of a mixture model with Markov dependence x t z,x j j t N(µ zt,σz 2 t ), P(z t = u z j,j < t) = p zt 1 u, (u = 1,...,k) Label switching also strikes in this model!

46 Hidden Markov models Generic dependence graph x t x t+1 y t y t+1 (x t,y t ) x 0:(t 1),y 0:(t 1) f(y t y t 1 )f(x t y t )

47 Hidden Markov models Definition Observable series {x t } t 1 associated with a second process {y t } t 1, with a finite set of N possible values such that 1. indicators Y t have an homogeneous Markov dynamic p(y t y 1:t 1 ) = p(y t y t 1 ) = P yt 1 y t where y 1:t 1 denotes the sequence {y 1,y 2,...,y t 1 }. 2. Observables x t are independent conditionally on the indicators y t T p(x 1:T y 1:T ) = p(x t y t ) t=1

48 Hidden Markov models Dnadataset DNA sequence [made of A, C, G, and T s] corresponding to a complete HIV genome where A, C, G, and T have been recoded as 1,...,4. Possible modeling by a two-state hidden Markov model with Y = {1,2} and X = {1,2,3,4}

49 Hidden Markov models Parameterization For the Markov bit, transition matrix P = [p ij ] and initial distribution for the observables, where = P N p ij = 1 j=1 f i (x t ) = p(x t y t = i) = f(x t θ i ) usually within the same parametrized class of distributions.

50 Hidden Markov models Finite case When both hidden and observed chains are finite, with Y = {1,...,κ} and X = {1,...,k}, parameter θ made up of p probability vectors q 1 = (q 1 1,...,q1 k ),...,qκ = (q κ 1,...,qκ k ) Joint distribution of (x t,y t ) 0 t T y0 q y 0 x 0 T t=1 p yt 1 y t q yt x t,

51 Hidden Markov models Bayesian inference in the finite case Posterior of (θ, P) given (x t,y t ) t factorizes as π(θ, P) y0 κ i=1 j=1 k (qj i )n ij κ p i=1 j=1 p m ij ij, where n ij # of visits to state j by the x t s when the corresponding y t s are equal to i and m ij # of transitions from state i to state j on the hidden chain (y t ) t N Under a flat prior on p ij s and q i j s, posterior distributions are [almost] Dirichlet [initial distribution side effect]

52 Hidden Markov models MCMC implementation Finite State HMM Gibbs Sampler Initialization: 1 Generate random values of the p ij s and of the q i j s 2 Generate the hidden Markov chain (y t ) 0 t T by (i = 1, 2) P(y t = i) { p ii qx i 0 if t = 0, p yt 1i qx i t if t > 0, and compute the corresponding sufficient statistics

53 Hidden Markov models MCMC implementation (cont d) Finite State HMM Gibbs Sampler Iteration m (m 1): 1 Generate (p i1,..., p iκ ) D(1 + n i1,...,1 + n iκ ) (q i 1,...,qi k ) D(1 + m i1,...,1 + m ik ) and correct for missing initial probability by a MH step with acceptance probability y 0 / y0 2 Generate successively each y t (0 t T) by { p ii qx i P(y t = i x t, y t 1, y t+1 ) 1 p iy1 if t = 0, p yt 1i qx i t p iyt+1 if t > 0, and compute corresponding sufficient statistics

54 Hidden Markov models Dnadataset p p q q q q q q 3 2

55 Hidden Markov models Forward-Backward formulae Existence of a (magical) recurrence relation that provides the observed likelihood function in manageable computing time Called forward-backward or Baum Welch formulas

56 Hidden Markov models Observed likelihood computation Likelihood of the complete model simple: T l c (θ x,y) = p yt 1 y t f(x t θ yt ) t=2 but likelihood of the observed model is not: l(θ x) = l c (θ x,y) y {1,...,κ} T c O(κ T ) complexity

57 Hidden Markov models Forward-Backward paradox It is possible to express the (observed) likelihood L O (θ x) in O(T 2 κ) computations, based on the Markov property of the pair (x t,y t ). Direct to backward smoothing

58 Hidden Markov models Conditional distributions We have and p(y 1:t x 1:t ) = f(x t y t )p(y 1:t x 1:(t 1) ) p(x t x 1:(t 1) ) p(y 1:t x 1:(t 1) ) = k(y t y t 1 )p(y 1:(t 1) x 1:(t 1) ) [Smoothing/Bayes] [Prediction] where k(y t y t 1 ) = p yt 1 y t associated with the matrix P and f(x t y t ) = f(x t θ yt )

59 Hidden Markov models Update of predictive Therefore p(y 1:t x 1:t ) = p(y t x 1:(t 1) )f(x t y t ) p(x t x 1:(t 1) ) = f(x t y t )k(y t y t 1 ) p(y p(x t x 1:(t 1) ) 1:(t 1) x 1:(t 1) ) with the same order of complexity for p(y 1:t x 1:t ) as for p(x t x 1:(t 1) )

60 Hidden Markov models Propagation and actualization equations p(y t x 1:(t 1) ) = y 1:(t 1) p(y 1:(t 1) x 1:(t 1) )k(y t y t 1 ) and p(y t x 1:t ) = p(y t x 1:(t 1) )f(x t y t ) p(x t x 1:(t 1) ). [Propagation] [Actualization]

61 Hidden Markov models Forward backward equations (1) Evaluation of p(y t x 1:T ) t T by forward-backward algorithm Denote t T γ t (i) = P(y t = i x 1:T ) α t (i) = p(x 1:t,y t = i) β t (i) = p(x t+1:t y t = i)

62 Hidden Markov models Recurrence relations Then and α 1 (i) = f(x 1 y t = i) i κ α t+1 (j) = f(x t+1 y t+1 = j) α t (i)p ij j=1 i=1 β T (i) = 1 κ β t (i) = p ij f(x t+1 y t+1 = j)β t+1 (j) γ t (i) = α t(i)β t (i) κ α t (j)β t (j) j=1 [Forward] [Backward]

63 Hidden Markov models Extension of the recurrence relations For ξ t (i,j) = P(y t = i,y t+1 = j x 1:T ) i,j = 1,...,κ, we also have ξ t (i,j) = α t (i)p ij f(x t+1 y t = j)β t+1 (j) κ κ α t (i)p ij f(x t+1 y t+1 = j)β t+1 (j) i=1 j=1

64 Hidden Markov models Overflows and underflows On-line scalings of the α t (i) s and β T (i) s for each t by / κ / κ c t = 1 α t (i) and d t = 1 β t (i) i=1 i=1 avoid overflows or/and underflows for large datasets

65 Hidden Markov models Backward smoothing Recursive derivation of conditionals We have p(y s y s 1,x 1:t ) = p(y s y s 1,x s:t ) Therefore (s = T,T 1,...,1) [Markov property!] p(y s y s 1,x 1:T ) k(y s y s 1 )f(x s y s ) y s+1 p(y s+1 y s,x 1:T ) with p(y T y T 1,x 1:T ) k(y T y T 1 )f(x T y t ). [Backward equation]

66 Hidden Markov models End of the backward smoothing The first term is p(y 1 x 1:t ) π(y 1 )f(x 1 y 1 ) y 2 p(y 2 y 1,x 1:t ), with π stationary distribution of P The conditional for y s needs to be defined for each of the κ values of y s 1 c O(t κ 2 ) operations

67 Hidden Markov models Details Need to introduce unnormalized version of the conditionals p(y t y t 1,x 0:T ) such that p T(y T y T 1,x 0:T ) = p yt 1 y T f(x T y T ) κ p t (y t y t 1,x 1:T ) = p yt 1 y t f(x t y t ) p t+1 (i y t,x 1:T ) p 0(y 0 x 0:T ) = y0 f(x 0 y 0 ) i=1 κ p 1(i y 0,x 0:t ) i=1

68 Hidden Markov models Likelihood computation Bayes formula p(x 1:T ) = p(x 1:T y 1:T )p(y 1:T ) p(y 1:T x 1:T ) gives a representation of the likelihood based on the forward backward formulae and an arbitrary sequence x o 1:T (since the l.h.s. does not depend on x 1:T ). Obtained through the p t s as p(x 0:T ) = κ p 1(i x 0:T ) i=1

69 Hidden Markov models Prediction filter If Forward equations ϕ t (i) = p(y t = i x 1:t 1 ) ϕ 1 (j) = p(y 1 = j) ϕ t+1 (j) = 1 κ f(x t y t = i)ϕ t (i)p ij (t 1) c t i=1 where κ c t = f(x t y t = k)ϕ t (k), k=1

70 Hidden Markov models Likelihood computation (2) Follows the same principle as the backward equations The (log-)likelihood is thus [ t κ ] log p(x 1:t ) = log p(x t,y t = i x 1:(r 1) ) = r=1 i=1 [ t κ ] log f(x t y t = i)ϕ t (i) r=1 i=1

Basic math for biology

Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood