Nonparametric inference in hidden Markov and related models
Roland Langrock, Bielefeld University
Introduction and motivation
Figure: Haggis (the dish).
Figure: Wild Haggis (Dux magnus gentis venteris saginati).
Introducing hidden Markov models using Wild Haggis movement
Figure: simulated movement track.
Pr(S_t = j | S_{t-1} = j) = 0.95 for j = 1, 2, where S_t: state at time t
Introducing hidden Markov models using Wild Haggis movement
Figure: simulated movement track; state-dependent step length distributions; state-dependent turning angle distributions.
Pr(S_t = j | S_{t-1} = j) = 0.95 for j = 1, 2, where S_t: state at time t
Examples of HMM-type models/doubly stochastic processes
hidden Markov models (HMMs), (general) state-space models, Markov-switching regression, Cox point processes
In each case two components (hidden states S_{t-1}, S_t, S_{t+1}, ... driving observations X_{t-1}, X_t, X_{t+1}, ...):
an observable state-dependent process
1.) in animal movement: e.g. step lengths & turning angles
2.) in financial time series: some economic indicator, e.g. GDP values
3.) in disease progression: e.g. blood samples
a latent (nonobservable) state process/system process
in 1.): behavioural state
in 2.): the nervousness of the market
in 3.): the disease stage
Inference in HMM-type models
Why nonparametric?
1. specifying a suitable model can be hard: lots of ways to get it wrong!
2. more flexibility, perhaps leading to models that are more parsimonious, e.g. in terms of the number of states
3. as an exploratory tool
A strategy applicable in many scenarios combines the simple yet powerful HMM machinery... and the conceptual simplicity and general advantages of P-splines.
1 Some basics on hidden Markov models
2 Nonparametric inference in hidden Markov models
3 Markov-switching generalized additive models
4 Concluding remarks
Some basics on hidden Markov models
HMMs: summary/definition
two (discrete-time) stochastic processes, one of them hidden
distribution of observations determined by underlying state
hidden state process is an N-state Markov chain
Building blocks of HMMs
{S_t}_{t=1,2,...,T} is (usually) assumed to be an N-state Markov chain:
state transition probabilities: gamma_ij = Pr(S_t = j | S_{t-1} = i)
transition probability matrix (t.p.m.): Gamma = (gamma_ij), the N x N matrix with entries gamma_11, ..., gamma_1N in the first row through gamma_N1, ..., gamma_NN in the last
initial state distribution: delta = ( Pr(S_1 = 1), ..., Pr(S_1 = N) )
State-dependent distributions f(x_t | s_t = j):
specify suitable class of parametric distributions, e.g. normal, Poisson, Bernoulli, multivariate normal, gamma, Dirichlet, ...
one set of parameters for each state
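As a purely illustrative sketch of these building blocks in Python (all parameter values, the 2-state choice, and the normal state-dependent densities are assumptions, not from the talk):

```python
import numpy as np

# t.p.m. Gamma: rows sum to 1 (hypothetical 2-state example)
Gamma = np.array([[0.95, 0.05],
                  [0.05, 0.95]])

# stationary distribution delta: solve delta @ Gamma = delta, sum(delta) = 1,
# via the left eigenvector of Gamma for eigenvalue 1
evals, evecs = np.linalg.eig(Gamma.T)
delta = np.real(evecs[:, np.argmax(np.real(evals))])
delta = delta / delta.sum()

# state-dependent distributions: here normal densities,
# one parameter set (mean, sd) per state
means, sds = np.array([0.0, 5.0]), np.array([1.0, 2.0])

def state_densities(x):
    """f(x | s_t = j) for j = 1, ..., N (normal densities)."""
    return np.exp(-0.5 * ((x - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
```

For a symmetric t.p.m. like this one, the stationary distribution is uniform over the states.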
HMMs: likelihood calculation using brute force
L(theta) = f(x_1, ..., x_T)
         = sum_{s_1=1}^{N} ... sum_{s_T=1}^{N} f(x_1, ..., x_T, s_1, ..., s_T)
         = sum_{s_1=1}^{N} ... sum_{s_T=1}^{N} f(x_1, ..., x_T | s_1, ..., s_T) f(s_1, ..., s_T)
         = sum_{s_1=1}^{N} ... sum_{s_T=1}^{N} delta_{s_1} prod_{t=1}^{T} f(x_t | s_t) prod_{t=2}^{T} gamma_{s_{t-1}, s_t}
Simple form, but O(T N^T); numerical maximization of this expression is thus infeasible.
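The brute-force sum can be written down directly for a tiny example; this sketch enumerates all N^T state sequences (feasible only for very small T; all parameter values, the Gaussian densities, and the data are illustrative assumptions):

```python
import itertools
import numpy as np

# tiny illustrative setup: N = 2 states, T = 6 observations
Gamma = np.array([[0.9, 0.1], [0.1, 0.9]])
delta = np.array([0.5, 0.5])
means, sds = np.array([0.0, 4.0]), np.array([1.0, 1.5])
x = np.array([0.2, -0.3, 3.9, 4.5, 4.1, 0.1])

def dens(xt, j):
    """Normal state-dependent density f(x_t | s_t = j)."""
    return np.exp(-0.5 * ((xt - means[j]) / sds[j]) ** 2) / (sds[j] * np.sqrt(2 * np.pi))

# sum over all N^T state sequences: delta_{s_1} f(x_1|s_1)
# * prod_t gamma_{s_{t-1}, s_t} f(x_t | s_t)
L = 0.0
for s in itertools.product(range(2), repeat=len(x)):
    term = delta[s[0]] * dens(x[0], s[0])
    for t in range(1, len(x)):
        term *= Gamma[s[t - 1], s[t]] * dens(x[t], s[t])
    L += term
```

With N = 2 and T = 6 this is only 64 sequences, but the count grows as N^T, which is exactly why the forward recursion on the next slide matters.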
HMMs: likelihood calculation via forward algorithm
Consider instead the so-called forward probabilities, alpha_t(j) = f(x_1, ..., x_t, s_t = j).
These can be calculated using an efficient recursive scheme:
alpha_1 = delta Q(x_1)
alpha_t = alpha_{t-1} Gamma Q(x_t)
with Q(x_t) = diag( f(x_t | s_t = 1), ..., f(x_t | s_t = N) )
L(theta) = sum_{j=1}^{N} alpha_T(j) = delta Q(x_1) Gamma Q(x_2) ... Gamma Q(x_T) 1
Computational effort: O(T N^2), linear in T!
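The recursion above can be sketched in a few lines; this version adds the standard scaling trick (rescaling the forward vector at each step to avoid numerical underflow, a common addition not shown on the slide) and assumes Gaussian state-dependent densities with illustrative parameters:

```python
import numpy as np

def hmm_log_likelihood(x, Gamma, delta, means, sds):
    """log L(theta) = log( delta Q(x_1) Gamma Q(x_2) ... Gamma Q(x_T) 1 ),
    computed via the scaled forward recursion."""
    ll = 0.0
    phi = delta.copy()
    for t in range(len(x)):
        # q = diagonal of Q(x_t): the N state-dependent densities at x_t
        q = np.exp(-0.5 * ((x[t] - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
        # alpha_1 = delta Q(x_1); alpha_t = alpha_{t-1} Gamma Q(x_t)
        phi = (phi if t == 0 else phi @ Gamma) * q
        c = phi.sum()          # rescale to avoid underflow for large T
        ll += np.log(c)
        phi /= c
    return ll
```

The loop runs once per observation with an N x N matrix-vector product inside, giving the O(T N^2) effort stated on the slide.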
Further inference: a brief overview
uncertainty quantification: (parametric) bootstrap or Hessian-based
model selection: criteria such as the AIC
model checking: quantile residuals, simulation-based, ...
state decoding: Viterbi algorithm
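The Viterbi decoding step mentioned above can be sketched as follows (a log-space implementation; the Gaussian state-dependent densities and all parameter values are illustrative assumptions):

```python
import numpy as np

def viterbi(x, Gamma, delta, means, sds):
    """Most likely state sequence given observations (Viterbi algorithm),
    computed in log-space for numerical stability."""
    T, N = len(x), len(delta)
    def logd(xt):  # log state-dependent densities (normal)
        return -0.5 * ((xt - means) / sds) ** 2 - np.log(sds) - 0.5 * np.log(2 * np.pi)
    xi = np.zeros((T, N))              # best log-prob of paths ending in each state
    back = np.zeros((T, N), dtype=int) # backpointers
    xi[0] = np.log(delta) + logd(x[0])
    for t in range(1, T):
        scores = xi[t - 1][:, None] + np.log(Gamma)   # (from-state, to-state)
        back[t] = scores.argmax(axis=0)
        xi[t] = scores.max(axis=0) + logd(x[t])
    states = np.zeros(T, dtype=int)
    states[-1] = xi[-1].argmax()
    for t in range(T - 2, -1, -1):     # backtrack
        states[t] = back[t + 1, states[t + 1]]
    return states
```

With well-separated state-dependent distributions, the decoded sequence simply tracks which state's density dominates each observation.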
Related model classes
state-space models can be approximated arbitrarily accurately by HMMs by finely discretizing the state space
Markov-switching regression models: HMMs with covariates
Markov-modulated Poisson processes can be regarded as HMMs (with slightly modified dependence structure)
The corresponding likelihoods can be written as easy-to-evaluate matrix products!
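The discretization remark can be made concrete with a sketch: approximate a continuous-valued state process by an m-state Markov chain over interval midpoints. The AR(1) state process and all parameter values below are illustrative assumptions, not taken from the talk:

```python
import numpy as np
from scipy.stats import norm

# Approximate the AR(1) state process g_t = phi * g_{t-1} + sigma * eta_t
# by an m-state Markov chain: discretize the essential range of g_t into
# m intervals with midpoints b_i.
phi, sigma, m = 0.8, 1.0, 100
r = 3 * sigma / np.sqrt(1 - phi ** 2)       # +- 3 stationary sd's
edges = np.linspace(-r, r, m + 1)
mids = 0.5 * (edges[:-1] + edges[1:])

# gamma_ij = Pr(g_t in interval j | g_{t-1} = b_i), from the Gaussian
# transition density of the AR(1) process
Gamma = (norm.cdf(edges[None, 1:], loc=phi * mids[:, None], scale=sigma)
         - norm.cdf(edges[None, :-1], loc=phi * mids[:, None], scale=sigma))
Gamma /= Gamma.sum(axis=1, keepdims=True)   # renormalize truncated tail mass
```

The resulting m x m t.p.m. plugs straight into the HMM forward recursion, so the state-space likelihood becomes one of the easy-to-evaluate matrix products mentioned above.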
Nonparametric inference in hidden Markov models
HMMs: motivation for a nonparametric approach
distribution of observations determined by underlying state
state-dependent distributions usually from a class of parametric distributions
finding the right distribution, or even a suitable one, can be difficult
an unfortunate choice can lead to...
... a poor fit and hence poor predictive power
... a bad performance of the state decoding
... invalid inference, e.g. on the number of states
Figure: observed time series; histogram of observations.
What family of distributions to use for the state-dependent process?
Nonparametric estimation based on P-splines
represent densities of state-dep. distributions using standardized B-spline basis densities:
f(x_t | s_t = i) = sum_{k=-K}^{K} a_{i,k} phi_k(x_t)
transform constrained parameters a_{i,-K}, ..., a_{i,K}:
a_{i,k} = exp(beta_{i,k}) / sum_{j=-K}^{K} exp(beta_{i,j}), with beta_{i,0} = 0
numerically maximize the penalized log-likelihood:
l_p(theta, lambda) = log( L(theta) ) - sum_{i=1}^{N} (lambda_i / 2) sum_{k=-K+2}^{K} (Delta^2 a_{i,k})^2
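The two ingredients on this slide, the softmax transform and the second-order difference penalty, can be sketched directly for one state i. K and the beta values below are illustrative; the basis densities phi_k themselves (standardized B-splines) are not constructed here:

```python
import numpy as np

K = 3                                   # basis index runs over k = -K, ..., K
rng = np.random.default_rng(1)
beta = rng.normal(size=2 * K + 1)       # unconstrained parameters beta_{i,k}
beta[K] = 0.0                           # identifiability: beta_{i,0} = 0
                                        # (index K is k = 0 in -K..K ordering)

# softmax transform: weights a_{i,k} are positive and sum to one
a = np.exp(beta) / np.exp(beta).sum()

# second-order difference penalty: (lambda_i / 2) * sum_k (Delta^2 a_{i,k})^2
lam = 10.0
penalty = 0.5 * lam * np.sum(np.diff(a, n=2) ** 2)
```

The penalty shrinks toward weight sequences that vary smoothly in k, which is what keeps the estimated densities from overfitting when K is chosen generously.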
Inference
identifiability holds under fairly weak conditions (essentially there needs to be serial correlation)
generalized cross-validation or AIC-type statistic for (i) choosing lambda from N-dimensional grid, (ii) model selection on the number of states
parameter estimation by numerical maximization of l_p(theta, lambda)
local maxima can be an issue: use many different initial values in the maximization
uncertainty quantification via parametric bootstrap
model checking via pseudo-residuals (standard)
state decoding using Viterbi (standard)
A simple simulation experiment
simulate T = 800 observations from 2-state HMM with
Gamma = ( 0.9  0.1 ; 0.1  0.9 )
Figure: true densities of the state-dependent distributions.
A simple simulation experiment
simulate T = 800 observations from 2-state HMM with
Gamma = ( 0.9  0.1 ; 0.1  0.9 )
Figure: marginal distribution of the observations.
A simple simulation experiment
simulate T = 800 observations from 2-state HMM with
Gamma = ( 0.9  0.1 ; 0.1  0.9 )
K = 15, thus 2K + 1 = 31 B-spline basis functions
Figure: true (black) and estimated densities of the state-dependent distributions; lambdas about right.
A simple simulation experiment
simulate T = 800 observations from 2-state HMM with
Gamma = ( 0.9  0.1 ; 0.1  0.9 )
K = 15, thus 2K + 1 = 31 B-spline basis functions
Figure: true (black) and estimated densities of the state-dependent distributions; lambdas too big.
A simple simulation experiment
simulate T = 800 observations from 2-state HMM with
Gamma = ( 0.9  0.1 ; 0.1  0.9 )
K = 15, thus 2K + 1 = 31 B-spline basis functions
Figure: true (black) and estimated densities of the state-dependent distributions; lambdas too small.
Blainville's beaked whale: dive data
Figure: observed time series of log(depth displacement in meters); histogram of the observations; sample ACF.
Blainville's beaked whale: parametric HMMs
Table: Results of fitting HMMs with normal state-dependent distributions.

#states     p      AIC       BIC
   3       12   9784.00   9855.59
   4       20   9498.16   9617.47
   5       30   9400.30   9579.27
   6       42   9294.88   9545.43
   7       56   9208.04   9542.11
   8       72   9129.15   9558.67
   9       90   9090.98   9627.87
  10      110   9064.53   9720.74
Blainville's beaked whale: parametric HMM, N = 7
Figure: fitted state-dependent distributions (states 1-7 and marginal); qq plot of residuals against standard normal; sample ACF for series of residuals.
Blainville's beaked whale: parametric HMM, N = 3
Figure: fitted state-dependent distributions (3-state parametric HMM; states 1-3 and marginal); qq plot of residuals against standard normal; sample ACF for series of residuals.
Blainville's beaked whale: nonparametric HMM with N = 3
Figure: fitted state-dependent distributions (3-state nonparametric HMM; states 1-3 and marginal); qq plot of residuals against standard normal; sample ACF for series of residuals.
Blainville's beaked whale: Viterbi for nonparametric HMM with N = 3
Figure: depths in meters and log(depth displacement in meters) with decoded states (states 1-3) over time in hours.
Markov-switching generalized additive models
Markov-switching regression: a basic model
A simple Markov-switching (linear) regression model:
Y_t = beta_0^{(s_t)} + beta_1^{(s_t)} x_t + sigma_{s_t} eps_t,
with
a time series {Y_t}_{t=1,...,T}
associated covariates x_1, ..., x_T (including the possibility of x_t = y_{t-1})
eps_t iid ~ N(0, 1)
s_t: state at time t of an unobservable N-state Markov chain
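A minimal simulation sketch of this basic model (the 2-state setup and all parameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
T = 200
Gamma = np.array([[0.95, 0.05], [0.05, 0.95]])          # t.p.m. of the state chain
beta0 = np.array([1.0, -1.0])                           # state-dependent intercepts
beta1 = np.array([0.5, 2.0])                            # state-dependent slopes
sigma = np.array([0.3, 0.6])                            # state-dependent error sd's
x = rng.uniform(-1, 1, size=T)                          # covariate values

# simulate the hidden state sequence s_t
s = np.zeros(T, dtype=int)
for t in range(1, T):
    s[t] = rng.choice(2, p=Gamma[s[t - 1]])

# Y_t = beta_0^{(s_t)} + beta_1^{(s_t)} x_t + sigma_{s_t} * eps_t
y = beta0[s] + beta1[s] * x + sigma[s] * rng.normal(size=T)
```

Plotting y against x for such data shows two superimposed regression lines, one per state, which is the pattern the switching model is built to capture.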
Markov-switching regression: remarks on the basic model
commonly used in economics to deal with parameter instability over time (key references: Goldfeld and Quandt, 1973; Hamilton, 1989)
linear form of the predictor is usually assumed, with little investigation (if any!) into the absolute or relative goodness of fit
we consider nonparametric methods for estimating the form of the predictor (in analogy to the extension of linear models to GAMs)
Markov-switching regression: more general model formulation
More general model formulation:
g( E(Y_t | s_t, x_t) ) = eta^{(s_t)}(x_t) =: mu_t^{(s_t)},
where
Y_t follows some distribution from the exponential family
x_t = (x_{1t}, ..., x_{Pt}) is the covariate vector at time t
g is a suitable link function
eta^{(s_t)} is the predictor function given state s_t (the form of which we do not yet specify)
(phi^{(s_t)}: any additional state-dependent dispersion parameters)
Likelihood evaluation using the forward recursion
Define, analogously as for HMMs, the forward variable
alpha_t(j) = f(y_1, ..., y_t, S_t = j | x_1, ..., x_t).
Then the following recursive scheme can be applied:
alpha_1 = delta Q(y_1), alpha_t = alpha_{t-1} Gamma Q(y_t)  (t = 2, ..., T)
where Q(y_t) = diag( p_Y(y_t; mu_t^{(1)}, phi^{(1)}), ..., p_Y(y_t; mu_t^{(N)}, phi^{(N)}) )
L(theta) = sum_{j=1}^{N} alpha_T(j) = delta Q(y_1) Gamma Q(y_2) ... Gamma Q(y_T) 1
This form applies for any form of the conditional density p_Y(y_t; mu_t^{(s_t)}, phi^{(s_t)}).
Nonparametric modelling of the predictor
here we consider a GAM-type framework:
eta^{(s_t)}(x_t) = beta_0^{(s_t)} + f_1^{(s_t)}(x_{1t}) + f_2^{(s_t)}(x_{2t}) + ... + f_P^{(s_t)}(x_{Pt})
we represent each f_p^{(i)} as a linear combination of B-spline basis functions:
f_p^{(i)}(x) = sum_{k=1}^{K} gamma_{ipk} B_k(x)
... and numerically maximize the penalized log-likelihood:
l_p(theta, lambda) = log( L(theta) ) - sum_{i=1}^{N} sum_{p=1}^{P} (lambda_ip / 2) sum_{k=3}^{K} (Delta^2 gamma_{ipk})^2
inference analogous as for nonparametric HMMs
notably, parametric models are nested special cases (for lambda -> infinity)
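The nesting remark can be illustrated numerically: with a second-order difference penalty, lambda -> infinity forces Delta^2 gamma_k = 0, i.e. coefficient sequences linear in k, so the smooth collapses to a parametric (linear) function. A toy penalized least-squares sketch, not the actual MS-GAM fitting procedure, with all values illustrative:

```python
import numpy as np

K = 12
rng = np.random.default_rng(0)
y = np.sin(np.linspace(0, 3, K)) + 0.1 * rng.normal(size=K)   # "data" to smooth

# D: second-order difference matrix, (D @ g)_k = g_{k+2} - 2 g_{k+1} + g_k
D = np.diff(np.eye(K), n=2, axis=0)

def pen_fit(lam):
    """Minimize ||y - g||^2 + lam * ||D g||^2; closed-form ridge solution."""
    return np.linalg.solve(np.eye(K) + lam * D.T @ D, y)

g_rough = pen_fit(0.0)     # no penalty: interpolates y exactly
g_linear = pen_fit(1e8)    # huge penalty: g nearly linear in k
```

Intermediate lambda values interpolate between these two extremes, which is exactly the role the smoothing parameters lambda_ip play in the penalized log-likelihood above.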
A simple simulation experiment
simulate T = 300 observations from 2-state Markov-switching regression model:
Y_t ~ Poisson( exp( beta_0 + f^{(s_t)}(x_t) ) ),  Gamma = ( 0.9  0.1 ; 0.1  0.9 )
Figure: true functions f^{(s_t)}(x_t) for s_t = 1 (state 1) and s_t = 2 (state 2).
A simple simulation experiment
simulate T = 300 observations from 2-state Markov-switching regression model:
Y_t ~ Poisson( exp( beta_0 + f^{(s_t)}(x_t) ) ),  Gamma = ( 0.9  0.1 ; 0.1  0.9 )
K = 15, thus 2K + 1 = 31 B-spline basis functions
smoothing parameter selection from a grid using AIC-type statistic
Figure: true and estimated functions f^{(s_t)}(x_t) for s_t = 1 (state 1) and s_t = 2 (state 2).
A simple simulation experiment
simulate T = 300 observations from 2-state Markov-switching regression model:
Y_t ~ Poisson( exp( beta_0 + f^{(s_t)}(x_t) ) ),  Gamma = ( 0.6  0.4 ; 0.4  0.6 )
K = 15, thus 2K + 1 = 31 B-spline basis functions
smoothing parameter selection from a grid using AIC-type statistic
Figure: true and estimated functions f^{(s_t)}(x_t) for s_t = 1 (state 1) and s_t = 2 (state 2).
Example: Lydia Pinkham sales
Figure: annual sales (in million USD), 1910-1960.
Example: Lydia Pinkham sales
Model MS-LIN: sales_t = beta_0^{(s_t)} + beta_1^{(s_t)} advertising_t + beta_2^{(s_t)} sales_{t-1} + sigma_{s_t} eps_t
Model MS-GAM: sales_t = beta_0^{(s_t)} + f^{(s_t)}(advertising_t) + beta_1^{(s_t)} sales_{t-1} + sigma_{s_t} eps_t
Figure: Estimated state-dependent mean sales as functions of advertising expenditure (state 1 in green, state 2 in red), for MS-LIN and MS-GAM. Displayed are the predictor values when fixing the regressor sales_{t-1} at its overall mean, 1.84.
Example: Lydia Pinkham sales
Figure: Sales figures (1910-1960) and decoded states underlying the MS-GAM model.
Example: Lydia Pinkham sales
Figure: sample ACF and qq plots of the residuals, for MS-LIN (top) and MS-GAM (bottom).
Concluding remarks
Concluding remarks
bringing together HMMs & P-splines gives lots of modelling options
while inference is slightly more involved, resulting models often substantially increase the goodness of fit, and may in fact be more parsimonious than parametric alternatives
various other such models can be formulated (and fitted), e.g. MS-GAMLSS models; but does anyone need this kind of thing??
we're currently working on alternative, less computer-intensive methods for selecting the smoothing parameters
References
Langrock, R., Kneib, T., Sohn, A., DeRuiter, S. (2015). Nonparametric inference in hidden Markov models using P-splines. Biometrics.
Langrock, R., Glennie, R., Kneib, T., Michelot, T. (2016). Markov-switching generalized additive models. Statistics and Computing.
Thank you!