Stochastic processes and

Size: px

Start display at page:

Download "Stochastic processes and"

Merilyn Mitchell
6 years ago
Views:

Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.

1 Stochastic processes and Markov chains (part II) Wessel van Wieringen nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University Amsterdam, The Netherlands

2 Example

3 Example DNA copy number of a genomic segment is simply the number of copies of that segment present in the cell under study. Healthy normal cell: chr 1 : 2 chr 22 : 2 chr X : 1 or 2 chr Y : 0 or 1

4 Example Chromosomes of a tumor cell Technique: SKY

5 Example The DNA copy number is often categorized into: L : loss : < 2 copies N : normal : 2 copies G : gain : > 2 copies In cancer: The number of DNA copy number aberrations accumulates with the progression of the disease. DNA copy number aberrations are believed to be irreversible. Let us model the accumulation process of DNA copy number aberrations.

6 Example State diagram for the accumulation process of a locus. G G G G N N N N N L L L L t 0 t 1 t 2 t 3 time

7 Example The associated initial distribution: and, associated transition matrix: with parameter constraints:

8 Example Calculate the probability of a loss, normal and gain at this locus after p generations: Using: These probabilities simplify to, e.g.:

9 Example In practice, a sample is only observed once the cancer has already developed. Hence, the number of generations p is unknown. This may be accommodated by modeling p as being Poisson distributed: This yields, e.g.:

10 Example So far, we only considered one locus. Hence: DNA i DNA i DNA i DNA i t 0 t 1 t 2 t 3

11 Example For multiple loci: DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA i DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA DNA t 0 t 1 t 2 t 3

12 Example Multiple loci multivariate problem. Complications: p unknown, loci not independent. Solution: p random, assume particular dependence structure. After likelihood formulation and parameter estimation: After likelihood formulation and parameter estimation: identify most aberrated loci, reconstruct time of onset of cancer.

13 Stationary distribution

14 Stationary distribution We generated a DNA sequence of bases long in accordance with a 1 st order Markov chain. For stretches DNA ever longer and ever farther away from the first base pair we calculated l the nucleotide %. bp bp %A %C %G %T e e e e e rgence e conve

15 Stationary distribution Hence, after a while an equilibrium sets in. Not necessarily a fixed state or pattern, but: the proportion of a time period that is spent in a particular state t converges to a limit it value. The limit values of all states form the stationary The limit values of all states form the stationary distribution of the Markov process, denoted by:

16 Stationary distribution For a stationary process, it holds that P(X t =E i ) = φ i for all t and i. In particular: P(X t =EE i ) = φ i = P(X t+1 =EE i ). Of course, this does not imply: P(X t =E i, X t+1 =E i )= φ i φ i as this ignores the 1 st order Markov dependency. d

17 Stationary distribution The stationary distribution is associated with the first- order Markov process, parameterized by (π, P). Question How do φ and (π, P) relate? Hereto, recall: φ i = P(X t =E i )

18 Stationary distribution We then obtain the following relation: definition

19 Stationary distribution We then obtain the following relation: use the fact that P(A, B) +P(A P(A, B C )=P(A)

20 Stationary distribution We then obtain the following relation: use the definition of conditional probability: P(A, B) = P(A B) P(B)

21 Stationary distribution We then obtain the following relation:

22 Stationary distribution We then obtain the following relation: Thus: Eigenvectors!

23 Stationary distribution Theorem Irreducible, aperiodic Markov chains with a finite it state t space S have a stationary distribution to which the chain converges as t. A C G T φ A =0.1, 01 φ C =0.2, 02 φ G =0.3, 03 φ T =0.4 04

24 Stationary distribution A Markov chain is aperiodic if there is no state that can only be reached at multiples of a certain period. E.g., state E i only at t = 0, 3, 6, 9, et cetera. Example of an aperiodic Markov chain A C G T p AC A A C p CA C p GC p GC G p GA p AG p TC T ATCGATCGATCGATCG p AA p GG G p GT p TG T p CT p CC p TT

25 Stationary distribution Example of a periodic Markov chain A C G T A C G T A p AT =1 p GA =1 C p CG =1 p TC =1 ATCGATCGATCGATCG G T The fraction of time spent in A (roughly φ A ): P(X t+1000 = A) ) = ¼ whereas P(X t+1000 = A X 4 = A) = 1.

26 Stationary distribution A Markov chain is irreducible if every state can (in principle) be reached (after enough time has passed) from every other state. Examples of a reducible Markov chain A C A C G T G T C is an absorbing state. A will not be reached.

27 Stationary distribution How do we find the stationary distribution? We know the stationary distribution satisfies: (*) and We thus have S+1 equations for S unknowns: one of the equations in (*) can be dropped (which is irrelevant), and the system of S remaining equations needs to be solved.

28 Stationary distribution Consider the transition matrix: In order to find the stationary distribution we need to solve the following system of equations: This yields: (φ 1, φ 2 ) T = ( , ) T

29 Stationary distribution On the other hand, for n large: P(X t+n =E j X t =E i ) = φ j is independent of i. Or,p (n) ij = φ j. Hence, the n-step transition matrix P (n) has identical rows: This motivates a numerical way to find the stationary This motivates a numerical way to find the stationary distribution.

30 Stationary distribution Same example as before: with stationary distribution: (φ 1, φ 2 ) T = ( , ) T Then: matrix multiplication ( rows times columns )

31 Stationary distribution Thus: In similar fashion we obtain:

32 Convergence of the stationary distribution

33 Convergence to the stationary distribution Define the vector 1 = (1,,1) T. We have already seen: P n = 1 φ T for large n Question How fast does P n go to 1 φ T as n? Answer 1) Use linear algebra 2) Calculate numerically

34 Convergence to the stationary distribution Fact The transition matrix P of a finite, it aperiodic, irreducible ibl Markov chain has an eigenvalue equal to 1 (λ 1 =1), while all other eigenvalues are (in the absolute sense) smaller than one: λ k < 1, k=2,,3. Focus on λ 1 =1 1 We know φ T P = φ T for the stationary distribution. Hence, φ is the left eigenvector of eigenvalue λ=1. Also, row sums of P equal 1: P 1= 1. Hence, 1 is a right eigenvector of eigenvalue λ=1.

35 Convergence to the stationary distribution The spectral decomposition of a square matrix P is given by: where: diagonal matrix containing the eigenvalues, columns contain the corresponding eigenvectors. In case of P is symmetric, Then: is orthogonal:

36 Convergence to the stationary distribution The spectral decomposition of P, reformulated: eigenvalues right eigenvectors left eigenvectors The eigenvectors are normalized:

37 Convergence to the stationary distribution We now obtain the spectral decomposition of the n- step transition matrix P n. Hereto observe that: plug in the spectral decomposition of P

38 Convergence to the stationary distribution We now obtain the spectral decomposition of the n- step transition matrix P n. Hereto observe that: bring the right eigenvector in the sum, and use the properties of the transpose operator

39 Convergence to the stationary distribution We now obtain the spectral decomposition of the n- step transition matrix P n. Hereto observe that: the eigenvectors are normalized

40 Convergence to the stationary distribution We now obtain the spectral decomposition of the n- step transition matrix P n. Hereto observe that:

41 Convergence to the stationary distribution Repeating this argument n times yields: Hence, and are left and right eigenvector with eigenvalue of. Thus:

42 Convergence to the stationary distribution Verify the spectral decomposition for P 2 :

43 Convergence to the stationary distribution Use the spectral decomposition of P n to show how fast P n converges to 1 φ T as n. We know: λ 1 =1, λ k < 1 for k=2,,s, and Then:

44 Convergence to the stationary distribution Expanding this: as n 0 0

45 Convergence to the stationary distribution Clearly: Furthermore, as: It is the second largest (in absolute sense) eigenvalue that dominates, and thus determines the convergence speed to the stationary distribution.

46 Convergence to the stationary distribution Fact AM Markov chain with a symmetric ti P has a uniform stationary distribution. Proof Symmetry of P implies that left- and right eigenvectors are identical (up to a constant). First right eigenvector corresponds vector of ones, 1. Hence, the left eigenvector equals c1. The left eigenvector is the stationary distribution and should sum to one: c = 1 / (number of states).

47 Convergence to the stationary distribution Question Suppose the DNA may reasonably described d by a first order Markov model with transition matrix P: and stationary distribution: and eigenvalues:

48 Convergence to the stationary distribution Question What is the probability of a G at position 2, 3, 4, 5, 10? And how does this depend on the starting nucleotide? In other words, give: But also: P(X 2 = G X 1 = A) ) = P(X 2 = G X 1 = C) = P(X 2 = G X 1 = G) = P(X 2 = G X 1 = T) = P(X 3 = G X 1 = A) = et cetera.

49 Convergence to the stationary distribution Thus, calculate P(X t = G X 1 = x 1 ) with t=2, 3, 4, 5, 10, and x 1 = A, C, G, T. x1=a x1=c x1=g x1=t t= t= t= t= t= Study the influence of the first nucleotide on the calculated probability for increasing t.

50 Convergence to the stationary distribution > # define π and P > pi <- matrix(c(1, 0, 0, 0), ncol=1) > P <- matrix(c(2, 3, 3, 2, 1, 4, 4, 1, 3, 2, 2, 3, 4, 1, 1, 4), ncol=4, byrow=true)/10 > # define function that calculates the powers of a > # matrix (inefficiently though) > matrixpower <- function(x, power){ Xpower <- X for (i in 2:power){ Xpower <- Xpower %*% X } return(xpower) } > # calculate P to the power 100 > matrixpower(p, 100)

51 Convergence to the stationary distribution Question Suppose the DNA may reasonably described d by a first order Markov model with transition matrix P: and stationary distribution: and eigenvalues:

52 Convergence to the stationary distribution Again calculate P(X t = G X 1 = x 1 ) with t=2, 3, 4, 5, 10, and x 1 = A, C, G, T. x1=a x1=c x1=g x1=t t= t= t= t= t= t= Now the influence of the first nucleotide fades slowly. This can be explained by the large 2 nd eigenvalue.

53 Processes back in time

54 Processes back in time So far, we have studied Markov chains forward in time. In practice, we may wish to study processes back in time. Example Evolutionary models that describe occurrence of SNPs in DNA sequences. We aim to attribute two DNA sequences to a common ancestor. human chimp common ancestor

55 Processes back in time Consider a Markov chain {X t } t=1,2,. The reverse Markov t t,, chain {X r *} r=1,2, is then defined by: X r * = X t-r X 2 * X 1 * X 0 * X -1 * X t-2 X t-1 X t X t+1 With transition probabilities: p ij = P(X t =E j X t-1 =E i ) p ij *=P(X r *=E j X r-1 *=E i )

56 Processes back in time Show how the transition probabilities p ji * relate to p ij : just the definition

57 Processes back in time Show how the transition probabilities p ji * relate to p ij : express this in terms of the original Markov chain express this in terms of the original Markov chain using that X r * = X t-r

58 Processes back in time Show how the transition probabilities p ji * relate to p ij : apply definition of conditional probability twice (Bayes): apply definition of conditional probability twice (Bayes): P(A B) = P(B A) P(A) / P(B)

59 Processes back in time Show how the transition probabilities p ji * relate to p ij : Hence:

60 Processes back in time Check that rows of the transition matrix P* sum to one, i.e.: p i1 * + p i2 * + + p js * = 1 Hereto: use the fact that

61 Processes back in time The two Markov chains defined by P and P* have the same stationary distribution. Indeed, as: we have:

62 Processes back in time Definition A Markov chain is called reversible if p ij * = p ij. In that case: p ij * = p ij = p ji φ j / φ i Or, φ i p ij = φ j p ji for all i and j. These are the so-called detailed balance equations. Theorem A Markov chain is reversible if and only if the detailed balance equations hold.

63 Processes back in time Example 1 The 1 st order Markov chain with transition matrix: is irreversible. Check that this (deterministic) Markov chain does not satisfy the detailed balance equations. Irreversibility can be seen from a sample of this chain: ABCABCABCABCABCABC In the reverse direction transitions from B to C do not occur!

64 Processes back in time Example 2 The 1 st order Markov chain with transition matrix: is irreversible. Again, a uniform stationary distribution: As P is not symmetric, the detailed balance equations are not satisfies: p ij /3 p ji / 3 for all i and j.

65 Processes back in time Example 2 (continued) The irreversibility of this chain implies: P(A B C A) B C A A A A = P(A B) P(B C) P(C A) B B B B = 08* * * 0.1 * 0.1 = P(A C) P(C B) P(B A) = P(A C B A). It matters how one walks from A to A. C C C C A A A A B B B B C C C C Or, it matters whether one walks forward or backward.

66 Processes back in time Kolmogorov condition for reversibility A stationary Markov chain is reversible if and only if any path from state E i to state E i has the same probability as the path in the opposite direction. Or, the Markov chain is reversible if and only if: for all i, i 1, i 2,, i k. 1 2 k Eg: E.g.: P(A B C A) = P(A C B A). A A A A A A A A B B B B B B B B C C C C C C C C

67 Processes back in time Interpretation For a reversible Markov chain it is not possible to determine the direction of the process from the observed state sequence alone. Molecular l phylogenetics aims to reconstruct t evolutionary relationships between present day species from their DNA sequences. Reversibility is then an essential assumption. Genes are transcribed in one direction only (from the 3 end to the 5 end). The promoter is only on the 3 end. This suggests irreversibility. e

68 Processes back in time E. Coli For a gene in the E. Coli genome, we estimate: Transition matrix [,1] [,2] [,3] [,4] [1,] [2,] [3,] [4,] Stationary distribution [1] Then, the detailed balance equation do not hold, e.g.: π 1 p 12 π 2 p 21.

69 Processes back in time Note Within evolution theory the notion of irreversibility refers to the presumption that complex organisms once lost evolution will not appear in the same form. Indeed, the likelihood of reconstructing a particular phylogenic system is infinitesimal small.

70 Application: motifs

71 Application: motifs Study the sequence of the promotor region up-stream of a gene. gene DNA upstream promotor region This region contains binding sites for the transcription factors that regulate the transcription of the gene. gene DNA

72 Application: motifs The binding sites of a transciption factor (that may regulate multiple genes) share certain sequence patterns, motifs. Not all transcription factors and motifs are known. Hence, a high occurrence of a particular sequence pattern in the upstream regions of a gene may indicate that it has a regulatory function (e.g., binding site). Problem Determine the probability of observing m motifs in a background generated by a 1 st order stationary Markov chain.

73 Application: motifs An h-letter word W = w 1 w 2 w h is a map from {1,, h} to, where some non-empty set, called the alphabet. In the DNA example: and, e.g.:

74 Application: motifs A word W is p-periodic if The lag between two overlapping occurrences of the word. The set of all periods of W (less than h) is the period set, p ( ) p, denoted by. In other words, the set of integers 0 < p < h such that a new occurrence of W can start p letters after an occurrence of W.

75 Application: motifs If W 1 = CGATCGATCG, then For: CGATCGATC CGATCGATCGATC CGATCGATCGATCGATC If W 2 = CGAACTG, then

76 Application: motifs Let N(W) be the number of (overlapping) occurrences of an h-letter word W in a random sequence n on A. If Y t is the random variable defined by: then

77 Application: motifs Also, denote the number of occurences in W by: and If W = CATGA, then: n(a) = 2 and n(a ) = 1

78 Application: motifs Assume a stationary 1 st order Markov model for the random sequence of length n. The probability of an occurrence of W in the random sequence is given by:

79 Application: motifs In the 1 st order Markov model, the expected number of occurrences of W is approximated by: and its variance by:

80 Application: motifs To find words with exceptionally frequency in the DNA, the following (asymptotically) standard normal statistic is used: The p-value of word W is then given by:

81 Application: motifs Note Robin, Daudon (1999) provide exact probabilities of word occurences in random sequences. However, Robin, Schbath (2001) point out that calculation of the exact probabilities is computationally intensive. Hence, the use of an approximation here.

82 References & further reading

83 References and further reading Ewens, W.J, Grant, G (2006), Statistical Methods for Bioinformatics, Springer, New York. Reinert, G., Schbath, S., Waterman, M.S. (2000), Probabilistic and statistical properties of words: an overview, Journal of Computational Biology, 7, Robin, S., Daudin, J.-J. (1999), Exact distribution of word occurrences in a random sequence of letters, Journal of Applied Probability, 36, Robin, S., Schbath, S. (2001), Numerical comparison of several approximations of the word count distribution ib ti in random sequences, Journal of Computational Biology, 8(4), Schbath, S. (2000), An overview on the distribution of word counts in Markov chains, Journal of Computational. Biology, 7, Schbath, S., Robin, R. (2009), How can pattern statistics be useful for DNA motif discovery?. In Scan Statistics: ti ti Methods and Applications by Glaz, J. et al. (eds.).

84 This material is provided under the Creative Commons Attribution/Share-Alike/Non-Commercial License. See for details.

Stochastic processes and Markov chains (part II)

Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University Amsterdam, The