Chapter 2 Class Notes Words and Probability

Size: px

Start display at page:

Download "Chapter 2 Class Notes Words and Probability"

Laurence Spencer
5 years ago
Views:

1 Chapter 2 Class Notes Words and Probability Medical/Genetics Illustration reference Bojesen et al (2003), Integrin 3 Leu33Pro Homozygosity and Risk of Cancer, J. NCI. Women only 2 x 2 table: Stratification (Type) Outcome Status With Cancer Without Cancer Total Non-carriers 501 (14.4%) Homozygotes 29 (21.5%) Total The variables here are X = stratification (either non-carrier or homozygote group) Y = cancer status at end of study. Both variables here are nominal (and therefore qualitative). 1 P a g e

2 For a review of basic probability (STAT-335), see Chap3 notes at Given a single-strand sequence such as (another is on p.37): ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCC CCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGC CTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGG AAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCC CTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAG TTTAATTACAGACCTGAA Summarize/analyze using statistics: is this a coding region? Does this segment differ from other regions? What is its function? Words of length: 1 are 1-tuples (e.g., nucleotides or purines/pyrimidines) 2 are 2-tuples (e.g., di-nucleotides) 3 are 3-tuples or triples (e.g., codons) etc. DNA sequences from different sources/regions of a genome may be distinguished from each other by their k-tuple content Base Composition (k=1): Consider for a moment the duplex DNA and residues A, C, G and T. Notice that fr(a) = fr(t), fr(c) = fr(g); since fr(a) + fr(c) + fr(g) + fr(t) = 1, it follows that fr(a+t) = 1- fr(c+g). So, you only need to know fr(c+g). These are given for different organisms on p.39: these percentages range from 31.6% for Mycoplasma genitalium to 66.4% for Pseudomonas aeruginosa PAO1 (bacteria). Asymmetries can be detected especially on the leading strand using the 2 P a g e

3 2.3. Introduction to Probability (review) A discrete random variable (RV) X takes on values with respective probabilities and such that. Associating the probabilities with the realized values of X either in a table, formula or graph is the probability mass function of X. An example could be for nucleotides with ; we can then derive a second RV which is only counts A s: it is 1 if the nucleotide is A and 0 otherwise; this is called a Bernoulli RV. Now, consider a series of Bernoulli RVs,,, corresponding to positions 1, 2 n; again, each of these is 1 if the respective nucleotide is A and 0 otherwise. For each of these, ; also, ( ). Then is a RV that counts the number of A s in the sequence of length n. Next, we ll make a big assumption (for the moment): that what occurs at one position is independent of what occurs elsewhere (see STAT-335 notes and p ). In the case of independence, probabilities of intersections (AND) amount to simply multiplying the unconditional probabilities Expected Values and Variances: the expected value (EV) of RV X is just its mean:. Thus, for the above. Also, since, (dropping the subscript for convenience). Next, variance is defined as [ ] ; the short-cut 3 P a g e

4 formula is then. So, by the short-cut formula. A final snippet: since we re assuming the s are independent and since the variance of a sum of independent RVs is the sum of the individual variances, (again dropping the subscript) The Binomial Distribution: by now, you have noticed that is a Binomial RV with parameters and. Hence, 2.4. Simulating from Probability Distributions: to understand the behavior (e.g. distribution) of a RV such as, simulation can be very helpful. To do so, we d generate a large number (m) of :. We could then calculate the sample mean ( ), sample variance ( ( ) ), and histogram to get an idea of,, and the distribution of. To do this, pseudo-random number generators are used; these generators simulate uniform RVs on the interval, and then convert. To illustrate, it s easy to simulate from distribution (a) for A versus rest, and (b) for all four nucleotides using Minitab and R. In R, we use the sample command: pi<-c(0.25,0.75) x<-c(1,0) seq1<-sample(x,10000,replace=t,pi) seq1[1:30] 4 P a g e

5 For the nucleotide problem: pi<-c(0.25,0.25,0.25,0.25) x<-c(1,2,3,4) seq<-sample(x,10000,replace=t,pi) seq[1:30] hist(seq) To simulate from the Binomial ( ) distribution we use the rbinom command: x<-rbinom(10000,1000,0.25) x[1:10] mean(x) sd(x)^2 hist(x,xlab="number of successes",main="binomial Simulation") Now, suppose in a sequence of length, we observe A s, and we want to know the associated p-value in a test where the alternative hypothesis is ; then we wish to find the probability of at least 280 successes in this binomial experiment. Ways to find this are (a) direct calculation from equation (2.20) on p.48 ( ), (b) approximation using the CLT ( ( ) ), (c) using Minitab or calculator, or (d) simulation (see p.63 exercise 4) Biological Words (k=2): let denote the nucleotide at position and so is the dinucleotide starting at the same position (reading of course in the to direction); can take on values A,C,G,T, and can equal AA,AC,AG,AT,CA,CC, 5 P a g e

6 CG,CT,GA,GC,GG,GT,TA,TC,TG,TT. Dinucleotides are important in part because physical parameters associated with them can describe the trajectory of the DNA helix through space (DNA bending), which may have effects on gene expression. Next, with and, we can test for independence, for all pairs, by using need to calculate with. First, we { Then, the relevant test statistic is, so for example, if then we conclude that the model does not fit for this dinucleotide pair at the 5% level of significance. In practice, frequencies from the data fr(a), fr(c), fr(g), fr(t) are used to estimate the probabilities. Empirical results are given in Table 2.2 on p.50 for E. coli and Mycoplasma genitalium; clearly, the first 1000 bp for E. coli deviate from the iid model. 6 P a g e

7 2.6. Introduction to Markov Chains: As seen by Table 2.2, the iid model may not be rich enough, and a richer one may be needed one approach is to use a Markov chain. If in a nucleotide at position we observe a C, then the nucleotide at the next position may depend upon this information, and we are led to conditional probabilities; often we find that in this case the probability that the next position is G is less than its being an A, C, or T Conditional Probability (review): in the introductory notes, we saw that the probability of events A and B (intersection) occurring is and the probability of events A or B (union) occurring is. Then, the conditional probability of event A given that B has occurred is: Similarly,. An immediate consequence of this is the multiplication rule:. Bayes Theorem follows directly: Also, let for a partition of (so that all are disjoint for and exhaustive ), then, so (*) can be rewritten The Markov Property: the sequence { } is called a first-order Markov chain if the probability of finding a particular character at position given the preceding characters at positions is identical to the probability 7 P a g e

8 of observing the character at position given the character state of the immediate preceding position,. Formally, Markov chains of order 2 correspond to the case where the conditional probability of the present position depends on the previous 2 positions, and so on. We also consider only Markov chains that are homogeneous: the above property doesn t depend upon (position along the nucleotide or segment). Next, let be the one-step transition probabilities for ; for the nucleotide problem, {A,C,G, T}. For the nucleotide problem, these probabilities are given in the one-step transition matrix: [ ] Each row sums to one, so ; this follows since after each position there must be one character from next in the sequence, and can be written in the nucleotide case as [ ] Next, we have to indicate how the Markov chain begins; for this, we let the initial probability distribution have elements 8 P a g e

9 Also, at time, we have ; in matrix terms:. In the nucleotide case, the right hand side is [ ] [ ] The result is also a matrix (vector), and the first element is On p.53 (text), the authors show that the transition matrix from time to time is given by, and so on, so that means that. A stationary distribution for the chain is when, so in matrix form: A Markov Chain Simulation: following is the dinucleotide frequencies of M. genitalium: A C G T A C G T [ ] Summing across the rows, we obtain empirical estimates, and then dividing through (e.g. ), we get 9 P a g e

10 A C G T A C G T [ ] We can also assume that the initial distribution is (as above) Now we can simulate (see Computational Example 2.3 on p.55): markov1 <- function(x,pi,p,n) { mg <- rep(0,n) mg[1] <- sample(x,1,replace=true,pi) for(k in 1:(n-1)) { mg[k+1]<-sample(x,1,replace=t,p[mg[k],]) } return(mg) } x<-c(1:4) pi<-c(0.342,0.158,0.158,0.342) P<-matrix(scan(),ncol=4,nrow=4,byrow=T) tmp<-markov1(x,pi,p,50000) length(tmp[tmp[]==1])/length(tmp) [1] length(tmp[tmp[]==2])/length(tmp) [1] P a g e

11 length(tmp[tmp[]==3])/length(tmp) [1] length(tmp[tmp[]==4])/length(tmp) [1] Also, we used the above to simulation the dinucleotide distribution (not shown); since it matches the original, we conclude as in the text (p.57) that the Markov model provides a good probabilistic description of the data for M. genitalium DNA Biological Words with k=3 (Codons): We ll discuss two measures here relating codons to amino acids with an eye to distinguishing coding from non-coding genetic regions. First, we ll compare expressed codon frequencies to those which would be expressed under independence. To illustrate using E. coli, from Table 2.1 (p.39),, so under independence, and. These are the only two codons coding for amino acid Phe (phenylaline), so under independence we expect the proportion to be roughly. Similar results are given in the Predicted column in Table 2.3, and these results are compared with the empirical results for two gene classes (I and II). Ménigue et al then noted when there were big differences and investigated and found that Class II genes were largely those such as ribosomal proteins or translation factors genes expressed at high levels whereas Class I genes were mostly those that are expressed at moderate levels. 11 P a g e

12 Another measure is the (codon adaptation index): [ ] It is the geometric mean of the ratios of the probabilities for the codons actually used to the probability of the codons most frequently used in highly expressed genes. It is illustrated on p.59 again using E. coli. In E. coli, a sample of 500 protein-coding genes displayed CAI values in the range from 0.2 to Further, there is a correlation between the CAI and mrna levels. In other words, the CAI for a gene sequence in genomic DNA provides a first approximation of its expression level: if the CAI is relatively large, then we would predict that the expression level is also large. 12 P a g e

13 13 P a g e

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

582746 Modelling and Analysis in Bioinformatics Lecture 1: Genomic k-mer Statistics Juha Kärkkäinen 06.09.2016 Outline Course introduction Genomic k-mers 1-Mers 2-Mers 3-Mers k-mers for Larger k Outline