Week 2
Statistics for bioinformatics and eScience
Line Skotte, 20 November 2008

2.5.1-5) Revisited. When solving these exercises, some of you tried to capture a whole open reading frame by pattern matching with a regular expression. I was very impressed by your solutions. But when using repetitions, *, in regular expressions, you must remember that pattern matching is usually greedy by default. That means that a command like

gregexpr("atg(.{3})*t(aa|ga|ag)", tmp)

ends with the last stop codon in reading frame with the matched start codon, not with the first in-frame stop codon after the start codon. To avoid this, you have to make the regular expression non-greedy. One way to do this is to use

gregexpr("(atg)(.{3})*?(t(aa|ga|ag))", tmp, perl=TRUE)

instead. The *? means that the match should use as few repetitions of .{3} as possible. There might be more elegant ways of achieving the same. Notice that now it is the length of the match we are interested in; therefore we must access the attribute of the object that this function returns. The following function generates a random sequence, finds the open reading frames and simply returns the length of the first open reading frame:

orffun <- function(){
  tmp <- paste(sample(c("a","c","g","t"), 1000, replace=TRUE), sep="", collapse="")
  orf <- gregexpr("(atg)(.{3})*?(t(aa|ga|ag))", tmp, perl=TRUE)
  return((attr(orf[[1]], "match.length")[1]/3) - 2)
}

Then the following commands

orfs <- replicate(1000, orffun())
barplot(table(orfs), xlab="length", ylab="frequency")

give the plot below.
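The greedy versus non-greedy behaviour can also be illustrated outside R. The following Python sketch (the example sequence is made up for illustration) shows that the greedy pattern runs to the last in-frame stop codon while the *? version stops at the first one:

```python
import re

# made-up sequence: start codon, one codon, an in-frame stop codon,
# two more codons, then a second in-frame stop codon
seq = "atgaaataatttgggtaa"

greedy = re.search(r"atg(.{3})*t(aa|ga|ag)", seq).group()
lazy = re.search(r"atg(.{3})*?t(aa|ga|ag)", seq).group()

print(greedy)  # atgaaataatttgggtaa -- runs to the last stop codon
print(lazy)    # atgaaataa -- stops at the first in-frame stop codon
```

The same backtracking logic applies in R's gregexpr with perl=TRUE.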
[Figure: barplot of the simulated ORF lengths, frequency (0-50) against length (0-107).]

2.6.1) Consider for β > 0 the function

F(x) = 1 − exp(−x^β), x ≥ 0.

To show that this is a distribution function, we must according to Theorem 2.6.3 (p. 30) show that properties (i), (ii) and (iii) of Theorem 2.6.2 (p. 29) are satisfied.

(i) F is increasing: Let x1 ≤ x2, then x1^β ≤ x2^β and thus −x1^β ≥ −x2^β. This implies that exp(−x1^β) ≥ exp(−x2^β). Therefore

F(x1) = 1 − exp(−x1^β) ≤ 1 − exp(−x2^β) = F(x2).

(ii) It is understood that

F(x) = (1 − exp(−x^β)) 1_[0,∞)(x).

Therefore it is obvious that F(x) → 0 when x → −∞. Furthermore, when x → ∞ we have that x^β → ∞, which gives exp(−x^β) → 0 for x → ∞. Thus it is also obvious that F(x) = 1 − exp(−x^β) → 1 for x → ∞.

(iii) Finally, to show that F is right continuous at any x ∈ R, note that the function 1 − exp(−x^β) is continuous (since it is a composition of continuous functions). That gives us that F is continuous at all x ∈ R \ {0}, in particular right continuous there.
For x = 0, we have that F(0) = 0 and that lim_{ε→0, ε>0} F(ε) = 0. Thus for all x ∈ R we have that

lim_{ε→0, ε>0} F(x + ε) = F(x).

2.6.4) For λ > 0, let

f_λ(x) = (1 + x²/(2λ))^(−(λ+1/2)), x ∈ R.

Notice that for any x ∈ R we have that x²/(2λ) ≥ 0, thus (1 + x²/(2λ))^(λ+1/2) > 0, and it follows that f_λ(x) > 0. Now define the normalization constant

c(λ) = ∫ f_λ(x) dx.

The integrate function in R demands that the function it integrates must be an R function taking a numeric first argument and returning a numeric vector of the same length. So we can define f_λ in R in the following way:

flambda <- function(x, lambda){
  1/((1 + (x^2)/(2*lambda))^(lambda + 0.5))
}

Numerical integration with λ = 1/2 is then carried out by writing integrate(flambda, -Inf, Inf, lambda=0.5). The integral can be calculated by

vlambda <- c(0.25, 0.5, 1, 2, 5, 10, 20, 50, 100)
numint <- sapply(vlambda, function(parm){
  integrate(flambda, -Inf, Inf, lambda=parm)$value
})

for several different values of λ at the same time! Plotting by

plot(vlambda, numint, ylim=c(2,4), ylab="c(lambda)", xlab="lambda")
abline(h=pi, col="red", lty=2)
abline(h=sqrt(2*pi), col="red", lty=2)

makes the comparison with π and √(2π) easy. We notice that c(0.5) = π and that c(λ) → √(2π) when λ → ∞.
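For readers who want an exact benchmark for the numerical values, there is a closed form: f_λ is proportional to a Student t density with n = 2λ degrees of freedom, which gives c(λ) = √(2πλ) Γ(λ)/Γ(λ + 1/2). This identity is an addition of mine, not part of the exercise. A quick stdlib-only check, in Python rather than R so it can be run independently:

```python
import math

def c(lam):
    # closed form of the normalization constant, via the Student t
    # density with n = 2*lam degrees of freedom (identity added here,
    # not stated in the exercise)
    return math.sqrt(2 * math.pi * lam) * math.gamma(lam) / math.gamma(lam + 0.5)

print(abs(c(0.5) - math.pi))                   # essentially zero: c(1/2) = pi
print(abs(c(100.0) - math.sqrt(2 * math.pi)))  # small: c(lambda) -> sqrt(2*pi)
```

This matches what the plot of numint against vlambda suggests.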
[Figure: numint plotted against vlambda (c(lambda) against lambda, 0-100), with dashed red reference lines at pi and sqrt(2*pi).]

Since c(λ) > 0, we have that c(λ)^(−1) f_λ(x) > 0 for all x ∈ R, and since

∫ c(λ)^(−1) f_λ(x) dx = c(λ)^(−1) ∫ f_λ(x) dx = 1,

the function c(λ)^(−1) f_λ is a density (according to page 32). To compare this t-distribution density with the density of the normal distribution:

plot(-50:50/10, dnorm(-50:50/10), xlab="x", ylab="f(x)")
points(-50:50/10, 1/numint[2]*flambda(-50:50/10, 0.5), col="red")

The red curve is the t-distribution.
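The convergence seen in the plots can be backed by a short limit argument; this derivation is an addition to the solution, not part of the exercise text. Since (1 + a/λ)^λ → e^a as λ → ∞, pointwise

```latex
f_\lambda(x)
  = \left(1 + \frac{x^2}{2\lambda}\right)^{-(\lambda + 1/2)}
  \;\longrightarrow\; e^{-x^2/2}
  \qquad (\lambda \to \infty).
```

Moreover, for λ ≥ 1/2 Bernoulli's inequality gives (1 + x²/(2λ))^(λ+1/2) ≥ 1 + x²/2, so f_λ(x) ≤ (1 + x²/2)^(−1), which is integrable. By dominated convergence c(λ) → ∫ e^(−x²/2) dx = √(2π), and the normalized density c(λ)^(−1) f_λ converges to the standard normal density, which is exactly what the comparison plot shows.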
[Figure: the standard normal density dnorm (black) and the scaled t-density 1/numint[2]*flambda(x, 0.5) (red) plotted against x on (-5, 5).]

2.6.6) According to Example 2.6.15, the density for the Gumbel distribution is

f(x) = exp(−x) exp(−exp(−x)) = exp(−x − exp(−x)).

The mean is defined, if ∫ |x| f(x) dx < ∞, as μ = ∫ x f(x) dx. Numerical computation of the mean in R can be done by

fgumbel <- function(x){ exp(-x - exp(-x)) }
mugumbel <- integrate(function(x){ x*fgumbel(x) }, -Inf, Inf)

Actually, the mean equals the Euler-Mascheroni constant, which is related to the Γ-function. Now the variance

σ² = ∫ (x − μ)² f(x) dx

can be calculated numerically by

vargumbel <- integrate(function(x){ (x - mugumbel$value)^2 * fgumbel(x) }, -Inf, Inf)

The variance of the Gumbel distribution equals π²/6.

2.8.5) We consider the probabilistic model of the pair of letters from Example 2.8.11 (p. 55). The sample space is E = {A,C,G,T} × {A,C,G,T}. Let X and Y denote the random variables representing the two aligned nucleic acids and let their joint distribution be as given in the exercise.
By Definition 2.8.10 (p. 54), the point probabilities of the marginal distribution P1 of X are given by

p1(A) = P1({A}) = P(X ∈ {A}) = P((X, Y) ∈ {A} × {A,C,G,T})
      = P({A} × {A,C,G,T}) = Σ_{y ∈ {A,C,G,T}} p(A, y)
      = 0.12 + 0.03 + 0.04 + 0.01 = 0.20.

All the other point probabilities are found in the same way. Thus we get that the marginal distribution P1 of X is given by the point probabilities p1(A) = 0.20, p1(C) = 0.37, p1(G) = 0.22 and p1(T) = 0.21.

Again by Definition 2.8.10, the point probabilities of the marginal distribution P2 of Y are given by

p2(A) = P2({A}) = P({A,C,G,T} × {A}) = Σ_{x ∈ {A,C,G,T}} p(x, A)
      = 0.12 + 0.02 + 0.02 + 0.05 = 0.21.

The other point probabilities are found in the same way. The marginal distribution P2 of Y is given by the point probabilities p2(A) = 0.21, p2(C) = 0.34, p2(G) = 0.24 and p2(T) = 0.21.

Now assume that X and Y have the same marginal distributions P1 and P2 as above, but that X and Y are independent. Let P* denote the joint distribution that makes X and Y independent. By Definition 2.9.1, P* must for all events M1 ⊆ {A,C,G,T} and M2 ⊆ {A,C,G,T} satisfy that

P*(M1 × M2) = P1(M1)P2(M2).

Since the joint sample space E = {A,C,G,T} × {A,C,G,T} is discrete, P* is given by its point probabilities. According to the above, we have that

p*(A, A) = P*({A} × {A}) = P1({A})P2({A}) = p1(A)p2(A)

and similarly for all other pairs of nucleotides. Thus we can calculate the point probabilities of the joint distribution P* simply by multiplying the appropriate point probabilities of the marginal distributions P1 and P2. (This is also stated in Theorem 2.9.3!) The point probabilities p* of the distribution P* that makes X and Y independent with marginal distributions P1 and P2 are then found to be

       A       C       G       T
A   0.0420  0.0680  0.0480  0.0420
C   0.0777  0.1258  0.0888  0.0777
G   0.0462  0.0748  0.0528  0.0462
T   0.0441  0.0714  0.0504  0.0441
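As a cross-check of the multiplications, the table above can be reproduced with a few lines of stdlib Python (the marginals are the ones computed in the text; Python is used here only so the check runs outside R):

```python
# marginal point probabilities of X and Y, as computed in the text
p1 = {"A": 0.20, "C": 0.37, "G": 0.22, "T": 0.21}
p2 = {"A": 0.21, "C": 0.34, "G": 0.24, "T": 0.21}

# joint point probabilities under independence: p*(x, y) = p1(x) * p2(y)
p_star = {(x, y): p1[x] * p2[y] for x in "ACGT" for y in "ACGT"}

print(round(p_star[("A", "A")], 4))    # 0.042, the top-left table entry
print(round(sum(p_star.values()), 4))  # 1.0, so p* is a probability table
```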
The probability under P* of the event X = Y is found by

P*(X = Y) = P*({(x, y) ∈ E | x = y}) = Σ_{(x,y) ∈ E: x = y} p*(x, y)
          = 0.0420 + 0.1258 + 0.0528 + 0.0441 = 0.2647.

The probability that the two nucleotides are equal is much smaller than in the example: when X and Y are independent, the probability of obtaining a pair of unequal nucleotides is higher.

2.9.1) We think of the data as representing independent outcomes of the random vector (X, Y); note that X and Y are dependent. The sample space is E = E_a × E_a, where E_a denotes the amino acid alphabet. The data is loaded into R with

aadata <- read.table("http://www.math.ku.dk/~richard/download/courses/binf_2007/aa.txt")

We cross-tabulate the data with aafreq <- table(aadata). The matrix of relative frequencies is then obtained by division with the total number of observations:

N <- dim(aadata)[1]
relfreq <- aafreq/N

2.9.2) Assume that the joint distribution P of X and Y is given by the point probabilities that are the relative frequencies from above. The point probabilities p1 and p2 of the marginal distributions P1 and P2 of X and Y are, by Definition 2.8.10, calculated by

prob_1 <- apply(relfreq, 1, sum)
prob_2 <- apply(relfreq, 2, sum)

It follows that X and Y are not independent, since

p(A, A) = 0.0553 ≠ 0.0064 = p1(A)p2(A).

The score matrix is calculated by

score <- log(relfreq/outer(prob_1, prob_2))

Since S(x, y) = log(p(x, y)) − log(p1(x)p2(y)), it can be thought of as a measure of how different the joint distribution is from the distribution making X and Y independent with marginal distributions P1 and P2. Or it can be thought of as a way to compare
how probable it is to observe (x, y) under the joint distribution compared with under the independence distribution.

S(X, Y) is simply a transformation of the random vector (X, Y), and as such it is itself a random variable. The sample space of S(X, Y) is finite, since only a finite number of values is possible. When (X, Y) has distribution P, the log is always defined, since it is only with probability zero that we get a pair (x, y) for which p(x, y) = 0 (the problem was that log(0) is undefined).

Example 2.4.8 tells us exactly how to calculate the mean under the different distributions: μ = Σ_{x ∈ E} h(x)p(x). This is done in R by

score[score==-Inf] <- 0
sum(score*relfreq)

(When the joint probability is 0, the score function equals -Inf in R, but since this occurs with probability zero, we can change the values of the score function for these pairs of letters without changing the distribution.)
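The whole 2.9.2 construction can be illustrated on a made-up two-letter alphabet; the numbers below are invented for illustration (the real computation uses the aa.txt frequencies), and Python is used only so the sketch runs outside R:

```python
import math

# invented joint distribution on a two-letter alphabet {A, B}
joint = {("A", "A"): 0.4, ("A", "B"): 0.1,
         ("B", "A"): 0.1, ("B", "B"): 0.4}

# marginals, as in Definition 2.8.10: sum over the other coordinate
p1 = {a: sum(v for (x, _), v in joint.items() if x == a) for a in "AB"}
p2 = {b: sum(v for (_, y), v in joint.items() if y == b) for b in "AB"}

# score S(x, y) = log p(x, y) - log(p1(x) * p2(y))
score = {k: math.log(v / (p1[k[0]] * p2[k[1]])) for k, v in joint.items()}

# mean of the score under the joint distribution (Example 2.4.8)
mean_score = sum(joint[k] * score[k] for k in joint)
print(mean_score > 0)  # True here, since this joint table is dependent
```

The mean score is the relative entropy between P and the product of its marginals, so it is zero exactly when X and Y are independent and positive otherwise.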