Week 2 Statistics for bioinformatics and escience

Line Skotte, 20 November 2008

2.5.1-5) Revisited. When solving these exercises, some of you tried to capture a whole open reading frame by pattern matching with a regular expression. I was very impressed by your solutions. But when using repetitions, *, in regular expressions, you must remember that pattern matching is usually greedy by default. That means that a command like

    gregexpr("atg(.{3})*t(aa|ga|ag)", tmp)

ends with the last stop codon in reading frame with the matched start codon, not with the first in-frame stop codon after the start codon. To avoid this, you have to make the regular expression non-greedy. One way to do this is to use

    gregexpr("(atg)(.{3})*?(t(aa|ga|ag))", tmp, perl=TRUE)

instead. The *? means that the match should use as few repetitions of .{3} as possible. There may be more elegant ways of achieving the same. Notice that it is now the length of the match we are interested in, so we must access the "match.length" attribute of the object that this function returns. The following function generates a random sequence, finds the open reading frames and simply returns the length of the first open reading frame (measured as the number of complete codons between the start and the stop codon).

    orffun <- function(){
      # random sequence of 1000 nucleotides
      tmp <- paste(sample(c("a","c","g","t"), 1000, replace=TRUE), sep="", collapse="")
      # non-greedy match: start codon, whole codons, first in-frame stop codon
      orf <- gregexpr("(atg)(.{3})*?(t(aa|ga|ag))", tmp, perl=TRUE)
      return((attr(orf[[1]], "match.length")[1]/3) - 2)
    }

Then the following commands

    orfs <- replicate(1000, orffun())
    barplot(table(orfs), xlab="length", ylab="frequency")

give the plot below.

[Figure: barplot of the simulated ORF lengths, frequency against length.]
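To see concretely what the non-greedy *? changes, the two patterns can be compared on a toy sequence. The sequence and the variable name below are made up purely for illustration:

    toy <- "atgaaataaccctga"   # codons: atg aaa taa ccc tga

    # greedy: the match runs to the last in-frame stop codon (tga), so its length is 15
    attr(gregexpr("atg(.{3})*t(aa|ga|ag)", toy, perl=TRUE)[[1]], "match.length")

    # non-greedy: the match stops at the first in-frame stop codon (taa), so its length is 9
    attr(gregexpr("(atg)(.{3})*?(t(aa|ga|ag))", toy, perl=TRUE)[[1]], "match.length")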

2.6.1) Consider for β > 0 the function F(x) = 1 - exp(-x^β), x ≥ 0. To show that this is a distribution function, we must according to Theorem 2.6.3 (p. 30) show that properties (i), (ii) and (iii) of Theorem 2.6.2 (p. 29) are satisfied.

(i) F is increasing: Let x_1 ≤ x_2, then x_1^β ≤ x_2^β and thus -x_1^β ≥ -x_2^β. This implies that exp(-x_1^β) ≥ exp(-x_2^β). Therefore

    F(x_1) = 1 - exp(-x_1^β) ≤ 1 - exp(-x_2^β) = F(x_2).

(ii) It is understood that F(x) = (1 - exp(-x^β)) 1_[0,∞)(x). Therefore it is obvious that F(x) → 0 when x → -∞. Furthermore, when x → ∞ we have that x^β → ∞, which gives exp(-x^β) → 0 for x → ∞. Thus it is also obvious that F(x) = 1 - exp(-x^β) → 1 for x → ∞.

(iii) Finally, to show that F is right continuous at any x ∈ R, note that the function 1 - exp(-x^β) is continuous (since it is a combination of continuous functions). That gives us that F is continuous at all x ∈ R \ {0}, in particular right continuous there. For x = 0, we have that F(0) = 0 and that lim_{ε→0, ε>0} F(ε) = 0. Thus for all x ∈ R we have that lim_{ε→0, ε>0} F(x + ε) = F(x).
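The function F is in fact the distribution function of the Weibull distribution with shape parameter β and scale 1, so the three properties can also be illustrated numerically against R's built-in pweibull. A small sketch; Fbeta is a helper name of our own choosing and β = 2 is an arbitrary example:

    Fbeta <- function(x, beta){ ifelse(x >= 0, 1 - exp(-x^beta), 0) }
    x <- seq(0, 5, by = 0.1)
    all.equal(Fbeta(x, beta = 2), pweibull(x, shape = 2, scale = 1))    # TRUE
    plot(x, Fbeta(x, beta = 2), type = "l", xlab = "x", ylab = "F(x)")  # increasing from 0 towards 1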

2.6.4) For λ > 0, let

    f_λ(x) = 1 / (1 + x^2/(2λ))^(λ + 1/2),  x ∈ R.

Notice that for any x ∈ R, we have that x^2/(2λ) ≥ 0, thus (1 + x^2/(2λ))^(λ + 1/2) > 0, and it follows that f_λ(x) > 0. Now define the normalization constant

    c(λ) = ∫_{-∞}^{∞} f_λ(x) dx.

The integrate function in R demands that the function it integrates is an R function taking a numeric first argument and returning a numeric vector of the same length. So we can define f_λ in R in the following way:

    flambda <- function(x, lambda){
      1/((1 + (x^2)/(2*lambda))^(lambda + 0.5))
    }

Numerical integration with λ = 1/2 is then carried out by writing integrate(flambda, -Inf, Inf, lambda=0.5). The integral can be calculated for several different values of λ at the same time by

    vlambda <- c(0.25, 0.5, 1, 2, 5, 10, 20, 50, 100)
    numint <- sapply(vlambda, function(parm){
      integrate(flambda, -Inf, Inf, lambda=parm)$value
    })

Plotting by

    plot(vlambda, numint, ylim=c(2,4), ylab="c(lambda)", xlab="lambda")
    abline(h=pi, col="red", lty=2)
    abline(h=sqrt(2*pi), col="red", lty=2)

makes the comparison with π and √(2π) easy. We notice that c(0.5) = π and that c(λ) → √(2π) when λ → ∞.
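These two observations can also be checked directly from the numbers, reusing vlambda and numint from above:

    numint[vlambda == 0.5] - pi            # essentially zero
    numint[vlambda == 100] - sqrt(2*pi)    # small, and it shrinks as lambda grows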

[Figure: the numerically computed values c(lambda) plotted against lambda, with dashed horizontal reference lines at π and √(2π).]

Since c(λ) > 0, we have that c(λ)^(-1) f_λ(x) > 0 for all x ∈ R, and since

    ∫ c(λ)^(-1) f_λ(x) dx = c(λ)^(-1) ∫ f_λ(x) dx = 1,

the function c(λ)^(-1) f_λ(x) is a density (according to page 32). To compare this t-distribution density with the density of the normal distribution:

    plot(-50:50/10, dnorm(-50:50/10), xlab="x", ylab="f(x)")
    points(-50:50/10, 1/numint[2]*flambda(-50:50/10, 0.5), col="red")

The red curve is the t-distribution.

[Figure: the normal density (black) and the t-distribution density with λ = 0.5 (red), f(x) against x.]
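The normalized function c(λ)^(-1) f_λ is the density of the t-distribution with 2λ degrees of freedom, so the comparison can also be made with R's built-in dt. A small sketch reusing flambda and numint from above:

    x <- -50:50/10
    # lambda = 0.5 corresponds to 2*lambda = 1 degree of freedom, i.e. the Cauchy distribution
    max(abs(flambda(x, 0.5)/numint[2] - dt(x, df = 1)))   # tiny; only numerical integration error
    curve(dnorm(x), from = -5, to = 5, ylab = "f(x)")
    curve(dt(x, df = 1), from = -5, to = 5, add = TRUE, col = "red")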

2.6.6) According to Example 2.6.15, the density for the Gumbel distribution is

    f(x) = exp(-x) exp(-exp(-x)) = exp(-x - exp(-x)).

The mean is defined, provided that ∫ |x| f(x) dx < ∞, as μ = ∫ x f(x) dx. Numerical computation of the mean in R can be done by

    fgumbel <- function(x){ exp(-x - exp(-x)) }
    mugumbel <- integrate(function(x){ x*fgumbel(x) }, -Inf, Inf)

Actually, the mean equals the Euler-Mascheroni constant, which is related to the Γ-function. Now the variance σ^2 = ∫ (x - μ)^2 f(x) dx can be calculated numerically by

    vargumbel <- integrate(function(x){ (x - mugumbel$value)^2*fgumbel(x) }, -Inf, Inf)

The variance of the Gumbel distribution equals π^2/6.

2.8.5) We consider the probabilistic model of the pair of letters from Example 2.8.11 (p. 55). The sample space is E = {A,C,G,T} × {A,C,G,T}. Let X and Y denote the random variables representing the two aligned nucleic acids, and let their joint distribution be as given in the exercise.

By Definition 2.8.10 (p. 54), the point probabilities of the marginal distribution P_1 of X are given by

    p_1(A) = P_1({A}) = P(X ∈ {A}) = P((X, Y) ∈ {A} × {A,C,G,T}) = P({A} × {A,C,G,T})
           = Σ_{y ∈ {A,C,G,T}} p(A, y) = 0.12 + 0.03 + 0.04 + 0.01 = 0.20.

All the other point probabilities are found in the same way. Thus we get that the marginal distribution P_1 of X is given by the point probabilities p_1(A) = 0.20, p_1(C) = 0.37, p_1(G) = 0.22 and p_1(T) = 0.21.

Again by Definition 2.8.10, the point probabilities of the marginal distribution P_2 of Y are given by

    p_2(A) = P_2({A}) = P({A,C,G,T} × {A}) = Σ_{x ∈ {A,C,G,T}} p(x, A) = 0.12 + 0.02 + 0.02 + 0.05 = 0.21.

The other point probabilities are found in the same way. The marginal distribution P_2 of Y is given by the point probabilities p_2(A) = 0.21, p_2(C) = 0.34, p_2(G) = 0.24 and p_2(T) = 0.21.

Now assume that X and Y have the same marginal distributions P_1 and P_2 as above, but that X and Y are independent. Let P* denote the joint distribution that makes X and Y independent. By Definition 2.9.1, P* must for all events M_1 ⊆ {A,C,G,T} and M_2 ⊆ {A,C,G,T} satisfy that P*(M_1 × M_2) = P_1(M_1)P_2(M_2). Since the joint sample space E = {A,C,G,T} × {A,C,G,T} is discrete, P* is given by its point probabilities. According to the above, we have that

    p*(A, A) = P*({A} × {A}) = P_1({A})P_2({A}) = p_1(A)p_2(A),

and similarly for all other pairs of nucleotides. Thus we can calculate the point probabilities of the joint distribution P* simply by multiplying the appropriate point probabilities of the marginal distributions P_1 and P_2. (This is also stated in Theorem 2.9.3!) The point probabilities p* of the distribution P* that makes X and Y independent with marginal distributions P_1 and P_2 are then found to be

         A       C       G       T
    A  0.0420  0.0680  0.0480  0.0420
    C  0.0777  0.1258  0.0888  0.0777
    G  0.0462  0.0748  0.0528  0.0462
    T  0.0441  0.0714  0.0504  0.0441
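This table is just the outer product of the two marginal point probability vectors, so it is easy to reproduce in R. A small sketch using the marginals found above (the variable names are our own):

    p1 <- c(A=0.20, C=0.37, G=0.22, T=0.21)   # marginal point probabilities of X
    p2 <- c(A=0.21, C=0.34, G=0.24, T=0.21)   # marginal point probabilities of Y
    pstar <- outer(p1, p2)                    # point probabilities under independence
    round(pstar, 4)
    sum(pstar)                                # 1, as it should be for point probabilities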

The probability under P* of the event X = Y is found by

    P*(X = Y) = P*({(x, y) ∈ E : x = y}) = Σ_{(x,y) ∈ E: x = y} p*(x, y)
              = 0.0420 + 0.1258 + 0.0528 + 0.0441 = 0.2647.

The probability that the two nucleotides are equal is much smaller than in the example. When X and Y are independent, the probability of obtaining a pair of unequal nucleotides is higher.

2.9.1) We think of the data as representing independent outcomes of the random vector (X, Y); note that X and Y are dependent. The sample space is E = E_a × E_a, where E_a denotes the amino acid alphabet. The data are loaded into R with

    aadata <- read.table("http://www.math.ku.dk/~richard/download/courses/binf_2007/aa.txt")

We cross tabulate the data with aafreq <- table(aadata). The matrix of relative frequencies is then obtained by division with the total number of observations:

    N <- dim(aadata)[1]
    relfreq <- aafreq/N

2.9.2) Assume that the joint distribution P of X and Y is given by the point probabilities that are the relative frequencies from above. The point probabilities p_1 and p_2 of the marginal distributions P_1 and P_2 of X and Y are, by Definition 2.8.10, calculated by

    prob_1 <- apply(relfreq, 1, sum)
    prob_2 <- apply(relfreq, 2, sum)

It follows that X and Y are not independent, since

    p(A, A) = 0.0553 ≠ 0.0064 = p_1(A)p_2(A).

The score matrix is calculated by

    score <- log(relfreq/outer(prob_1, prob_2))

Since S_{x,y} = log(p(x, y)) - log(p_1(x)p_2(y)), it can be thought of as a measure of how different the joint distribution is from the distribution making X and Y independent with marginal distributions P_1 and P_2. Or it can be thought of as a way to compare how probable it is to observe (x, y) under the joint distribution with how probable it is under the independence distribution.

S_{X,Y} is simply a transformation of the random vector (X, Y), and as such it is itself a random variable. The sample space of S_{X,Y} is finite, since only a finite number of values is possible. When (X, Y) has distribution P, the log is always defined, since it is only with probability zero that we get a pair (x, y) for which p(x, y) = 0 (the problem was that log(0) is undefined). Example 2.4.8 tells us exactly how to calculate the mean under the different distributions: μ = Σ_{x ∈ E} h(x)p(x). This is done in R by

    score[score == -Inf] <- 0
    sum(score*relfreq)

(When the joint point probability is 0, the score function equals -Inf in R, but since this occurs with probability zero, we can change the values of the score function for these pairs of letters without changing the distribution.)
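An equivalent way to compute the same mean without overwriting entries of score is to sum only over the letter pairs that actually occur; a small sketch reusing relfreq and score from above:

    # pairs with relative frequency zero have probability zero and contribute nothing to the mean
    sum(score[relfreq > 0] * relfreq[relfreq > 0])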

how probable it is to observe (x, y) under the joint distribution compared with under the independence-distribution. S X,Y is simply a transformation of the random vector (X, Y ), and as such it is itself a random variable. The sample space of S X,Y is finite, since only a finite number of values is possible. When (X, Y ) has distribution P, the log is always defined, since it is only with probability zero that we get a pair (x, y) for wich p(x, y) = 0 (the problem was that log(0) is undefined). Example 2.4.8 tells us exactly how to calculate the mean under the different distributions µ = x E h(x)p(x). This is done in R by score[score==-inf]<-0 sum(score*relfreq) (When the joint distribution is 0, then the score funtion equals in R, but since this occurs with probabitity zero, we can change the values of the score function for these pairs of letters without changing the distribution). 8