Simulations

(NRH: Sections 2.6, 2.7, 2.11 and 2.12; at this point in the course the sections will be difficult to follow.)

Simulations

Computer simulation of realizations of random variables has become indispensable as a supplement to theoretical investigations and practical applications.

We can easily investigate a large number of scenarios (different P's) and a large number of replications.

We can compute transformed distributions numerically.

We can compute (complicated) mean values; this is known as Monte Carlo simulation.

We can investigate the behaviour of methods for statistical inference.

But how can the deterministic computer generate the outcome from a probability measure?

Generic simulation

A two-step procedure lies behind the simulation of random variables:

The computer emulates the generation of independent, identically distributed random variables with the uniform distribution on the unit interval [0, 1].

The emulated uniformly distributed random variables are then turned into variables with the desired distribution by a transformation.

Theorem behind

The following result is behind the generic procedure for simulation from any P on E:

Theorem: Let P_0 denote the uniform distribution on [0, 1] and h : [0, 1] → E a map such that the transformed probability measure on E is P = h(P_0). If U_1, U_2, ..., U_n are iid with distribution P_0, then X_1, X_2, ..., X_n defined by X_i = h(U_i) are n iid random variables, each with distribution P.

The real problem

What we need in practice is thus the construction of a transformation that turns the uniform distribution on [0, 1] into the desired probability distribution.

We focus here on two cases:

A general method for discrete distributions.

A general method for probability measures on R given in terms of the distribution function.

But what about...

...the simulation of the independent, uniformly distributed random variables?

That's a completely different story. Read D. E. Knuth, The Art of Computer Programming, Chapter 3, or trust that R behaves well and that runif works correctly.

We rely on a sufficiently good pseudo-random number generator: as long as we cannot statistically detect differences between what the generator produces and true iid uniformly distributed random variables on [0, 1], we live happily in ignorance.

Discrete random variables

If P is a probability measure on a discrete sample space E given by point probabilities p(x), x ∈ E, choose for each x ∈ E an interval

I(x) = (a(x), b(x)] ⊆ [0, 1]

such that

the length, b(x) − a(x), of I(x) equals p(x), and

the intervals are mutually disjoint: I(x) ∩ I(y) = ∅ for x ≠ y.

Letting u_1, ..., u_n be generated by a pseudo-random number generator, we define x_i = x if u_i ∈ I(x) for i = 1, ..., n. Then x_1, ..., x_n is a realization of n iid random variables with the distribution having point probabilities p(x), x ∈ E. A small R sketch of this construction is given below.
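
As an illustration (my sketch, not part of the original slides), the intervals can be taken consecutively, with right endpoints given by the cumulative point probabilities; findInterval then assigns each uniform variate to its interval:

> rdiscrete <- function(n, x, p) {
+     b <- cumsum(p)              # right endpoints b(x); left endpoints are c(0, b[-length(b)])
+     u <- runif(n)               # pseudo-random uniforms on [0, 1]
+     x[findInterval(u, b) + 1]   # u between b[k-1] and b[k] is mapped to x[k]
+ }
> rdiscrete(5, c("A", "C", "G", "T"), c(0.2, 0.3, 0.3, 0.2))

The boundary convention of findInterval differs from (a(x), b(x)] only on a set of probability zero.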

Generalized inverse

Definition: Let F : R → [0, 1] be a distribution function. A function F^← : (0, 1) → R that satisfies

F(x) ≥ y  ⇔  x ≥ F^←(y)

for all x ∈ R and y ∈ (0, 1) is called a generalized inverse of F.

Generalized inverse

If F has a true inverse (F is strictly increasing and continuous), then F^← equals the inverse, F^(-1), of F.

All distribution functions have a generalized inverse; we find it by solving the inequality F(x) ≥ y.
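
For example (a standard computation, not spelled out on the slide), for the exponential distribution with intensity parameter λ we have F(x) = 1 − exp(−λx) for x ≥ 0, and solving F(x) ≥ y gives

1 − exp(−λx) ≥ y  ⇔  x ≥ −log(1 − y)/λ,

so F^←(y) = −log(1 − y)/λ. This is precisely the transformation used by my.rexp in the exercise below.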

Continuous sample space

We will simulate from P on R having distribution function F.

First find the generalized inverse, F^← : (0, 1) → R, of F.

Then let u_1, ..., u_n be generated by a pseudo-random number generator and define x_i = F^←(u_i) for i = 1, ..., n. Then x_1, ..., x_n is a realization of n iid random variables with the distribution having distribution function F.
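
As a sketch (my example, not from the slides): the Gumbel distribution, which reappears below, has F(x) = exp(−exp(−(x − μ))) with location parameter μ, so F^←(y) = μ − log(−log(y)), and the inverse transform method is one line of R:

> rgumbel <- function(n, location = 0) {
+     location - log(-log(runif(n)))   # F^<-(u) = location - log(-log(u))
+ }
> summary(rgumbel(1000, location = log(10)))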

Local alignments

Assume that X_1, ..., X_n and Y_1, ..., Y_m are in total n + m iid random variables with values in the 20-letter amino acid alphabet

E = {A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V}.

We want to find the optimal local alignment, and in particular we are interested in the score of the optimal local alignment. This is a function h : E^(n+m) → R.

An X-subsequence and a Y-subsequence are matched letter by letter; matched letters are given a score, positive or negative, and gaps in the subsequences are given a penalty.

Local alignment scores

Denote by S_{n,m} = h(X_1, ..., X_n, Y_1, ..., Y_m) the transformed, real-valued random variable. What is the distribution of S_{n,m}?

We can in principle compute its discrete distribution from the distribution of the X- and Y-variables, but this is futile and not possible in practice.

It is possible to use simulations, but it may be quite time-consuming and not a practical solution for current database usage.

Develop a good theoretical approximation.

Local alignment scores

Under certain conditions on the scoring mechanism and the letter distribution, a valid approximation for n and m large is, for parameters λ, K > 0,

P(S_{n,m} ≤ x) ≈ exp(−Knm exp(−λx)).

This is a scale-location transformation,

S_{n,m} = log(Knm)/λ + S'_{n,m}/λ,

where S'_{n,m} has a Gumbel distribution.
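
A tiny R helper (my illustration, not from the slides) turns the approximation into an approximate p-value for an observed score x; the parameters lambda and K must come from theory or from fitting, and the numbers below are purely hypothetical:

> palign <- function(x, n, m, lambda, K) {
+     exp(-K * n * m * exp(-lambda * x))   # approximate P(S_{n,m} <= x)
+ }
> 1 - palign(52, 200, 300, lambda = 0.27, K = 0.04)   # approximate p-value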

Statistical models

Example: We measure the expression of our favorite gene number i on a microarray. The additive noise model says that our measurement can be written as

X_i = μ_i + σ_i ε_i,

where μ_i ∈ R, σ_i > 0, and ε_i has mean 0 and variance 1.

If ε_i ~ N(0, 1) we have X_i ~ N(μ_i, σ_i^2), and we have fully specified our model with unknown parameters (μ_i, σ_i) ∈ R × (0, ∞).

Statistical models

Example: We want to consider pairs of nucleotides (X_i, Y_i) that are evolutionarily related. We assume that they are independent and identically distributed and that

P(X_1 = x, Y_1 = y) = p(x)P_t(x, y)
                    = p(x)(0.25 + 0.75 exp(−4αt))   if x = y,
                    = p(x)(0.25 − 0.25 exp(−4αt))   if x ≠ y.

The unknown parameters are α > 0 and the four-dimensional probability vector p. Perhaps t is also an unknown parameter.
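
A minimal simulation sketch (my code, not from the slides), using the conditional probabilities implied by the model: given X = x, we have Y = x with probability 0.25 + 0.75 exp(−4αt), and Y equal to each of the three other nucleotides with probability 0.25 − 0.25 exp(−4αt):

> rjc.pair <- function(n, p, alpha, t) {
+     nuc  <- c("A", "C", "G", "T")
+     x    <- sample(nuc, n, replace = TRUE, prob = p)   # X drawn from p
+     same <- 0.25 + 0.75 * exp(-4 * alpha * t)          # P(Y = x | X = x)
+     y    <- ifelse(runif(n) < same, x,
+                    sapply(x, function(z) sample(setdiff(nuc, z), 1)))
+     cbind(x, y)
+ }
> head(rjc.pair(1000, p = rep(0.25, 4), alpha = 1, t = 0.1))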

Statistical models

We need a sample space E.

We need a parameter space Θ of unknown parameters...

...and for each θ ∈ Θ we need a probability measure P_θ on E.

We call (P_θ)_{θ ∈ Θ} a parameterized family of probability measures.

Exponential distribution

Let E_0 = [0, ∞), let θ ∈ (0, ∞), and let P_θ be the distribution of n iid exponentially distributed random variables X_1, ..., X_n with intensity parameter θ.

The distribution of X_i has density f_θ(x) = θ exp(−θx) for x ≥ 0. The probability measure P_θ on E = E_0^n has density

f_θ(x_1, ..., x_n) = θ exp(−θx_1) · ... · θ exp(−θx_n) = θ^n exp(−θ(x_1 + ... + x_n)).

With Θ = (0, ∞), the family (P_θ)_{θ ∈ Θ} of probability measures is a statistical model on E.

Estimators

Definition: An estimator is a map θ̂ : E → Θ. For a given observation x ∈ E, the value θ̂(x) of θ̂ at x is called the estimate of θ.

If X has distribution P_θ, the transformed random variable θ̂(X) is also called the estimator; it has distribution θ̂(P_θ).
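
A concrete example (standard, but not spelled out on the slide): in the exponential model above, maximizing the density θ^n exp(−θ(x_1 + ... + x_n)) over θ, for instance by setting the derivative of its logarithm, n/θ − (x_1 + ... + x_n), equal to zero, gives the maximum-likelihood estimator

θ̂(x_1, ..., x_n) = n/(x_1 + ... + x_n).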

Identifiability

Definition: The parameter θ is said to be identifiable if the map θ ↦ P_θ is one-to-one, that is, if for two different parameters θ_1 and θ_2 the corresponding measures P_{θ_1} and P_{θ_2} differ.

We cannot in any meaningful way estimate an unknown parameter that is not identifiable!

Simulations

Write a function, my.rexp, that takes two parameters such that

> tmp <- my.rexp(10, 1)

generates the realization of 10 iid random variables with the exponential distribution with parameter λ = 1.

How do you make the second parameter equal to 1 by default, such that

> tmp <- my.rexp(10)

produces the same result?

Solution

> my.rexp <- function(n, lambda) {
+     -log(runif(n))/lambda
+ }

To make λ = 1 the default we define instead

> my.rexp <- function(n, lambda = 1) {
+     -log(runif(n))/lambda
+ }

Note that we have used that if U is uniformly distributed on [0, 1], then 1 − U is also uniformly distributed on [0, 1]; this lets us compute −log(u)/λ instead of −log(1 − u)/λ and get rid of an unnecessary computation.
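
A quick sanity check (my suggestion, not part of the solution) compares my.rexp with R's built-in generator; the points of the Q-Q plot should fall close to the identity line:

> qqplot(my.rexp(10000), rexp(10000, 1))
> abline(0, 1)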

Maximum of random variables

Use

> tmp <- replicate(1000, max(rexp(10, 1)))

to generate 1000 replications of the maximum of 10 independent exponential random variables.

Plot the distribution function of the Gumbel distribution with location parameter log(10) and compare it with the empirical distribution function

> emdf <- function(x) sapply(x, function(x) sum(tmp <= x)/1000)

What if we take the maximum of 100 exponential random variables?
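
Why location parameter log(10)? A standard computation (not on the slide): with X_1, ..., X_10 iid exponential with intensity 1,

P(max ≤ x) = (1 − exp(−x))^10 = exp(10 log(1 − exp(−x))) ≈ exp(−10 exp(−x)) = exp(−exp(−(x − log(10))))

for large x, which is the Gumbel distribution function with location parameter log(10). The same argument with 100 variables gives location log(100).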

Solutions

> x <- seq(0, 5, by = 0.1)
> plot(x, exp(-exp(-(x - log(10)))), type = "l")
> points(x, emdf(x), type = "p", pch = 20, col = "red")

Solutions

[Figure: the Gumbel distribution function exp(−exp(−(x − log(10)))) plotted as a solid line for x in [0, 5], with the empirical distribution function of the 1000 simulated maxima overlaid as red points.]

Solutions

> tmp <- replicate(1000, max(rexp(100, 1)))
> emdf <- function(x) sapply(x, function(x) sum(tmp <= x)/1000)
> x <- seq(0, 8, by = 0.1)
> plot(x, exp(-exp(-(x - log(100)))), type = "l")
> points(x, emdf(x), type = "p", pch = 20, col = "red")

Solutions

[Figure: the Gumbel distribution function exp(−exp(−(x − log(100)))) plotted as a solid line for x in [0, 8], with the empirical distribution function of the 1000 simulated maxima of 100 exponentials overlaid as red points.]