Gene Expression as a Stochastic Process: From Gene Number Distributions to Protein Statistics and Back

June 19, 2007

Outline:
1. Motivation & Basics
2. A Stochastic Approach to Gene Expression
3. Application to Experimental Data
4. Summary & Outlook

Gene Copy Number and Transfection
A major hope of gene therapy is to treat diseases with artificial viruses that deliver genes (coding for beneficial proteins) into the cell.
Bad treatment: a heterogeneous distribution of plasmids, in which many cells receive no plasmids and a few cells receive many.

The Central Dogma of Biology
After import of the genetic material, genes are expressed by the cellular machinery via transcription and translation. Each reaction is an inherently stochastic process, so a spread in protein numbers is found after gene expression.

Intrinsic and Extrinsic Noise
In biological systems noise arises from two sources:
1. Intrinsic noise: due to the probabilistic nature of chemical reactions. It can be treated by means of probability calculus: Master equation, Fokker-Planck equation, simulations.
2. Extrinsic noise: due to variations in rate constants (different cell volume, temperature, cell cycle state, number of enzymes, etc.). Its nature and strength are usually unknown.

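Deterministic Approach
Assume the gene number $D$ is fixed.
$$\frac{\partial R(t)}{\partial t} = \lambda_1 D - \delta_1 R(t), \qquad \frac{\partial P(t)}{\partial t} = \lambda_2 R(t) - \delta_2 P(t)$$
These equations can be solved successively:
$$R(t) = D\,\frac{\lambda_1}{\delta_1}\left(1 - e^{-\delta_1 t}\right)$$
The expression for $P(t)$ is more complicated, but one finds $P(t \to \infty) = D\,C$ with the expression factor $C := \frac{\lambda_1 \lambda_2}{\delta_1 \delta_2}$, which gives the number of proteins per gene.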
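The steady-state claim can be checked numerically. The following is a minimal sketch, assuming the initial condition R(0) = P(0) = 0 and purely hypothetical rate values; the function name `integrate_rate_equations` and the forward-Euler scheme are illustrative choices, not part of the talk.

```python
import math

def integrate_rate_equations(D, lam1, del1, lam2, del2, t_end, dt=1e-3):
    """Forward-Euler integration of
    dR/dt = lam1*D - del1*R  and  dP/dt = lam2*R - del2*P,
    starting from R(0) = P(0) = 0 (an assumed initial condition)."""
    R, P = 0.0, 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        dR = lam1 * D - del1 * R
        dP = lam2 * R - del2 * P
        R += dt * dR
        P += dt * dP
    return R, P

# Hypothetical parameter values, chosen only for illustration.
D, lam1, del1, lam2, del2 = 2, 5.0, 1.0, 4.0, 0.5
t_end = 40.0

C = lam1 * lam2 / (del1 * del2)  # expression factor: proteins per gene
R_num, P_num = integrate_rate_equations(D, lam1, del1, lam2, del2, t_end)
R_exact = D * lam1 / del1 * (1.0 - math.exp(-del1 * t_end))

print("R(t_end): numeric", R_num, "analytic", R_exact)
print("P(t_end): numeric", P_num, "steady state D*C", D * C)
```

After many protein lifetimes the numerical protein number should sit close to D·C, which is the statement that C counts proteins per gene.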

Stochasticity - The Master Equation
However, transcription, translation and degradation are stochastic processes. Probabilistic approach: Master equation.
We have a 2d state space, in which each state is characterized by $R$ and $P$. Usually we would need to deal with $p_{R,P}$. Instead we split the problem into two Master equations:
$$\frac{\partial p_R}{\partial t} = \lambda_1 D\, p_{R-1} + \delta_1 (R+1)\, p_{R+1} - (\lambda_1 D + \delta_1 R)\, p_R$$
$$\frac{\partial p_P}{\partial t} = \lambda_2 R(t)\, p_{P-1} + \delta_2 (P+1)\, p_{P+1} - (\lambda_2 R(t) + \delta_2 P)\, p_P$$
The first equation is decoupled from the second and can be solved exactly, while the second one is more tricky...

mRNA Distribution
The solution to
$$\frac{\partial p_R}{\partial t} = \lambda_1 D\, p_{R-1} + \delta_1 (R+1)\, p_{R+1} - (\lambda_1 D + \delta_1 R)\, p_R$$
is given by a Poisson distribution
$$p_R(t) = \frac{\mu_1(t)^R\, e^{-\mu_1(t)}}{R!}$$
where
$$\mu_1(t) = D\,\frac{\lambda_1}{\delta_1}\left(1 - e^{-\delta_1 t}\right)$$
is the mean mRNA number, as also given by the deterministic rate equations.
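That the time-dependent Poisson distribution solves the mRNA Master equation can be verified numerically. The sketch below compares a central finite-difference time derivative of the Poisson probabilities with the right-hand side of the Master equation; the parameter values and helper names are hypothetical.

```python
import math

def mu1(t, D, lam1, del1):
    """Mean mRNA number mu_1(t) = D*lam1/del1 * (1 - exp(-del1*t))."""
    return D * lam1 / del1 * (1.0 - math.exp(-del1 * t))

def poisson_pmf(k, mu):
    """Poisson probability of k events; zero for k < 0."""
    if k < 0:
        return 0.0
    if mu == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def master_rhs(R, t, D, lam1, del1):
    """Right-hand side of the mRNA Master equation, evaluated on the Poisson ansatz."""
    m = mu1(t, D, lam1, del1)
    gain = lam1 * D * poisson_pmf(R - 1, m) + del1 * (R + 1) * poisson_pmf(R + 1, m)
    loss = (lam1 * D + del1 * R) * poisson_pmf(R, m)
    return gain - loss

def time_derivative(R, t, D, lam1, del1, h=1e-5):
    """Central finite difference of p_R(t) with respect to t."""
    p_plus = poisson_pmf(R, mu1(t + h, D, lam1, del1))
    p_minus = poisson_pmf(R, mu1(t - h, D, lam1, del1))
    return (p_plus - p_minus) / (2.0 * h)

# Hypothetical values for illustration.
D, lam1, del1, t = 3, 2.0, 0.7, 1.3

max_residual = max(
    abs(time_derivative(R, t, D, lam1, del1) - master_rhs(R, t, D, lam1, del1))
    for R in range(30)
)
print("largest mismatch between d p_R/dt and the Master equation:", max_residual)
```

The residual should be at the level of finite-difference error, since for a Poissonian with mean µ₁(t) both sides reduce to (λ₁D − δ₁µ₁)(p_{R−1} − p_R).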

Interlude: The Poisson Distribution
Some properties:
- One-parameter distribution, i.e. the mean $\langle X \rangle$ fully determines the distribution.
- The mean is equal to the variance: $\langle X \rangle = \mathrm{var}(X)$.
- For large mean, by the central limit theorem, a Poissonian is well approximated by a Gaussian.

Protein Distribution
$$\frac{\partial p_P}{\partial t} = \lambda_2 R(t)\, p_{P-1} + \delta_2 (P+1)\, p_{P+1} - (\lambda_2 R(t) + \delta_2 P)\, p_P$$
is analogous to the Master equation for $p_R$, apart from the random variable $R(t)$ taking the place of $D$. The solution is yet again a Poisson distribution:
$$p_P(t) = \frac{\mu_2(t)^P\, e^{-\mu_2(t)}}{P!}$$
Now the mean is a functional of $R(t)$:
$$\mu_2[R(t)] = \frac{\lambda_2 \int_0^t R(t')\, e^{\delta_2 t'}\, dt'}{e^{\delta_2 t} - 1} = \frac{\lambda_2}{\delta_2}\, \frac{\int_0^t R(t')\, e^{\delta_2 t'}\, dt'}{\int_0^t e^{\delta_2 t'}\, dt'}$$
This is a weighted temporal average of $R(t')$, where the weighting function is $\exp(\delta_2 t')$. The recent past has the most weight!
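To make the "recent past has the most weight" statement concrete, the weighted average of R(t') with weight exp(δ₂t') can be evaluated numerically for a few test trajectories. This is a sketch with hypothetical trajectories and rates; the trapezoidal rule and the function names are illustrative assumptions.

```python
import math

def weighted_average_R(R_func, t, del2, n=20000):
    """Trapezoidal estimate of
    int_0^t R(t') w(t') dt' / int_0^t w(t') dt'  with  w(t') = exp(del2 * t').

    The weight is rescaled to exp(del2 * (t' - t)); the common factor cancels
    in the ratio, and the rescaling avoids overflow for large t."""
    dt = t / n
    num = 0.0
    den = 0.0
    for i in range(n + 1):
        tp = i * dt
        w = math.exp(del2 * (tp - t))
        if i == 0 or i == n:
            w *= 0.5  # trapezoidal end-point weights
        num += R_func(tp) * w
        den += w
    return num / den

def mu2(R_func, t, lam2, del2):
    """Protein mean as lam2/del2 times the weighted temporal average of R."""
    return lam2 / del2 * weighted_average_R(R_func, t, del2)

# Hypothetical rates and mRNA trajectories (time measured in units of 1/del2).
lam2, del2, t = 4.0, 1.0, 20.0
constant = lambda tp: 10.0                            # 10 mRNAs at all times
recent_only = lambda tp: 10.0 if tp >= 15.0 else 0.0  # mRNA present only recently
old_only = lambda tp: 10.0 if tp < 15.0 else 0.0      # mRNA present only long ago

avg_const = weighted_average_R(constant, t, del2)
avg_recent = weighted_average_R(recent_only, t, del2)
avg_old = weighted_average_R(old_only, t, del2)

print("constant R:      ", avg_const, "-> mu2 =", mu2(constant, t, lam2, del2))
print("recent mRNA only:", avg_recent)
print("old mRNA only:   ", avg_old)
```

A trajectory that carried mRNA only during the last five protein lifetimes gives almost the same average as the constant one, while mRNA that disappeared five lifetimes ago contributes very little.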

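Separation of Time Scales: 1) mRNA kinetics $\ll 1/\delta_2$
$R(t)$ changes rapidly compared to the protein lifetime $1/\delta_2$, i.e. $R(t)$ fully explores its distribution while the proteins in each cell only see the average $\langle R(t) \rangle = \mu_1$:
$$\mu_2(t) = \frac{\lambda_2}{\delta_2}\, \frac{\int_0^t R(t')\, e^{\delta_2 t'}\, dt'}{\int_0^t e^{\delta_2 t'}\, dt'} \approx \frac{\lambda_2}{\delta_2}\, \mu_1$$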

Separation of Time Scales: 2) mRNA kinetics $\gg 1/\delta_2$
$R(t)$ changes sluggishly, while the proteins follow that signal and equilibrate to the new steady state, forgetting the past very fast. The mean of $P$ is determined only by the recent past of $R(t)$, which can be assumed to be constant in that period. For cells which presently have $R$ mRNAs, the proteins have a Poisson distribution with mean
$$\mu_2(t) = \frac{\lambda_2}{\delta_2}\, \frac{\int_0^t R\, e^{\delta_2 t'}\, dt'}{\int_0^t e^{\delta_2 t'}\, dt'} = \frac{\lambda_2}{\delta_2}\, R.$$
For the whole population we have to sum over all possible states of $R$, each weighted by its probability:
$$p_P = \sum_{R=0}^{\infty} p_R\, \frac{\left(\frac{\lambda_2}{\delta_2} R\right)^P}{P!}\, e^{-\frac{\lambda_2}{\delta_2} R}$$
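The population sum over mRNA states can be evaluated directly. The sketch below assumes a Poisson mRNA distribution with a hypothetical mean `mu1` and a hypothetical ratio `b` = λ₂/δ₂; it checks normalization and the moments that follow from the law of total variance, and it looks at the peak at P = 0.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of k events; mu = 0 gives a peak at k = 0."""
    if mu == 0.0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def protein_pmf(P, mu1, b, R_max=40):
    """Sum over mRNA numbers R of p_R times a Poissonian in P with mean b*R.
    R_max truncates the sum; it is ample for the small mu1 used here."""
    return sum(poisson_pmf(R, mu1) * poisson_pmf(P, b * R) for R in range(R_max + 1))

# Hypothetical values: mean mRNA number and proteins per mRNA (lam2/del2).
mu1, b = 2.0, 20.0
P_max = 1000

pmf = [protein_pmf(P, mu1, b) for P in range(P_max + 1)]
total = sum(pmf)
mean_P = sum(P * p for P, p in enumerate(pmf))
var_P = sum((P - mean_P) ** 2 * p for P, p in enumerate(pmf))

print("normalization:", total)
print("mean:", mean_P, "expected b*mu1 =", b * mu1)
print("variance:", var_P, "expected b*mu1 + b^2*mu1 =", b * mu1 + b * b * mu1)
print("p_P(0):", pmf[0], "vs p_{R=0}:", poisson_pmf(0, mu1))
```

The variance exceeds the mean by b²·µ₁, which is the mRNA distribution showing through in the protein distribution.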

Separation of Time Scales: 2) mRNA kinetics $\gg 1/\delta_2$: Examples
The distribution of mRNA is still visible in the distribution of proteins.
Note: if $R = 0$, the Poissonian for $P$ collapses to a peak at $P = 0$ with height $p_{R=0}$.

Random Number of Genes
Upon viral infection, transfection, or generally in bacteria carrying plasmids or minichromosomes, the number of genes varies from individual to individual. Thus $D$ is no longer constant, but itself a random variable, subject to a distribution $p_D$. In general, to find the protein distribution $p_P^{\mathrm{tot}}$ for the whole population we have to sum over the protein distributions $p_P(D)$ of subpopulations with gene copy number $D$, weighted by their respective probabilities:
$$p_P^{\mathrm{tot}} = \sum_{D=0}^{\infty} p_D\, p_P(D)$$
Since this expression cannot, in general, be determined explicitly, we stick to the biologically relevant case mRNA kinetics $\gg 1/\delta_2$, as discussed above. Again we find a sum of Poissonians:
$$p_P = \sum_{D=0}^{\infty} p_D\, \frac{\mu_2^P}{P!}\, e^{-\mu_2} = \sum_{D=0}^{\infty} p_D\, \frac{(DC)^P}{P!}\, e^{-DC}$$

Random Number of Genes
$$p_P = \sum_{D=0}^{\infty} p_D\, \frac{(DC)^P}{P!}\, e^{-DC}$$
Why is this interesting? Because of the properties of the Poisson distribution, and because often $C \gg 1$!
1. For $C \gg 1$ the Poissonians have a large mean, so they can be approximated by Gaussians.
2. The distance between the means of two adjacent Poissonians is $C$, while their respective widths go like $\sigma = \sqrt{DC}$. Hence there is significant overlap only for $D > \frac{(C-1)^2}{4C}$.
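The overlap bound can be recovered in a few lines, under the assumption (not spelled out on the slide) that "significant overlap" means the distance $C$ between the means of the peaks for $D$ and $D+1$ is smaller than the sum of their widths:

```latex
\begin{align*}
C &< \sqrt{DC} + \sqrt{(D+1)C}
  &&\text{(distance between means smaller than sum of widths)}\\
\sqrt{C} &< \sqrt{D} + \sqrt{D+1}\\
C - 2D - 1 &< 2\sqrt{D(D+1)}
  &&\text{(both sides squared)}\\
(C-1)^2 - 4D(C-1) + 4D^2 &< 4D^2 + 4D
  &&\text{(squared again, for } C - 2D - 1 > 0\text{)}\\
(C-1)^2 &< 4DC\\
D &> \frac{(C-1)^2}{4C} \approx \frac{C}{4} \quad (C \gg 1)
\end{align*}
```

If $C - 2D - 1 \le 0$ the third line holds trivially, and then $D \ge (C-1)/2$ already exceeds the bound for $C > 1$. For large expression factors the first roughly $C/4$ copy-number peaks therefore remain well separated.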

From the Protein Distribution to Copy Number Statistics: Examples
While the separation of the Gaussian peaks is still much greater than their widths, one can even approximate them by a sum of delta peaks (discretized approximation):
$$p_P = \sum_{D} p_D\, \delta_{P,\,DC}\,; \qquad D \in \mathbb{N}_0$$

Approximation                                      Mean <P>    Variance sigma^2(P)
Sum of Poissonians                                 500         5.05 · 10^4
Sum of Gaussians                                   500         5.05 · 10^4
Sum of Gaussians with eta_ext = 0.1                500         5.35 · 10^4
Sum of delta peaks                                 500         5.00 · 10^4
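The four variances in the table can be reproduced from mixture moments. The slide does not state the parameters, so this sketch assumes a Poisson copy-number distribution with mean 5 and C = 100; these values are a guess that happens to match the quoted numbers. It further assumes that extrinsic noise adds a variance (η_ext·D·C)² to each Gaussian peak, which is a modelling assumption and not taken from the talk.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of k events (mu > 0)."""
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))

def mixture_moments(p_D, comp_mean, comp_var):
    """Mean and variance of sum_D p_D * component_D, from the mean and
    variance of each component (law of total variance)."""
    mean = sum(p * comp_mean(D) for D, p in enumerate(p_D))
    second = sum(p * (comp_var(D) + comp_mean(D) ** 2) for D, p in enumerate(p_D))
    return mean, second - mean ** 2

# Assumed parameters (not stated in the source).
C, mean_D, eta_ext = 100.0, 5.0, 0.1
p_D = [poisson_pmf(D, mean_D) for D in range(61)]  # tail beyond 60 is negligible

peak_mean = lambda D: D * C
# Poissonian and Gaussian peaks share mean D*C and variance D*C.
mean_pois, var_pois = mixture_moments(p_D, peak_mean, lambda D: D * C)
# Assumed extrinsic-noise model: extra variance (eta_ext * D * C)^2 per peak.
mean_ext, var_ext = mixture_moments(
    p_D, peak_mean, lambda D: D * C + (eta_ext * D * C) ** 2
)
# Delta peaks carry no width of their own.
mean_delta, var_delta = mixture_moments(p_D, peak_mean, lambda D: 0.0)

print("Poissonians / Gaussians:", mean_pois, var_pois)
print("Gaussians with eta_ext: ", mean_ext, var_ext)
print("delta peaks:            ", mean_delta, var_delta)
```

Under these assumptions the variances come out close to 5.05·10⁴, 5.35·10⁴ and 5.00·10⁴ with mean 500 in every case, so the delta-peak approximation loses only the intrinsic term C·⟨D⟩.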

Single Cell Protein Measurements
Single-cell studies make it possible to obtain protein numbers of individual cells (e.g. by use of GFP and derivatives), but the gene number distribution cannot be measured directly, and sometimes the rate constants and the expression factor are unknown. In these cases the above theory can be applied, provided $C \gg 1$ and mRNA kinetics $\gg 1/\delta_2$:
1. Compute the mean $\langle P \rangle$ and the variance $\mathrm{var}(P)$ of the measured protein numbers.
2. Use the discretized approximation: mean and variance are homogeneous functions of $C$ of degree 1 and 2, respectively, so $C = \frac{\mathrm{var}(P)}{\langle P \rangle}$.
3. Compute the mean gene copy number $\langle D \rangle = \frac{\langle P \rangle}{C}$.
4. If the gene copy number distribution is Poisson (meaningful for transfection), then we know everything about it!
5. From the resulting $p_D$ we can compute the theoretical $p_P$ and compare it to the measured protein distribution as a consistency check.
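The five steps can be tried on synthetic data. In the discretized approximation ⟨P⟩ = C⟨D⟩ and var(P) = C²·var(D), so the ratio var(P)/⟨P⟩ equals C when var(D) = ⟨D⟩, i.e. for Poisson-distributed copy numbers. The sketch below generates hypothetical cells under that assumption (Poisson D, Gaussian peaks of mean D·C and width √(D·C)) and then runs the estimation; all names and parameter values are illustrative.

```python
import math
import random

def sample_poisson(mean, rng):
    """Knuth's multiplication method; adequate for small means."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_cells(n_cells, mean_D, C, rng):
    """Synthetic protein numbers: Poisson copy number D, then a Gaussian
    peak with mean D*C and standard deviation sqrt(D*C) (valid for C >> 1)."""
    proteins = []
    for _ in range(n_cells):
        D = sample_poisson(mean_D, rng)
        if D == 0:
            proteins.append(0)  # untransfected cell: no protein
        else:
            proteins.append(max(0, round(rng.gauss(D * C, math.sqrt(D * C)))))
    return proteins

def infer_copy_number(proteins):
    """Steps 1-4: moments, C = var/mean, <D> = mean/C, Poisson p_{D=0}."""
    n = len(proteins)
    mean_P = sum(proteins) / n
    var_P = sum((p - mean_P) ** 2 for p in proteins) / n
    C_hat = var_P / mean_P
    D_hat = mean_P / C_hat
    p0_hat = math.exp(-D_hat)  # predicted fraction of cells without a gene
    return C_hat, D_hat, p0_hat

rng = random.Random(1)
true_C, true_mean_D = 1000.0, 1.5  # hypothetical ground truth
proteins = simulate_cells(20000, true_mean_D, true_C, rng)

C_hat, D_hat, p0_hat = infer_copy_number(proteins)
frac_dark = sum(1 for p in proteins if p == 0) / len(proteins)

print("estimated C:", C_hat, "true:", true_C)
print("estimated <D>:", D_hat, "true:", true_mean_D)
print("predicted p_{D=0}:", p0_hat, "observed dark fraction:", frac_dark)
```

Comparing the predicted p_{D=0} with the observed fraction of protein-free cells plays the role of the consistency check in step 5. On real data extrinsic noise would bias the estimate of C upward, which this sketch deliberately leaves out.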

Results
Non-fluorescent cells allow for an independent measurement. Strong noise and a bias to the left call for improved experiments and data analysis.

Columns: (1) $p_{D=0}$, (2) $\langle P \rangle$, (3) $\sigma^2(P)$, (4) $\langle D \rangle$, (5) $C$ from $p_{D=0}$ and $\langle P \rangle$, (6) $C$ from $p_{D=0}$ and $\sigma^2(P)$, (7) $C$ from $\langle P \rangle$ and $\sigma^2(P)$.

               (1)    (2)          (3)           (4)    (5)          (6)          (7)
PEI synch.     0.4    4.46·10^6    9.44·10^12    1.38   3.49·10^6    2.46·10^6    3.24·10^6
PEI asynch.    0.23   2.56·10^6    5.84·10^12    1.29   2.25·10^6    1.26·10^6    1.99·10^6
Lipo synch.    0.3    5.91·10^6    1.65·10^13    1.38   4.97·10^6    2.54·10^6    4.29·10^6
Lipo asynch.   0.3    3.75·10^6    1.20·10^13    1.29   3.15·10^6    2.16·10^6    2.90·10^6

Summary:
- Distributions give us information about the underlying processes.
- The expression factor $C := \frac{\lambda_1 \lambda_2}{\delta_1 \delta_2}$ can be obtained from the protein distribution, yielding a functional relationship between the rates.
- The mean number of genes $\langle D \rangle$ and even the distribution of genes can be computed.
- The transfection process can be tested for quality.

Outlook:
- Incorporate promoter activity, poly(A) mRNA degradation, etc. into the analysis.
- Check the derived results by tuning rates: modification of the promoter sequence, destabilizing proteins, mutations in the gene's open reading frames...
- Improve the experimental setup, use better data analysis, reduce extrinsic noise.

Thanks for your attention!