Statistical distributions: Synopsis

Statistical distributions: Synopsis

Basics of Distributions
Special Distributions: Binomial, Exponential, Poisson, Gamma, Chi-Square, F, Extreme-value etc.
Uniform Distribution
Empirical Distributions
Quantile Normalisation

Basics of Distributions

A random variable X has (cumulative) distribution function F(x) = Pr(X ≤ x). Any increasing function F is a distribution function if

0 = F(−∞) ≤ F(x) ≤ F(∞) = 1
F(x) ≤ F(y) when x < y

[Figure 1: Normal Distribution N(1,1)]

Continuous Distributions

If X is continuous then it has density function f(x) = dF/dx, so that

Pr(X ≤ x) = F(x) = ∫_{−∞}^x f(t) dt

[Figure 2: Normal Distribution N(1,1)]
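As a quick numerical check of this relationship (a minimal sketch using the base R functions dnorm and pnorm for N(1,1)):

integrate(dnorm, -Inf, 1.5, mean = 1)$value   # integral of the density up to 1.5: 0.6915
pnorm(1.5, mean = 1)                          # F(1.5) for N(1,1): the same value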

Discrete Distributions

If X takes discrete values (usually integers) then it has the mass function f(x) = Pr(X = x), and

Pr(X ≤ x) = F(x) = Σ_{t ≤ x} f(t)

Poisson Distribution with mean λ: Pr(X = x) = e^{−λ} λ^x / x!

[Figure 3: Poisson Distribution λ = 5]
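The mass/CDF relationship can be checked directly in R (a minimal sketch):

x <- 0:20
all.equal(ppois(x, lambda = 5), cumsum(dpois(x, lambda = 5)))   # TRUE: F(x) is the cumulative sum of f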

Expectation

The mean of the distribution is E(X) = µ = ∫ x f(x) dx

The mean of a sample, X̄ = (Σ_{i=1}^n X_i) / n, converges to E(X) as the sample size n increases.

Variance

var(X) = σ² = ∫ (x − E(X))² f(x) dx = E((X − E(X))²)

The standard deviation σ is measured on the same scale as X.

The sample variance σ̂² = Σ_{i=1}^n (X_i − X̄)² / n converges to σ² as n increases.
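A small simulation sketch of this convergence (numbers in the comments are approximate):

set.seed(1)
x <- rnorm(1e5, mean = 1, sd = 2)
mean(x)                 # close to µ = 1
mean((x - mean(x))^2)   # sample variance, close to σ² = 4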

Additivity

The expectation is additive: E(aX + bY) = aE(X) + bE(Y), so

E(X̄) = E((X_1 + X_2 + ... + X_n) / n) = (E(X_1) + E(X_2) + ... + E(X_n)) / n = µ

The variance is almost additive: for independent X and Y, var(aX + bY) = a² var(X) + b² var(Y), so

var(X̄) = var((X_1 + X_2 + ... + X_n) / n) = (var(X_1) + var(X_2) + ... + var(X_n)) / n² = σ²/n

Chebyshev's inequality

True for any distribution with a variance. It relates the standard deviation σ to the probability of the extreme event |X − E(X)| ≥ kσ:

Pr(|X − E(X)| ≥ kσ) ≤ 1/k²

Let I(V) be the indicator function for an event V, so Pr(V) = Σ_{x ∈ V} f(x) = E(I(V)). Then

Pr(|X − E(X)| ≥ kσ) = E(I(|X − E(X)| ≥ kσ))
= E(I(((X − E(X)) / kσ)² ≥ 1))
≤ E(((X − E(X)) / kσ)²)
= 1/k²
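A simulation sketch comparing the empirical tail probability with the Chebyshev bound; for most distributions the bound is loose:

set.seed(1)
x <- rexp(1e5)                        # any distribution with a variance will do
k <- 2
mean(abs(x - mean(x)) >= k * sd(x))   # empirical Pr(|X - E(X)| >= kσ), about 0.05 here
1 / k^2                               # Chebyshev upper bound: 0.25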

Weak Law of Large Numbers

The sample mean X̄ is a random variable with expectation E(X) and variance σ²/n. Applying Chebyshev's inequality (setting kσ/√n = ɛ, i.e. k = ɛ√n/σ) gives

Pr(|X̄ − E(X)| ≥ ɛ) ≤ σ²/(nɛ²)

For a fixed choice of ɛ, we can choose a sample size n so that this probability is as small as we wish: Pr(|X̄ − E(X)| ≥ ɛ) → 0 in large samples.

Common Distributions

Normal, Binomial, Poisson, Negative Binomial, Exponential, Gamma, Chi-Squared, T, F, Extreme Value

Normal Distribution N(µ, σ²)

The Mother of all distributions, with density function

f(x) = (1 / √(2πσ²)) e^{−(x−µ)² / (2σ²)}

Pr(X ≤ x) = Φ(x) = ∫_{−∞}^x f(t) dt

The Central Limit Theorem says that the distribution of the sample mean X̄ of any distribution with a variance tends to N(µ, σ²/n).
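A quick illustration of the CLT (a sketch: means of n = 30 draws from the strongly skewed Exp(1) already look close to normal):

set.seed(1)
m <- replicate(10000, mean(rexp(30)))   # 10,000 sample means, each from n = 30
qqnorm(m); qqline(m)                    # points near the line: approximately normal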

Binomial Distribution B(n, p)

Probability of r successes in n independent trials, each with probability p of success:

Pr(X = r) = (n! / (r!(n−r)!)) p^r (1−p)^{n−r}

E(X) = np, var(X) = np(1−p)

As n → ∞, the distribution of X/n tends to N(p, p(1−p)/n).

R functions are rbinom, dbinom, pbinom, qbinom.
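For example, evaluating the mass function with dbinom and directly from the formula (a minimal check):

dbinom(3, size = 10, prob = 0.2)   # Pr(X = 3) for B(10, 0.2)
choose(10, 3) * 0.2^3 * 0.8^7      # the same value from the formula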

Multinomial Distribution

Generalisation of the Binomial to multiple outcomes: a multivariate distribution of count data. Each observation takes one of K categorical values (e.g. a 1-D contingency table). The probability of being in group i is p_i, and out of a sample of N observations, N_i are in group i. The probability of observing the vector (N_1, N_2, ..., N_K) is

Pr(N_1, N_2, ..., N_K) = N! ∏_i p_i^{N_i} / N_i!

E(N_i) = N p_i, var(N_i) = N p_i (1 − p_i), cov(N_i, N_j) = −N p_i p_j (for i ≠ j)
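In R, rmultinom draws such count vectors (a small sketch):

set.seed(1)
rmultinom(1, size = 100, prob = c(0.2, 0.3, 0.5))   # one vector (N1, N2, N3) summing to 100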

Poisson Distribution Po(λ)

Probability of r independent events in a given time when the mean rate is λ:

Pr(X = r) = e^{−λ} λ^r / r!

E(X) = λ, var(X) = λ

As λ → ∞, Po(λ) → N(λ, λ).

Additivity: if X ∼ Po(λ) and Y ∼ Po(µ) are independent then X + Y ∼ Po(λ + µ).

Examples: radioactive decay; read coverage in next generation sequencing (not really: it is over-dispersed, see the Negative Binomial below).

Negative Binomial Distribution NB(r, p)

Distribution of the number of successful trials k until r failures, where each trial has probability p of success. The last trial is always a failure.

Pr(X = k) = (k+r−1 choose k) p^k (1−p)^r

E(X) = pr / (1−p), var(X) = pr / (1−p)² = E(X) / (1−p)

Often used to model over-dispersed Poisson distributions, where the variance exceeds the mean, e.g. in RNAseq data:

Po(λ) = lim_{r→∞} NB(r, λ/(λ+r))
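A simulation sketch of the over-dispersion; note that R's rnbinom counts failures before size successes, the mirror image of the convention above (swap p and 1 − p):

set.seed(1)
x <- rnbinom(1e5, size = 5, prob = 0.5)   # mean 5 under R's parameterisation
mean(x); var(x)                           # the variance clearly exceeds the mean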

Exponential Distribution Exp(λ)

Distribution of times between Poisson events. Memoryless: the past does not influence the future.

F(t) = Pr(T < t) = 1 − e^{−λt}, f(t) = λ e^{−λt}

E(T) = 1/λ, var(T) = 1/λ²

−log p-values ∼ Exp(1); waiting times between events in a Poisson process are Exp(λ).

Note: the geometric distribution is the discrete analogue of the exponential. It gives the distribution of the number of successful trials before the first failure, Pr(X = r) = p^r (1−p).
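A quick check that −log of uniform null p-values behaves as Exp(1) (a sketch using ks.test, introduced at the end of these notes):

set.seed(1)
p <- runif(1e4)               # null p-values are U(0, 1)
ks.test(-log(p), "pexp", 1)   # large p-value: consistent with Exp(1)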

Gamma Distribution Γ(n, λ)

Distribution of the sum of n independent exponential random variables:

Pr(T < t) = ∫_0^t λ (λu)^{n−1} e^{−λu} / Γ(n) du

E(T) = n/λ, var(T) = n/λ²

NOTE: in R, the parameters of dgamma() etc. are the shape a and scale s, corresponding to n and 1/λ.
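A simulation sketch of the sum-of-exponentials construction, checked against pgamma:

set.seed(1)
s <- replicate(1e4, sum(rexp(3, rate = 2)))   # sums of n = 3 independent Exp(2) variables
mean(s <= 1)                                  # simulated Pr(T <= 1), about 0.32
pgamma(1, shape = 3, rate = 2)                # exact value from Γ(3, 2)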

Example: Fisher's Method for Combining P-values

Often we only have p-values (not the underlying data), e.g. from a collection of GWASs, and we want to test whether a collection of p-values is jointly significant (e.g. for all SNPs in a gene). We are primarily interested in the smaller p-values, so looking at sums of p-values is not optimal: such a sum is dominated by the uninteresting large p-values. Instead, consider S = −Σ_i log(p_i), which is dominated by contributions from the smaller p-values. If the p-values are independent then S is distributed like a sum of n exponential random variables with λ = 1, i.e. S ∼ Γ(n, 1).
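A sketch of Fisher's method via the Gamma tail (fisher_p is a hypothetical helper name; the p-values are assumed independent):

fisher_p <- function(p)
  pgamma(sum(-log(p)), shape = length(p), rate = 1, lower.tail = FALSE)
fisher_p(c(0.01, 0.04, 0.20))   # combined p-value for three independent tests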

Chi-squared Distribution χ²_n

Distribution of the sum of n squared standard Normal random variables. Also a special case of the Gamma distribution: χ²_n ∼ Γ(n/2, 1/2).

Applications: likelihood ratio tests; the distribution of the sample variance; contingency tables.
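The Gamma relationship can be verified numerically (a minimal check):

all.equal(pchisq(3.5, df = 4), pgamma(3.5, shape = 2, rate = 0.5))   # TRUE: χ²_4 = Γ(2, 1/2)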

T Distribution T_n

Distribution of the ratio T = Z / √(S/n) of a standard normal random variable Z ∼ N(0, 1) to the square root of an independent Chi-squared S ∼ χ²_n on n df, scaled by its degrees of freedom.

f(x) = (Γ((n+1)/2) / (√(nπ) Γ(n/2))) (1 + x²/n)^{−(n+1)/2}, where Γ(n) = ∫_0^∞ t^{n−1} e^{−t} dt

Used in the T-test and to compute confidence intervals.

Note that T_1 is the same as the Cauchy distribution and does not have a finite variance.
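The Cauchy connection is a one-line check in R:

all.equal(dt(0.5, df = 1), dcauchy(0.5))   # TRUE: T_1 has the Cauchy density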

F Distribution F(n, m)

Distribution of the ratio of two scaled Chi-squared variables on n, m df. Used in ANOVA and for comparing variances.

If X_1 ∼ χ²_n and X_2 ∼ χ²_m then W = (X_1/n) / (X_2/m) = X_1 m / (X_2 n) ∼ F(n, m). The density is

f(w) = √((nw)^n m^m / (nw+m)^{n+m}) / (w B(n/2, m/2)), where B(a, b) = ∫_0^1 t^{a−1} (1−t)^{b−1} dt

E(W) = m / (m−2) (for m > 2)

F(n, ∞) ∼ χ²_n / n; F(1, m) ∼ T²_m
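The T² identity can be checked numerically (a minimal sketch, using Pr(T² ≤ t²) = 1 − 2 Pr(T ≤ −t)):

all.equal(pf(2.5^2, df1 = 1, df2 = 10), 1 - 2 * pt(-2.5, df = 10))   # TRUE: F(1, m) = T²_m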

Extreme-Value Distributions (EVD)

Distribution of M_n = max(X_1, X_2, ..., X_n). There are three limiting distributions, depending on the underlying distribution of the X_i (compare to the Central Limit Theorem...). We can find a_n, b_n such that Pr((M_n − a_n)/b_n < t) → G(t):

Weibull: G(t) = exp(−(−(t−a)/b)^α) for t < a; G(t) = 1 otherwise
Gumbel: G(t) = exp(−e^{−(t−a)/b})
Fréchet: G(t) = 0 for t ≤ a; G(t) = exp(−((t−a)/b)^{−α}) otherwise

see R package GEV

EVD Example: Longest run R of successes in N trials

Consider the number M(r) of runs of length at least r; then Pr(R < r) = Pr(M(r) = 0).

Pr(a run of at least r starts at a given position) = (1−p) p^r

When r is large, long runs are very rare events, and M(r) will be Poisson distributed with mean µ = N(1−p)p^r:

Pr(M(r) = 0) = e^{−µ} = e^{−N(1−p)p^r} = exp(−e^{−(r−a)/b})

where 1/b = −log p and a/b = log(N(1−p)): a Gumbel distribution.

This argument, modelling rare events as a Poisson process, can be used more generally.
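A simulation sketch of the longest run (longest_run is a hypothetical helper built on rle):

longest_run <- function(N, p) {
  r <- rle(runif(N) < p)           # runs of successes and failures
  max(c(0, r$lengths[r$values]))   # longest run of TRUE (successes)
}
set.seed(1)
hist(replicate(2000, longest_run(1e4, 0.5)))   # right-skewed, Gumbel-like shape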

Estimating Probability Densities - Kernel Density Estimation

Data X_1, X_2, ..., X_N come from an unknown distribution. The density function can be estimated by superimposing many tiny distributions, each with variance σ², centred on the X_i:

f̂(x) = (1/(Nσ)) Σ_i φ((x − X_i)/σ)

Often φ(x) is the density of a standard Normal N(0, 1); σ controls the degree of smoothing. See the R function density().

[Figure 4: estimated density; N = 5650, bandwidth = 1.927]
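For example (a sketch; the bw argument of density() plays the role of σ):

set.seed(1)
x <- rgamma(500, shape = 2)
plot(density(x))            # Gaussian kernel by default
lines(density(x, bw = 1))   # a larger bandwidth gives heavier smoothing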

Fitting a distribution

The R function fitdistr in library MASS fits a parametric distribution to a sample by maximum likelihood, e.g. fitdistr(x, "gamma").

[Figure 5: estimated density and Gamma fit]
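A sketch on simulated data, where the true parameters are known:

library(MASS)
set.seed(1)
x <- rgamma(1000, shape = 2, rate = 0.5)
fitdistr(x, "gamma")   # ML estimates of shape and rate, close to 2 and 0.5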

Empirical Cumulative Distribution Function (ECDF)

Given a sample x_1, x_2, ..., x_n, define the indicator function I_i(t) = 1 if x_i ≤ t (and 0 otherwise). Then the ECDF is

F̂(t) = Σ_i I_i(t) / n

i.e. the fraction of the sample ≤ t. Asymptotically F̂(t) → F(t) as n → ∞.

The R function ecdf() will compute the ECDF of a sample.
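For example:

x <- rnorm(100)
Fhat <- ecdf(x)   # ecdf() returns a function
Fhat(0)           # fraction of the sample <= 0, close to 0.5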

The Uniform Distribution

X is uniformly distributed between 0 and 1: X ∼ U(0, 1).

F(x) = Pr(X < x) = x (0 ≤ x ≤ 1)
f(x) = 1 (0 ≤ x ≤ 1)

R functions for U(0, 1) are punif, dunif, qunif, runif.

Quantile Normalisation

Let X have distribution function F(x), and let u = F(x). Then the random variable U = F(X) is uniformly distributed:

Pr(U ≤ u) = Pr(F(X) ≤ F(x)) = Pr(X ≤ x) = F(x) = u

So we can transform any distribution into the uniform. Equally, we can transform a uniform into any distribution using the quantile function F⁻¹(U).
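A one-line illustration of this probability integral transform:

hist(pnorm(rnorm(1e4)))   # F applied to its own random variable: a flat U(0, 1) histogram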

Quantile Normalisation

To transform a sample to a target distribution with CDF Φ(t): the ECDF quantile normalises the sample to the uniform distribution, x_i → F̂(x_i), and then Φ⁻¹(F̂(x_i)) has the desired distribution.

R code to quantile normalise a vector x to a Normal:

n = length(x) + 1
r = rank(x) / n   # Fhat
q = qnorm(r)      # Phi_inv
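The same steps wrapped as a function (a sketch; qnormalise is a hypothetical helper name, and ties in x share their average rank by default):

# Dividing by n + 1 keeps the ranks strictly inside (0, 1),
# so qnorm never returns +/-Inf
qnormalise <- function(x) qnorm(rank(x) / (length(x) + 1))
qnormalise(c(3.2, 0.1, 7.5))   # monotone in x: the order of the data is preserved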

Quantile Normalisation

Transforms data with an awkward distribution to a better-behaved one. Often stabilises the behaviour of P-values, but can lose power. Useful in GWAS.

Quantile-Quantile Plots

Does a sample x_1, x_2, ..., x_n come from a given distribution F? Sort the sample, y_1 ≤ y_2 ≤ ... ≤ y_n, and quantile normalise it to a uniform: F(y_1) ≤ F(y_2) ≤ ... ≤ F(y_n). The expected quantiles of a uniform distribution are 1/(n+1), 2/(n+1), ..., n/(n+1), so

E(F(y_k)) = k/(n+1)

If F is correct then we expect linear plots of F(y_k) vs E(F(y_k)), or of −log10 F(y_k) vs −log10 E(F(y_k)).
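A minimal QQ plot sketch against the normal (equivalent to the uniform version above after applying Φ⁻¹ to both axes):

set.seed(1)
y <- sort(rnorm(200))
expected <- qnorm((1:200) / 201)   # Φ⁻¹ of the expected uniform quantiles k/(n+1)
plot(expected, y); abline(0, 1)    # points near the line: N(0, 1) fits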

QQ Plots Examples

[Figure 7: observed quantile vs theoretical quantile; black: sample from N(0,1), red: sample from T_2]

QQ Plots Examples

The most common scenario is when the sample X_1, ..., X_n consists of millions of log p-values from a GWAS. Inflation of P-values can occur because of:

Population structure
Non-normality of the phenotype (for quantitative traits)
Linkage disequilibrium
Unknown reasons...

Comparing Distributions: Kolmogorov-Smirnov Test

Are two continuous ECDFs F̂_1(x), F̂_2(x) the same? The test statistic is

D = max_x |F̂_1(x) − F̂_2(x)|

The distribution of D is known, and is independent of F̂_1, F̂_2, so it can be used to compare them. The test can also compare an ECDF to a given known distribution, and the distribution of the KS statistic D also provides confidence intervals for QQ plots.

The KS test is implemented in the R function ks.test(). Other tests for comparing distributions include the Chi-squared test (especially for discrete-valued distributions) and the Anderson-Darling test.
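For example (a sketch; a mean shift of this size is usually detected at n = 500):

set.seed(1)
ks.test(rnorm(500), rnorm(500, mean = 0.3))   # two-sample test: small p-value, distributions differ
ks.test(rnorm(500), "pnorm")                  # one-sample test against N(0, 1): large p-value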