APM 504: Probability Notes
Jay Taylor (ASU), Spring 2015

Outline
1. Probability and Uncertainty
2. Random Variables: Discrete Distributions, Continuous Distributions
3. Multivariate Distributions
4. Sums of Random Variables

Probability and Uncertainty

Probability: Interpretations and Basic Principles

Frequentist interpretation: The probability of an event is equal to its limiting frequency in an infinite series of independent, identical trials.

Bayesian interpretations:
- Logical: The probability of a proposition is equal to the strength of the evidence in favor of the proposition.
- Subjective: The probability of a proposition is equal to the strength of an individual's belief in the proposition.

Although probabilities can be interpreted in different ways, these interpretations are usually based on the same mathematical rules. To describe these, we will use P(E) to denote the probability of an event or proposition E.

Probability Axioms
1. If S is certain to be true, then P(S) = 1.
2. 0 ≤ P(E) ≤ 1 for any proposition E.
3. If E and F are mutually exclusive propositions, then the probability that either E or F is true is equal to the sum of the probabilities of E and of F: P(E or F) = P(E) + P(F).

A more formal treatment can be given using Kolmogorov's notion of a probability space.

Definition. A probability space is a triple $(\Omega, \mathcal{F}, P)$ consisting of the following objects:
1. A set $\Omega$ called the sample space.
2. A collection $\mathcal{F}$ of subsets of $\Omega$, called a $\sigma$-algebra, satisfying:
   - $\Omega \in \mathcal{F}$;
   - if $A \in \mathcal{F}$, then $\bar{A} = \Omega \setminus A \in \mathcal{F}$;
   - if $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
3. A probability measure $P : \mathcal{F} \to \mathbb{R}$ satisfying:
   - $P(A) \geq 0$ for every $A \in \mathcal{F}$;
   - $P(\Omega) = 1$;
   - $P\left(\bigcup_{i \geq 1} A_i\right) = \sum_{i \geq 1} P(A_i)$ for any countable collection of disjoint sets $A_i \in \mathcal{F}$.

The axioms of probability imply several useful properties:
- $P(\bar{A}) = 1 - P(A)$;
- $P(\emptyset) = 0$;
- if $A \subset B$, then $P(A) \leq P(B)$;
- $P(A \cup B) = P(A) + P(B) - P(A, B)$;
- the general inclusion-exclusion formula:
$$P\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) - \sum_{i<j} P(A_i A_j) + \sum_{i<j<k} P(A_i A_j A_k) - \cdots + (-1)^{n+1} P(A_1 A_2 \cdots A_n).$$

Exercise: Show that these properties follow from the definition.

The probability assigned to a proposition depends on the information or evidence available to us. This can be made explicit through conditional probability.

Conditional Probability. Suppose that E and F are propositions and that P(E) > 0. If we know E to be true, then the conditional probability of F given E is equal to
$$P(F \mid E) = \frac{P(E \text{ and } F)}{P(E)}.$$

[Venn diagram: with P(E) = 0.46, P(F) = 0.33, and P(E and F) = 0.21, we obtain P(F | E) = 0.21/0.46 ≈ 0.47.]

Joint probabilities can often be calculated by conditioning on one of the propositions.

Product Rule
$$P(E \text{ and } F) = P(E)\,P(F \mid E).$$

Example: Suppose that two balls are sampled without replacement from an urn containing five red balls and five blue balls. If we let E be the event that the first ball sampled is red and F be the event that the second ball sampled is red, then the probability that both balls sampled are red is
$$P(E, F) = P(E)\,P(F \mid E) = \frac{5}{10} \cdot \frac{4}{9} = \frac{2}{9}.$$

Because we can condition on either E or F, the joint probability of E and F can be decomposed in two different ways using the product rule:
$$P(E \text{ and } F) = \begin{cases} P(E)\,P(F \mid E) & \text{(conditioning on } E\text{)} \\ P(F)\,P(E \mid F) & \text{(conditioning on } F\text{).} \end{cases}$$
It follows that the two expressions on the right-hand side are equal, i.e., P(E) P(F | E) = P(F) P(E | F), and if we then divide both sides by P(E), we arrive at one of the most important formulas in probability theory:

Bayes' Formula
$$P(F \mid E) = \frac{P(F)\,P(E \mid F)}{P(E)}.$$
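As a quick numerical illustration (a minimal sketch in Python, not part of the original notes), the probabilities from the Venn diagram example above can be pushed through Bayes' formula directly:

    # Numbers from the Venn diagram example: P(E) = 0.46, P(F) = 0.33, P(E and F) = 0.21
    p_E, p_F, p_EF = 0.46, 0.33, 0.21

    p_E_given_F = p_EF / p_F                 # P(E | F) from the product rule
    p_F_given_E = p_F * p_E_given_F / p_E    # Bayes' formula
    print(p_F_given_E)                       # ~0.4565, the same as P(E and F)/P(E)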

The denominator in Bayes' formula can often be calculated with the help of the following formula.

The Law of Total Probability
Suppose that F_1, ..., F_n are mutually exclusive events and that B is an event contained in the union F_1 ∪ ... ∪ F_n. Then
$$P(B) = \sum_{i=1}^n P(B, F_i) = \sum_{i=1}^n P(F_i)\,P(B \mid F_i).$$
In other words, the probability of B is equal to the average of the conditional probabilities P(B | F_i), weighted by the probabilities of the events being conditioned on.

Example: Reversed Sexual Size Dimorphism in Spotted Owls
Like many raptors, adult female Spotted Owls (Strix occidentalis) are larger, on average, than their male counterparts. For example, a study of a California population found that the wing chord distribution is approximately normal with mean 329 mm and standard deviation 6 mm in females, and mean 320 mm and standard deviation 6 mm in males (Blakesley et al., 1990, J. Field Ornithology).

[Figure: wing chord densities for male and female Spotted Owls.]

Problem: Suppose that an adult bird with a wing chord of 329 mm is randomly sampled from a population with a 1:1 adult sex ratio. What is the probability that this bird is a female?

Solution: Let F (resp., M) be the event that the bird is female (resp., male) and let W be the event that the wing chord is 329 mm. Then
$$P(F) = 0.5$$
$$p(W \mid F) = \frac{1}{6\sqrt{2\pi}}\, e^{-(329-329)^2/72} \approx 0.0665$$
$$p(W) = P(F)\,p(W \mid F) + P(M)\,p(W \mid M) = 0.5 \cdot \frac{1}{6\sqrt{2\pi}}\, e^{-(329-329)^2/72} + 0.5 \cdot \frac{1}{6\sqrt{2\pi}}\, e^{-(329-320)^2/72} \approx 0.0441,$$
and upon substituting these quantities into Bayes' formula we find that
$$P(F \mid W) = \frac{P(F)\,p(W \mid F)}{p(W)} \approx 0.75.$$
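The same posterior can be checked numerically. The sketch below (an illustration, assuming SciPy is available; the means 329 and 320, the standard deviation 6, and the 0.75 answer come from the example above) uses scipy.stats.norm for the two density evaluations:

    from scipy.stats import norm

    prior_F, prior_M = 0.5, 0.5              # 1:1 adult sex ratio
    w = 329.0                                # observed wing chord (mm)

    lik_F = norm.pdf(w, loc=329, scale=6)    # p(W | F), ~0.0665
    lik_M = norm.pdf(w, loc=320, scale=6)    # p(W | M)
    p_w = prior_F * lik_F + prior_M * lik_M  # law of total probability, ~0.0441

    print(prior_F * lik_F / p_w)             # posterior P(F | W), ~0.75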

In general, there is no simple relationship between the probabilities P(A), P(B), and P(A, B). However, there is an important special case where the three probabilities are related by a simple identity.

Independence
Events A and B are independent if P(A, B) = P(A) P(B). Events A_1, A_2, A_3, ... are independent if every finite collection {A_{i_1}, A_{i_2}, ..., A_{i_n}} of distinct events satisfies the identity
$$P\left(\bigcap_{j=1}^n A_{i_j}\right) = \prod_{j=1}^n P(A_{i_j}).$$

Colloquially, we say that two events are independent if there is no causal or logical relationship between them, i.e., knowing that B is true does not change the likelihood that A is true. Indeed, if A and B are independent and P(A) > 0 and P(B) > 0, then
$$P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A), \qquad P(B \mid A) = \frac{P(A, B)}{P(A)} = \frac{P(A)P(B)}{P(A)} = P(B),$$
which shows that our formal definition of independence is consistent with this heuristic interpretation. However, the formal definition is slightly broader in that it applies even when one or more of the events has probability 0.

Random Variables

Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space that represents our beliefs concerning the state of some system. In many cases, it will not be possible to directly observe which state ω ∈ Ω the system occupies, but it will be possible to conduct experiments that provide some information about this state. Mathematically, we can model such an experiment by a function
$$X : \Omega \to E$$
defined on the sample space and taking values in a set E. Since the value of the variable X(ω) depends on the unknown state ω of the system, we say that X is a random variable. If we wish to emphasize the range of X, then we say that X is an E-valued random variable.

Distribution of a Random Variable
Suppose that X : Ω → E is a random variable defined on a probability space $(\Omega, \mathcal{F}, P)$. The distribution of X is the probability distribution P_X on E defined by the following identity:
$$P_X(A) \equiv P(X \in A) = P(X^{-1}(A)).$$
Notice that the two expressions on the right-hand side are determined by the probability measure P on Ω, i.e., the distribution of a random variable depends on the probability distribution on the underlying space on which the variable is defined.

Technical aside: To make this definition rigorous, we need to introduce a σ-algebra $\mathcal{E}$ on E and require that $X^{-1}(A) \in \mathcal{F}$ whenever $A \in \mathcal{E}$. X is said to be measurable with respect to the two σ-algebras $\mathcal{F}$ and $\mathcal{E}$ if this condition is satisfied. The distribution of X is then a probability measure on the measurable space $(E, \mathcal{E})$.

Discrete Distributions

Discrete Random Variables
Definition. A random variable X is said to be discrete if X takes values in a set E that is either finite or countably infinite. The probability mass function of a discrete random variable X is the function p_X : E → [0, 1] defined by the formula
$$p_X(e) = P(X = e).$$

Example: A random variable X with values in the set E = {0, 1} and probability mass function p_X(1) = p, p_X(0) = 1 − p is said to be a Bernoulli random variable with parameter p.

The probability mass function of a discrete random variable completely determines its distribution via the following identity.

Calculating probabilities with probability mass functions
Let X be a discrete random variable with probability mass function p_X : E → [0, 1]. Then, for any subset A ⊂ E,
$$P(X \in A) = \sum_{x \in A} p_X(x).$$
In other words, to calculate the probability that X belongs to A, we simply sum the probability mass function over all of the values in A. In particular, when the probability mass function is summed over the entire space, the sum must be equal to 1:
$$\sum_{x \in E} p_X(x) = P(X \in E) = 1.$$

Expectations of Discrete Variables
If X is a discrete random variable with values in a subset of the real numbers, then the expected value (expectation, mean) of X is the weighted average of these values:
$$E[X] = \sum_{x \in E} p_X(x)\,x.$$

Example: If X is Bernoulli with parameter p, then the expected value of X is E[X] = p · 1 + (1 − p) · 0 = p. Notice that the expected value of a random variable need not belong to the range of the variable.

Properties of Expectations
The following two properties are often useful:
- Linearity: $E[c_1 X_1 + \cdots + c_n X_n] = \sum_{i=1}^n c_i E[X_i]$.
- Transformations: If f : E → ℝ is a real-valued function, then $E[f(X)] = \sum_{x \in E} p_X(x)\,f(x)$.

Caveat: If f is a non-linear function, then in general E[f(X)] ≠ f(E[X]).

Variance
If X is a discrete random variable with values in a subset of the real numbers, then the variance of X is the weighted average of the squared difference between X and its mean:
$$\mathrm{Var}(X) \equiv E\left[(X - E[X])^2\right] = \sum_{x \in E} p_X(x)\,(x - E[X])^2.$$
In practice, it is often more convenient to calculate the variance using the following formula:
$$\mathrm{Var}(X) = E[X^2] - E[X]^2.$$

Example: If X ~ Bernoulli(p), then, since X² = X, Var(X) = p − p² = p(1 − p).
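Both variance formulas can be evaluated directly from a probability mass function. A small illustrative sketch for the Bernoulli(p) example (the value p = 0.3 is arbitrary, chosen only for illustration):

    p = 0.3
    pmf = {0: 1 - p, 1: p}    # Bernoulli(p) probability mass function

    mean = sum(prob * x for x, prob in pmf.items())
    var1 = sum(prob * (x - mean) ** 2 for x, prob in pmf.items())     # E[(X - E[X])^2]
    var2 = sum(prob * x ** 2 for x, prob in pmf.items()) - mean ** 2  # E[X^2] - E[X]^2

    print(mean, var1, var2, p * (1 - p))   # the three variance values all equal 0.21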

Binomial Distribution
X is said to have the binomial distribution with parameters n ≥ 1 and p ∈ [0, 1], written X ~ Binomial(n, p), if X takes values in the set E = {0, 1, ..., n} with probability mass function
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad k = 0, 1, \ldots, n.$$
Furthermore, the mean and the variance of X are E[X] = np and Var(X) = np(1 − p).

Application: Suppose that we perform a series of n independent, identical (IID) trials, each of which results in a success with probability p or a failure with probability 1 − p. Then the total number of successes in the n trials is a binomial random variable with parameters n and p. In particular, a Bernoulli random variable with parameter p is also a binomial random variable with parameters n = 1 and p.

Geometric Distribution
X is said to have the geometric distribution with parameter p ∈ [0, 1], written X ~ Geometric(p), if X takes values in the non-negative integers E = {0, 1, ...} with probability mass function
$$P(X = k) = (1 - p)^k p, \quad k \geq 0.$$
Furthermore, the mean and the variance of X are E[X] = (1 − p)/p and Var(X) = (1 − p)/p².

Application: Suppose that we perform a series of IID trials, each of which results in a success with probability p. Then the number of failures that occur before we observe the first success is geometrically distributed with parameter p.

Alternate definitions: Some authors define the geometric distribution to be the distribution of the number of trials required to obtain the first success, in which case X takes values in the set E = {1, 2, ...} and P(X = k) = (1 − p)^{k−1} p.

Negative Binomial Distribution
X is said to have the negative binomial distribution with parameters r > 0 and p ∈ [0, 1], written X ~ NB(r, p), if X takes values in the non-negative integers E = {0, 1, ...} with probability mass function
$$P(X = k) = \binom{r + k - 1}{k} p^r (1 - p)^k, \quad k \geq 0.$$
Furthermore, the mean and the variance of X are E[X] = r(1 − p)/p and Var(X) = r(1 − p)/p².

Application: Suppose that we perform a series of IID trials, each of which results in a success with probability p. Then the number of failures that occur before the r-th success is a negative binomial random variable with parameters r and p.

The following plot shows the probability mass functions for a series of negative binomial distributions with fixed success probability p = 0.5 and increasing r = 0.2, 1, 5, 10. When r = 1, this distribution reduces to a geometric distribution. However, as r increases, the distribution becomes more symmetric (less skewed) around its mean.

[Figure: negative binomial probability mass functions p(k) for p = 0.5 and r = 0.2, 1, 5, 10; the r = 1 curve is the Geometric(0.5) distribution.]

Poisson Distribution
X is said to have the Poisson distribution with parameter λ ≥ 0, written X ~ Poisson(λ), if X takes values in the non-negative integers E = {0, 1, ...} with probability mass function
$$P(X = k) = e^{-\lambda}\,\frac{\lambda^k}{k!}, \quad k \geq 0.$$
Furthermore, the mean and the variance of X are E[X] = Var(X) = λ.

Application: Suppose that we perform a large number n of IID trials, each with success probability p = λ/n. Then the total number of successes is approximately Poisson distributed with parameter λ, and this approximation becomes exact in the limit as n → ∞. This is a special case of the Law of Rare Events.
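The Law of Rare Events can be seen numerically by comparing the Binomial(n, λ/n) and Poisson(λ) probability mass functions for a moderately large n. A minimal sketch, assuming SciPy is available (the values λ = 2 and n = 1000 are arbitrary illustration choices):

    from scipy.stats import binom, poisson

    lam, n = 2.0, 1000

    for k in range(8):
        # Binomial(n, lam/n) pmf vs. Poisson(lam) pmf at k; the columns nearly agree
        print(k, binom.pmf(k, n, lam / n), poisson.pmf(k, lam))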

The Poisson distribution provides a useful approximation for the distribution of many kinds of count data. Some examples include:
- The number of misspelled words on a page of a book.
- The number of misdialed phone numbers in a city on a particular day.
- The number of beta particles emitted by a ¹⁴C source in an hour.
- The number of major earthquakes that occur in a year.
- The number of mutated sites in a gene that differ between two closely related species.

The Poisson distribution is less suited for modeling count data corresponding to events or individuals that are clumped. In such cases, the data are usually overdispersed, i.e., the variance is greater than the mean, and the negative binomial distribution may provide a better fit.

For the negative binomial distribution, $\sigma^2 = \mu + \frac{1}{r}\mu^2$, so r is an inverse measure of aggregation or dispersion: smaller values of r give a more skewed distribution. If we let r → ∞ with p chosen so that the mean is µ, then NB(r, p) → Poisson(µ).

Examples: numbers of macroparasites per individual and numbers of infections per outbreak are usually modeled with the negative binomial distribution.

[Figure: negative binomial and Poisson probability mass functions with a common mean.]

Continuous Distributions

A real-valued random variable X is said to be continuous if there is a non-negative function p : ℝ → [0, ∞], called the probability density function of X, such that
$$P(X \in A) = \int_A p(x)\,dx$$
for any (nice) subset A ⊂ ℝ.

Technical aside: A set is nice if it belongs to the Borel σ-algebra on ℝ. This is the smallest σ-algebra on ℝ that contains all open intervals. While we will largely ignore measurability in this course, it is an important concept in fully rigorous treatments of probability theory. See David Williams' book Probability with Martingales (1991).

The probability density function plays the same role for continuous random variables that the probability mass function plays for discrete random variables. Be careful, however, not to confuse the two. The existence of a density function has several consequences:
- Densities integrate to 1: $\int_{-\infty}^{\infty} p(x)\,dx = P(X \in \mathbb{R}) = 1$.
- Points have zero probability mass: $P(X = t) = \int_t^t p(x)\,dx = 0$.

These two identities seem to lead to a paradox. On the one hand, the probability that X is equal to any particular point x is zero for every x. On the other hand, the probability that X is equal to some point, i.e., X ∈ ℝ, is 1. However, there is no contradiction here, since there are uncountably infinitely many real numbers and we have not required that probability distributions be additive over arbitrary collections of disjoint sets:
$$1 = P(X \in \mathbb{R}) = P\left(\bigcup_{x \in \mathbb{R}} \{X = x\}\right) \neq \sum_{x \in \mathbb{R}} P(X = x) = 0.$$

Cumulative Distribution Function
If X is a real-valued random variable (not necessarily continuous), the cumulative distribution function of X is the function F : ℝ → [0, 1] defined by F(x) = P(X ≤ x). If X is continuous, then the density p(x) and the cumulative distribution function F(x) are related in the following way:
$$F(x) = \int_{-\infty}^{x} p(t)\,dt \quad \text{and} \quad p(x) = F'(x).$$
Furthermore, the density of X at a point x can be estimated by the following approximation:
$$p(x) \approx \frac{P(x - \epsilon < X \leq x + \epsilon)}{2\epsilon} = \frac{F(x + \epsilon) - F(x - \epsilon)}{2\epsilon}.$$

Expectations of Continuous Random Variables
If X is a continuous random variable with probability density function p(x), then the expected value of X is the weighted average
$$E[X] = \int_{-\infty}^{\infty} x\,p(x)\,dx.$$
In general, expectations of continuous random variables behave like expectations of discrete random variables provided that we replace the sum by an integral and the probability mass function by the probability density function. For example, if f : ℝ → ℝ is a real-valued function and X is as in the definition, then
$$E[f(X)] = \int_{-\infty}^{\infty} f(x)\,p(x)\,dx.$$

Uniform Distribution
X is said to have the uniform distribution on the interval (l, u), written X ~ U(l, u), if X takes values in the bounded set E = (l, u) with probability density function
$$p(x) = \frac{1}{u - l}, \quad x \in (l, u).$$
Furthermore, the mean and the variance of X are
$$E[X] = \frac{l + u}{2}, \qquad \mathrm{Var}(X) = \frac{(u - l)^2}{12}.$$
If l = 0 and u = 1, then X is said to be a standard uniform random variable.

Application: In Monte Carlo simulations, we usually must transform a sequence of independent standard uniform random variables into a sequence of random variables with the target distribution.
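One standard way to perform this transformation is inverse transform sampling. As a minimal sketch (not part of the original notes), the exponential CDF F(t) = 1 − e^{-λt} introduced later in this section can be inverted to turn standard uniforms into exponential draws:

    import math
    import random

    def sample_exponential(lam, n):
        """Transform standard uniforms into Exp(lam) draws via the inverse CDF."""
        # If U ~ U(0, 1), then X = -log(1 - U)/lam satisfies P(X <= t) = 1 - exp(-lam*t).
        return [-math.log(1.0 - random.random()) / lam for _ in range(n)]

    draws = sample_exponential(lam=2.0, n=100_000)
    print(sum(draws) / len(draws))   # sample mean should be close to 1/lam = 0.5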

Beta Distribution
X is said to have the Beta distribution with parameters a, b > 0, written X ~ Beta(a, b), if X takes values in the set E = (0, 1) with probability density function
$$p(x) = \frac{1}{\beta(a, b)}\, x^{a-1} (1 - x)^{b-1}, \quad x \in (0, 1).$$
Here, β(a, b) is the Beta function, which is defined by the integral
$$\beta(a, b) = \int_0^1 x^{a-1} (1 - x)^{b-1}\,dx.$$
Furthermore, the mean and the variance of X are
$$E[X] = \frac{a}{a + b}, \qquad \mathrm{Var}(X) = \frac{ab}{(a + b)^2 (a + b + 1)}.$$

Remark: If a = b = 1, then the Beta distribution reduces to the standard uniform distribution.

Application: The Beta distribution provides a flexible family of probability distributions for quantities that take values in the interval [0, 1], e.g., proportions or probabilities. The figure below shows that, depending on whether the parameters are less than or greater than 1, the Beta density will either be bimodal, with maxima at x = 0 and x = 1, or unimodal, with a mode at (a − 1)/(a + b − 2).

Other applications include:
- order statistics for IID uniform variables;
- posterior distributions of success probabilities;
- equilibrium frequencies of neutral alleles in panmictic populations.

[Figure: Beta densities for a = b = 0.1 and a = b = 2.]

Exponential Distribution
X is said to have the exponential distribution with rate parameter λ > 0, written X ~ Exp(λ), if X takes values in the set E = [0, ∞) with probability density function
$$p(x) = \lambda e^{-\lambda x}, \quad x \geq 0.$$
Furthermore, the mean and the variance of X are E[X] = 1/λ and Var(X) = 1/λ², while the cumulative distribution function is
$$F(t) = P(X \leq t) = 1 - e^{-\lambda t}.$$

Application: Exponential random variables are often used to model random times, e.g., the time until a radioactive nucleus decays. Of course, the shape of such a distribution should not depend on the units of measurement, and this is true of the exponential distribution. In particular, if X is exponentially distributed with rate λ and Y = γX, then Y is exponentially distributed with rate λ/γ.

Memorylessness of Exponential Distributions
The exponential distributions are the only distributions that satisfy the following property. Given t, s > 0,
$$P(X > t + s \mid X > t) = \frac{P(X > t + s)}{P(X > t)} = \frac{e^{-\lambda(t+s)}}{e^{-\lambda t}} = e^{-\lambda s} = P(X > s).$$
In other words, the variable has no memory of having survived until time t: the probability that it survives from time t to time t + s is the same as the probability of surviving from time 0 to time s. Put another way, the rate at which the variable dies is constant in time.

There is also an important connection between the geometric distribution and the exponential distribution.

Approximation of the Geometric Distribution by the Exponential
Suppose that for each n ≥ 1, X_n is a geometric random variable with success probability p_n = λ/n, and let Y_n = X_n / n. Each of these variables has mean E[Y_n] = λ^{-1} − n^{-1} ≈ λ^{-1}. Furthermore, if we let n tend to infinity, then
$$P(Y_n > t) = P(X_n > nt) = \left(1 - \frac{\lambda}{n}\right)^{nt} \to e^{-\lambda t} = P(X > t),$$
where X is exponentially distributed with rate λ.

In other words, we can approximate a geometric random variable with small success probability by an exponential random variable, provided that we change the units in which time is measured.

Gamma Distribution
X is said to have the gamma distribution with shape parameter α > 0 and rate parameter λ > 0, written X ~ Gamma(α, λ), if X takes values in the set E = [0, ∞) with probability density function
$$p(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\lambda x}, \quad x \geq 0.$$
Here, Γ(α) is the Gamma function, which is defined for α > 0 by the integral
$$\Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\,dx.$$
Furthermore, the mean and the variance of X are
$$E[X] = \frac{\alpha}{\lambda}, \qquad \mathrm{Var}(X) = \frac{\alpha}{\lambda^2}.$$

Remark: Sometimes the Gamma distribution is described in terms of the shape parameter α and a scale parameter θ = 1/λ.

The Gamma function introduced on the preceding slide is an important object in its own right and appears in many settings in mathematics and statistics. Although Γ(α) usually cannot be explicitly evaluated, the following identity, obtained by integration by parts, is often useful:
$$\Gamma(\alpha + 1) = \int_0^\infty x^\alpha e^{-x}\,dx = \left[x^\alpha(-e^{-x})\right]_0^\infty + \int_0^\infty \alpha x^{\alpha - 1} e^{-x}\,dx = \alpha\,\Gamma(\alpha).$$
In particular, if α = n is an integer, then
$$\Gamma(n + 1) = n\,\Gamma(n) = n(n - 1)\,\Gamma(n - 1) = \cdots = n!\,\Gamma(1) = n!$$
since Γ(1) = 1. Thus the Gamma function can be regarded as a smooth extension of the factorial function to the positive real numbers.
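The identity Γ(n + 1) = n! is easy to confirm numerically. A brief sketch, assuming SciPy is available:

    import math
    from scipy.special import gamma

    for n in range(1, 8):
        print(n, gamma(n + 1), math.factorial(n))   # the two columns agree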

The Gamma distribution is related to the exponential distribution in much the same way that the negative binomial distribution is related to the geometric distribution.
- When α = 1, the Gamma distribution reduces to the exponential distribution.
- If X_1, ..., X_n are independent exponential random variables with rate λ, then their sum X = X_1 + ... + X_n is a Gamma random variable with shape parameter α = n and rate λ. Thus the Gamma distribution is often used to model random lifespans that elapse after a series of independent events.

[Figure: Gamma densities with rate 1 and shape a = 0.5, 1, 2, 5; the a = 1 curve is the Exponential(1) density.]

Normal Distributions
X is said to have the normal (or Gaussian) distribution with mean µ and variance σ² > 0, written X ~ N(µ, σ²), if X takes values in ℝ with probability density function
$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2/2\sigma^2}.$$
In this case, the mean and the variance of X are µ and σ², as implied by the name of the distribution. Furthermore, if µ = 0 and σ² = 1, then X is said to be a standard normal random variable.

Applications: Normal distributions arise as the limiting distribution of a sum of a large number of independent random variables. This is the content of the Central Limit Theorem. Many quantities encountered in biological systems are approximately normally distributed, including many morphological traits.

Some useful properties of the normal distribution include:
- Sums of independent normal RVs are normal: if X_1, ..., X_n are independent normal RVs and X_i ~ N(µ_i, σ_i²), then their sum X = X_1 + ... + X_n is a normal random variable with mean µ_1 + ... + µ_n and variance σ_1² + ... + σ_n².
- Linear transforms preserve normality: if X ~ N(µ, σ²), then Y = aX + b is normal with mean aµ + b and variance a²σ². In particular, if Z ~ N(0, 1), then X = σZ + µ ~ N(µ, σ²).

[Figure: normal densities with σ² = 0.25, 1, 4, 25.]

Multivariate Distributions and Random Vectors

Joint and Marginal Distributions
Suppose that X_1, ..., X_n are random variables defined on a common probability space $(\Omega, \mathcal{F}, P)$ with values in the sets E_1, ..., E_n, respectively. Then the joint distribution of these variables is the distribution of the random vector X = (X_1, ..., X_n) on the set E = E_1 × ... × E_n, i.e.,
$$P(X \in A) = P((X_1, \ldots, X_n) \in A) \quad \text{for } A \subset E.$$
Furthermore, the distribution of each variable X_i considered on its own is said to be the marginal distribution of that variable. The joint distribution of a collection of random variables tells us how the variables are related to one another.

Example: If each of the variables X_1, ..., X_n is marginally discrete, then the joint distribution is uniquely determined by the joint probability mass function p_{X_1,...,X_n} : E → [0, 1], which is defined by
$$p_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = P(X_1 = x_1, \ldots, X_n = x_n).$$
In this case, the marginal probability mass function of X_i can be recovered by summing the joint probability mass function over all possible values of the remaining variables:
$$p_{X_i}(y) = \sum_{x \in E\,:\,x_i = y} p_{X_1, \ldots, X_n}(x_1, \ldots, x_n).$$

Multinomial Distribution
Let n ≥ 1 and let p_1, ..., p_k be a collection of non-negative real numbers such that p_1 + ... + p_k = 1. We say that the random vector X = (X_1, ..., X_k) has the multinomial distribution with parameters n and (p_1, ..., p_k) if each of the variables X_i takes values in the set {0, ..., n} and if the joint probability mass function of these variables is given by
$$p(n_1, \ldots, n_k) = \binom{n}{n_1, \ldots, n_k}\, p_1^{n_1} \cdots p_k^{n_k},$$
provided n_1 + ... + n_k = n.

Application: As the name suggests, the multinomial distribution generalizes the binomial distribution. Suppose that we conduct n IID trials and that each trial can result in one of k possible outcomes, which have probabilities p_1, ..., p_k. If X_i denotes the number of trials that result in the i-th outcome, then (X_1, ..., X_k) has the multinomial distribution with parameters n and (p_1, ..., p_k).
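In practice, multinomial draws and probabilities are easy to compute with NumPy and SciPy. An illustrative sketch (the values n = 10 and p = (0.2, 0.3, 0.5) are arbitrary and assume both libraries are available):

    import numpy as np
    from scipy.stats import multinomial

    n, p = 10, [0.2, 0.3, 0.5]

    counts = np.random.multinomial(n, p)    # one draw: counts of the k = 3 outcomes
    print(counts, counts.sum())             # the counts always sum to n

    # joint probability mass function at one particular outcome vector
    print(multinomial.pmf([2, 3, 5], n=n, p=p))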

Joint Continuity
A collection of real-valued random variables X_1, ..., X_n is said to be jointly continuous if there is a function f : ℝⁿ → [0, ∞], called the joint density function, such that for every set A = [a_1, b_1] × ... × [a_n, b_n],
$$P((X_1, \ldots, X_n) \in A) = \int_A f(x_1, \ldots, x_n)\,dx_1 \cdots dx_n.$$
In this case, each variable X_i is marginally continuous and the marginal density functions can be recovered by integrating the joint density function, e.g.,
$$f_{X_1}(y) = \int_{\mathbb{R}^{n-1}} f(y, x_2, \ldots, x_n)\,dx_2 \cdots dx_n.$$

Dirichlet Distribution
Suppose that k ≥ 2 and let α_1, ..., α_k be a collection of positive real numbers with sum α = α_1 + ... + α_k. We say that the random vector X = (X_1, ..., X_k) has the Dirichlet distribution with parameters k and (α_1, ..., α_k) if X takes values in the (k − 1)-dimensional simplex
$$\Delta_{k-1} = \left\{ (x_1, \ldots, x_k) : x_1, \ldots, x_k \geq 0 \text{ and } \sum_{i=1}^k x_i = 1 \right\}$$
with joint density function
$$p(x_1, \ldots, x_k) = \frac{\Gamma(\alpha)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k x_i^{\alpha_i - 1}.$$
When k = 2, the Dirichlet distribution reduces to the Beta distribution on the segment {(x, 1 − x) : 0 ≤ x ≤ 1}. Notice that Δ_{k−1} can be identified with the set of probability distributions on the discrete set {1, ..., k}.

When working with stochastic processes, it is often useful to consider the conditional distribution of one set of variables given knowledge of another set of variables.

Conditional Distributions
1. Suppose that X and Y are discrete random variables with joint probability mass function p_{X,Y}. Then the conditional distribution of X given Y = y is
$$P(X = x \mid Y = y) = \frac{p_{X,Y}(x, y)}{p_Y(y)},$$
provided p_Y(y) > 0.
2. Similarly, if X and Y are jointly continuous with joint density function f_{X,Y}, then, conditional on Y = y, X is conditionally continuous with conditional density function
$$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)},$$
provided f_Y(y) > 0.

Although we can always recover the marginal distributions of a collection of random variables from their joint distribution, the inverse operation usually is not possible without additional information. One case where this is possible is when the variables are independent.

Independence of Random Variables
1. The random variables X_1, ..., X_n are said to be independent if
$$P(X_1 \in E_1, \ldots, X_n \in E_n) = \prod_{i=1}^n P(X_i \in E_i)$$
for all sets E_1, ..., E_n such that the events {X_i ∈ E_i} are well-defined.
2. An infinite collection of random variables is said to be independent if every finite sub-collection is independent.

In other words, X and Y are independent if and only if the events {X ∈ E} and {Y ∈ F} are independent for all subsets E and F of the ranges of X and Y.

As a general rule, calculations involving multivariate distributions are greatly simplified when the component variables are independent. The following theorem provides an important example of this principle.

Theorem. Suppose that X_1, ..., X_n are independent real-valued random variables and that f_1, ..., f_n are functions from ℝ to ℝ. Then f_1(X_1), ..., f_n(X_n) are independent real-valued random variables and
$$E\left[\prod_{i=1}^n f_i(X_i)\right] = \prod_{i=1}^n E\left[f_i(X_i)\right],$$
whenever the expectations on both sides of the identity are defined. In particular, by letting each f_i(x) = x be the identity function, we obtain the following important special case:
$$E\left[\prod_{i=1}^n X_i\right] = \prod_{i=1}^n E[X_i].$$

Of course, it is often the case that variables of interest are not independent, and then we need metrics to quantify the extent to which they depend on each other. One such metric is the covariance.

Covariance
Suppose that X and Y are real-valued random variables defined on the same probability space. The covariance of X and Y is the quantity
$$\mathrm{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right] = E[XY] - E[X]E[Y].$$
The covariance between two random variables is a measure of their linear association. In particular, if X and Y are independent, then, since the theorem on the previous slide shows that E[XY] = E[X]E[Y], it follows that Cov(X, Y) = 0. However, the converse of this result is not true: the mere fact that two variables have zero covariance does not imply that they are independent.

Properties of Covariance
Covariances have a number of useful properties:
- Cov(X, X) = Var(X);
- Cov(X, Y) = Cov(Y, X);
- Cov(aX, bY) = ab Cov(X, Y);
- bilinearity: $\mathrm{Cov}\left(\sum_{i=1}^n X_i, \sum_{j=1}^m Y_j\right) = \sum_{i=1}^n \sum_{j=1}^m \mathrm{Cov}(X_i, Y_j)$;
- the Cauchy-Schwarz inequality: $|\mathrm{Cov}(X, Y)| \leq \sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}$.

Exercise: Verify the above properties.

When we work with random vectors containing more than two variables, it is often useful to organize the covariances into a single matrix.

Variance-Covariance Matrix
Suppose that X_1, ..., X_n are real-valued random variables and let σ_ij = Cov(X_i, X_j) denote the covariance of X_i and X_j. Then the variance-covariance matrix of the random vector X = (X_1, ..., X_n) is the n × n matrix Σ with entry σ_ij in the i-th row and j-th column:
$$\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{pmatrix}.$$
Because Cov(X, Y) = Cov(Y, X), it is clear that every variance-covariance matrix is symmetric. Furthermore, it can be shown that any such matrix is also non-negative definite, i.e., given any (column) vector β ∈ ℝⁿ, we have
$$\beta^T \Sigma \beta = \mathrm{Var}\left(\sum_{i=1}^n \beta_i X_i\right) \geq 0.$$
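In practice, variance-covariance matrices are usually estimated from data. The sketch below (illustrative only, assuming NumPy; the simulated data and the induced correlation are arbitrary) builds a sample covariance matrix and checks the symmetry and non-negative definiteness noted above:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))          # 1000 observations of 3 variables
    X[:, 2] = X[:, 0] + 0.5 * X[:, 1]       # introduce some correlation

    Sigma = np.cov(X, rowvar=False)         # 3 x 3 sample variance-covariance matrix

    print(np.allclose(Sigma, Sigma.T))      # symmetric
    print(np.linalg.eigvalsh(Sigma) >= 0)   # eigenvalues non-negative (non-negative definite)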

Multivariate Normal Distribution
A continuous random vector X = (X_1, ..., X_n) with values in ℝⁿ is said to have the multivariate normal distribution with mean vector µ = (µ_1, ..., µ_n) and n × n variance-covariance matrix Σ if it has joint density function
$$p(x) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\}.$$
Here Σ is a positive-definite symmetric matrix with determinant |Σ| > 0 and inverse Σ^{-1}. If X is multivariate normal, then every component variable X_i is a normal random variable. Furthermore, if X and Y are multivariate normal random variables with values in ℝⁿ, then so is X + Y, i.e., sums of multivariate normal random variables are also multivariate normal.

Sums of Independent Random Variables

Many problems in applied probability and statistics involve sums of independent random variables. For example, in discrete-time branching processes, the size of a population at time t + 1 is equal to the sum of the number of offspring born to each adult female alive at time t. In some cases, the distribution of the sum of two independent integer-valued random variables can be calculated with the help of the following result.

Theorem. Suppose that X and Y are independent integer-valued random variables with probability mass functions p_X and p_Y, respectively. Then Z = X + Y is an integer-valued random variable with probability mass function
$$p_Z(n) = p_X * p_Y(n) \equiv \sum_{k=-\infty}^{\infty} p_X(k)\,p_Y(n - k).$$
The operation p_X * p_Y is called the discrete convolution of p_X and p_Y.

The proof of this result uses only elementary properties of probability distributions and independence:
$$p_Z(n) = P(Z = n) = P(X + Y = n) = P\left(\bigcup_{k=-\infty}^{\infty} \{X = k,\, Y = n - k\}\right) = \sum_{k=-\infty}^{\infty} P(X = k,\, Y = n - k) = \sum_{k=-\infty}^{\infty} P(X = k)\,P(Y = n - k) = \sum_{k=-\infty}^{\infty} p_X(k)\,p_Y(n - k).$$

By way of example, we can show that the sum of two independent Poisson random variables is Poisson. Suppose that X ~ Poisson(λ) and Y ~ Poisson(µ) are independent and let Z = X + Y. Then the probability mass function of Z is
$$p_Z(n) = \sum_{k=-\infty}^{\infty} p_X(k)\,p_Y(n - k) = \sum_{k=0}^{n} \left(e^{-\lambda}\frac{\lambda^k}{k!}\right)\left(e^{-\mu}\frac{\mu^{n-k}}{(n-k)!}\right) = e^{-(\lambda + \mu)}\,\frac{1}{n!} \sum_{k=0}^{n} \frac{n!}{k!(n-k)!}\,\lambda^k \mu^{n-k} = e^{-(\lambda + \mu)}\,\frac{(\lambda + \mu)^n}{n!},$$
which shows that X + Y ~ Poisson(λ + µ). This result is very useful when working with Poisson processes.
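The same identity can be verified numerically: convolving two (truncated) Poisson mass functions reproduces the Poisson(λ + µ) mass function. A minimal sketch, assuming NumPy and SciPy (the rates 1.5 and 2.5 and the truncation point are arbitrary):

    import numpy as np
    from scipy.stats import poisson

    lam, mu = 1.5, 2.5
    K = 60                                  # truncation point; the tail mass beyond K is negligible

    p_x = poisson.pmf(np.arange(K), lam)
    p_y = poisson.pmf(np.arange(K), mu)

    p_z = np.convolve(p_x, p_y)[:K]         # discrete convolution p_X * p_Y
    print(np.max(np.abs(p_z - poisson.pmf(np.arange(K), lam + mu))))   # essentially zero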

An analogous result holds for sums of independent continuous random variables.

Theorem. Let X and Y be independent continuous random variables with densities p_X and p_Y, respectively. Then Z = X + Y is a continuous random variable with density
$$p_Z(z) = p_X * p_Y(z) \equiv \int_{-\infty}^{\infty} p_X(t)\,p_Y(z - t)\,dt,$$
and p_X * p_Y(z) is called the convolution integral of p_X and p_Y.

Exercise: Use this theorem to show that the sum of two independent exponential random variables with rate parameter λ is Gamma distributed with shape parameter α = 2 and rate parameter λ.
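As a numerical sanity check on this exercise (not a proof; a sketch assuming SciPy, with λ = 2 chosen arbitrarily), one can evaluate the convolution integral of two Exp(λ) densities and compare it with the Gamma(2, λ) density at a few points:

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import expon, gamma

    lam = 2.0

    def conv_density(z):
        # (p_X * p_Y)(z) = integral of p_X(t) p_Y(z - t) dt; both densities vanish off [0, z]
        integrand = lambda t: expon.pdf(t, scale=1/lam) * expon.pdf(z - t, scale=1/lam)
        return quad(integrand, 0.0, z)[0]

    for z in [0.5, 1.0, 2.0]:
        print(conv_density(z), gamma.pdf(z, a=2, scale=1/lam))   # the two values agree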

Two of the most important results in probability theory concern the asymptotic or limiting behavior of a sum of independent random variables as the number of terms tends to infinity.

The Strong Law of Large Numbers (SLLN)
Suppose that X_1, X_2, ... is a sequence of independent and identically distributed (IID) real-valued random variables with E|X_1| < ∞. If µ = E[X_1] and S_n = X_1 + ... + X_n, then the sequence of sample means S_n/n converges almost surely to µ, i.e.,
$$P\left( \lim_{n \to \infty} \frac{1}{n} S_n = \mu \right) = 1.$$

Interpretation: As the number of independent trials increases, the sample mean is certain to converge to the true mean.

The SLLN is an example of a more general heuristic, which states that deterministic behavior can emerge in random systems containing a large number of weakly interacting components. For example, it is this heuristic which justifies the use of deterministic ODEs to model chemical reactions. However, in many instances, we are interested in the fluctuations of the system about this deterministic limit. This is addressed by the Central Limit Theorem.

Central Limit Theorem (CLT)
Suppose that X_1, X_2, ... is a sequence of IID real-valued random variables with finite mean µ and finite variance σ². If
$$S_n = X_1 + \cdots + X_n \quad \text{and} \quad Z_n = \frac{1}{\sigma\sqrt{n}}(S_n - n\mu),$$
then the sequence Z_1, Z_2, ... converges in distribution to a standard normal random variable Z, i.e., for every t ∈ (−∞, ∞),
$$\lim_{n \to \infty} P(Z_n \leq t) = P(Z \leq t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx.$$

The variables Z_1, Z_2, ... introduced in the CLT are said to be standardized in the sense that for every n ≥ 1, E[Z_n] = 0 and Var(Z_n) = 1. In other words, since we already know that the sample means S_n/n are converging almost surely to the mean µ, to study the fluctuations of S_n/n around this limit we need to subtract the limit and then amplify the differences S_n/n − µ by a factor that grows rapidly enough to compensate for the fact that this difference is tending to 0:
$$Z_n = \frac{\sqrt{n}}{\sigma}\left( \frac{1}{n} S_n - \mu \right).$$
What the CLT tells us is that, irrespective of the distribution of the X_i's, these amplified differences will be approximately normally distributed when n is large. This observation presumably explains why so many quantities are approximately normally distributed: in effect, the microscopic details are lost to the Gaussian limit whenever we consider a macroscopic system in which the components act additively.

Example: Suppose that X_1, X_2, ... are independent Bernoulli random variables with success probability p, and let S_n = X_1 + ... + X_n be the sum of the first n variables. Since S_n is the number of successes in a series of n independent trials, we know that S_n is a binomial random variable with parameters n and p. Furthermore, by the CLT, we know that when n is large, the sums
$$Z_n = \frac{1}{\sqrt{np(1 - p)}}\,(S_n - np)$$
are approximately normally distributed. However, since linear transformations of a normal random variable are normal, it follows that S_n is itself approximately normally distributed with mean np and variance np(1 − p), i.e.,
$$\mathrm{Binomial}(n, p) \approx \mathcal{N}(np,\, np(1 - p)).$$

Remark: This observation can be used to construct a fast approximate algorithm for sampling from a binomial distribution.
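A rough version of the approximate sampler mentioned in the remark might look like the sketch below (illustrative only, assuming NumPy; a production sampler would treat small n and extreme p more carefully):

    import numpy as np

    def approx_binomial(n, p, size, rng=np.random.default_rng()):
        """Approximate Binomial(n, p) draws using the normal limit N(np, np(1-p))."""
        z = rng.normal(loc=n * p, scale=np.sqrt(n * p * (1 - p)), size=size)
        return np.clip(np.rint(z), 0, n).astype(int)   # round and clip to {0, ..., n}

    draws = approx_binomial(n=100, p=0.2, size=100_000)
    print(draws.mean(), draws.var())   # close to np = 20 and np(1-p) = 16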

The convergence of the binomial distribution to the normal distribution with increasing n is illustrated in the figure below, which compares the normal distribution with mean np and variance np(1 − p) to the binomial distribution for n = 10 (left) and n = 100 (right) when p = 0.2.

Another example of the scope of the CLT is provided by the distribution of adult human heights, which is approximately normal. The figure shows a histogram of the heights of a sample of 5000 adults (source: SOCR), as well as the best-fitting normal distribution.

[Figure: histogram of adult heights in inches with a fitted normal density.]

Normality of quantitative traits can be explained by Fisher's infinitesimal model:
- The trait depends on a large number L of variable loci.
- The two alleles at each locus have small effects X_{l,m} and X_{l,p} on the trait.
- The loci act additively.
Then an individual's height may be expressed as
$$H = \sum_{l=1}^{L} (X_{l,m} + X_{l,p}) + \epsilon,$$
where ε is the random environmental effect on height.