Econ 514: Probability and Statistics. Lecture 6: Special probability distributions.
Summarizing probability distributions

Let X be a random variable with probability distribution P_X. We consider two types of probability distributions:

Discrete distributions: P_X is absolutely continuous with respect to the counting measure.
Continuous distributions: P_X is absolutely continuous with respect to the Lebesgue measure.

In both cases there is a density f_X. Initially we consider scalar X and later random vectors X.
How do we summarize the probability distribution P_X? An obvious method is to make a graph of the density: figure 1 for the discrete case and figures 2 and 3 for the continuous case.
[Figures 1-3: graphs of discrete and continuous densities]
The graph can be used to visualize the support and to compute probabilities: intervals where f_X is large have high probability.

Summarizing using moments

We can also try to summarize P_X by numbers. This never gives a complete picture, because we summarize a function f_X by some number. The obvious choice is E(X), the expected value of X and the mean of the distribution P_X.

Interpretation: average over repetitions. Repeat the random experiment N times and call the outcomes X_1, X_2, X_3, ..., X_N. If N is large then

(1/N) Σ_{i=1}^N X_i ≈ E(X),

i.e. the mean is the average over repetitions.
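The average-over-repetitions interpretation is easy to check by simulation. A minimal sketch (not part of the notes; the uniform example and the number of repetitions are arbitrary choices):

```python
import random

def sample_mean(n_reps, draw, seed=0):
    """Average of the outcomes of n_reps repetitions of a random experiment."""
    rng = random.Random(seed)
    return sum(draw(rng) for _ in range(n_reps)) / n_reps

# X uniform on [-1, 1], so E(X) = 0; the average over many repetitions is close to it.
m = sample_mean(100_000, lambda rng: rng.uniform(-1.0, 1.0))
```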
Interpretation: optimal prediction. Consider a predictor m of the outcome X. The prediction error of this predictor is X - m. Assume that the loss function is proportional to (X - m)^2, i.e. proportional to the squared prediction error. The optimal predictor minimizes the expected loss:

E((X - m)^2) = E(((X - E(X)) + (E(X) - m))^2)
= E((X - E(X))^2 + 2(X - E(X))(E(X) - m) + (E(X) - m)^2)
= E((X - E(X))^2) + (E(X) - m)^2,

because E(X - E(X)) = 0. This is minimal if m = E(X).

Special case: if f_X is symmetric around µ, i.e. f_X(µ + x) = f_X(µ - x), then, provided E(|X|) < ∞, we have E(X) = µ.

E(X) can be outside the support of X: see figure 3. Implication for prediction?
The mean E(X) is a measure of location of the distribution P_X. A measure of dispersion is the variance of X, defined by

Var(X) = E((X - E(X))^2).

The interpretation is clear in the discrete case:

Var(X) = Σ_i (x_i - E(X))^2 f_X(x_i),

with x_i - E(X) the deviation from the mean and f_X(x_i) the weight, i.e. the probability of that deviation. We have

Var(X) = E((X - E(X))^2) = E(X^2 - 2X E(X) + E(X)^2) = E(X^2) - 2E(X)^2 + E(X)^2 = E(X^2) - E(X)^2.

This is useful in computations.
Often we use µ or µ_X for E(X) and σ^2 or σ_X^2 for Var(X). The standard deviation of (the distribution of) X, often denoted by σ_X, is defined by

σ_X = √Var(X).

Example: picking a number at random from [-1, 1]:

f_X(x) = (1/2) I_{[-1,1]}(x).

By symmetry E(X) = 0. The variance is equal to E(X^2):

Var(X) = σ_X^2 = ∫_{-1}^{1} x^2 (1/2) dx = [x^3/6]_{-1}^{1} = 1/3.

Standard deviation: σ_X = 1/√3.
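The example can be checked by simulation, a sketch not in the notes (sample size chosen arbitrarily):

```python
import random

rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(200_000)]
mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
# mean should be close to 0 and var close to Var(X) = 1/3
```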
Mean and variance are determined by E(X) and E(X^2). These are the first two moments of the distribution of X. In general the k-th moment, often denoted by µ_k, is

µ_k = E(X^k).

We can also define the central moments, i.e. the moments around E(X) = µ:

m_k = E((X - µ)^k).

The third central moment measures skewness and the fourth kurtosis. If the distribution is symmetric then the skewness is 0 (see exercise). Kurtosis is a measure of peakedness (useful if the distribution is symmetric).
More moments means more knowledge about the distribution. What if we know all moments µ_k, k = 1, 2, ...?

A useful tool for obtaining moments is the moment generating function (mgf) of X, denoted by M_X(t) and defined by

M_X(t) = E(e^{tX}),

if this expectation exists for -h < t < h, for some h > 0. Obviously M_X(0) = 1. Take the derivative with respect to t and interchange integration and differentiation:

dM_X(t)/dt = ∫ x e^{tx} f_X(x) dx.

For a non-negative random variable this is allowed if E(X e^{hX}) < ∞. Why? Hence

dM_X(0)/dt = E(X).
In general

d^k M_X(t)/dt^k = ∫ x^k e^{tx} f_X(x) dx,

so that

d^k M_X(0)/dt^k = E(X^k).

The moments E(X^k) do not in general uniquely determine the distribution of X; Casella and Berger give a counterexample. With some further assumptions the moments do determine the distribution:

If the distributions of X and Y have bounded support, then they are the same if and only if all their moments are the same.
If the moment generating functions M_X, M_Y exist and are equal for -h < t < h, then X and Y have the same distribution.
We can also consider the characteristic function

m_X(t) = E(e^{itX}).

Because

e^{itx} = cos(tx) + i sin(tx)

is bounded, the characteristic function always exists. There is a 1-1 correspondence between characteristic functions and distributions.
Special distributions

There is a catalogue of standard distributions P_X of random variables X. Often a random experiment that we encounter in practice is such that the associated random variable X has such a standard distribution. Choosing a standard distribution is the selection of a mathematical model for a random experiment, described by the probability space (R, B, P_X). Often P_X depends on parameters that have to be chosen in order to have a fully specified mathematical model.

Description of special distributions:
(i) In what type of random experiments can the standard distribution be used?
(ii) Mean, variance, mgf (if it exists).
(iii) Shape of the density, i.e. the graph of the density.
Discrete distributions

Discrete uniform distribution

Consider a random experiment with a finite number of outcomes that without loss of generality can be labeled 1, ..., N. If the outcomes are equally likely, P_X has a density with respect to the counting measure:

f_X(x) = Pr(X = x) = 1/N, x = 1, ..., N
       = 0, elsewhere.

This is the discrete uniform distribution. It has one parameter, N. Moments etc. only have meaning if the outcomes 1, ..., N are not just labels, but a count.

Moment generating function:

M_X(t) = (1/N) Σ_{k=1}^N e^{tk} = (e^t/N) (1 - e^{tN})/(1 - e^t).

Using Σ_{k=1}^N k = N(N+1)/2 and Σ_{k=1}^N k^2 = N(N+1)(2N+1)/6, we have

E(X) = (1/N) Σ_{k=1}^N k = (N+1)/2
E(X^2) = (1/N) Σ_{k=1}^N k^2 = (N+1)(2N+1)/6
so that

Var(X) = E(X^2) - E(X)^2 = (N+1)(2N+1)/6 - (N+1)^2/4 = (N+1)(N-1)/12.

Bernoulli distribution

The random experiment has two outcomes that we label 0 and 1. Denote Pr(X = 1) = p. P_X has a density with respect to the counting measure:

f_X(x) = p^x (1-p)^{1-x}, x = 0, 1
       = 0, elsewhere.

This is the Bernoulli distribution. There is one parameter p with 0 ≤ p ≤ 1. The mgf is

M_X(t) = p e^t + 1 - p

and

E(X) = p
E(X^2) = E(X) = p
Var(X) = E(X^2) - E(X)^2 = p - p^2 = p(1-p).

Binomial distribution

Consider a sequence of independent Bernoulli random experiments (or trials). Define X as the number of 1's in n trials. Consider the event X = x.
For this event, x trials must have outcome 1 and n - x trials outcome 0. One sequence with x 1's and n - x 0's is, e.g., the sequence with x leading 1's followed by n - x 0's. The probability of this sequence is p^x (1-p)^{n-x}. There are C(n, x) = n!/(x!(n-x)!) sequences of 0's and 1's that have the same probability, so that

Pr(X = x) = C(n, x) p^x (1-p)^{n-x}.

Hence P_X has a density with respect to the counting measure:

f_X(x) = C(n, x) p^x (1-p)^{n-x}, x = 0, 1, ..., n
       = 0, elsewhere.

This is the Binomial distribution. Notation: X ~ B(n, p).

Binomial formula:

(a + b)^n = Σ_{k=0}^n C(n, k) a^k b^{n-k}.

We use this formula to establish that the density sums to 1:

Σ_{x=0}^n C(n, x) p^x (1-p)^{n-x} = (p + (1-p))^n = 1.
The mgf is

M_X(t) = Σ_{x=0}^n C(n, x) e^{tx} p^x (1-p)^{n-x} = Σ_{x=0}^n C(n, x) (pe^t)^x (1-p)^{n-x} = (pe^t + 1 - p)^n.

Using the mgf we find

E(X) = dM_X(0)/dt = [n(pe^t + 1 - p)^{n-1} pe^t]_{t=0} = np
E(X^2) = d^2M_X(0)/dt^2 = [n(pe^t + 1 - p)^{n-1} pe^t + n(n-1)(pe^t + 1 - p)^{n-2} p^2e^{2t}]_{t=0} = np + n(n-1)p^2,

so that

Var(X) = E(X^2) - E(X)^2 = np(1-p).

Let Y_k be the outcome of the k-th Bernoulli trial, so that

X = Σ_{k=1}^n Y_k,

with Y_k, k = 1, ..., n stochastically independent. This implies

E(X) = nE(Y_1) = np
Var(X) = nVar(Y_1) = np(1-p)
M_X(t) = (M_{Y_1}(t))^n = (pe^t + 1 - p)^n.

Shape of the density f_X:
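The mean np and variance np(1-p) can be verified directly from the pmf. A small sketch (not in the notes; n and p are arbitrary):

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial density: C(n, x) p^x (1-p)^(n-x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
mean = sum(x * binom_pmf(x, n, p) for x in range(n + 1))
var = sum(x * x * binom_pmf(x, n, p) for x in range(n + 1)) - mean**2
# mean == n*p == 3.0 and var == n*p*(1-p) == 2.1
```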
We have

f_X(x)/f_X(x-1) = 1 + ((n+1)p - x)/(x(1-p)).

We conclude that f_X is increasing for x < (n+1)p and decreasing for x > (n+1)p. If p > n/(n+1) then f_X is increasing for x = 0, ..., n, and if p < 1/(n+1) then f_X is decreasing for x = 0, ..., n. Otherwise f_X is first increasing, then decreasing. The value of x that maximizes f_X is called the mode of the distribution of X. For the binomial distribution the mode is the largest integer less than or equal to (n+1)p (if (n+1)p is an integer, both (n+1)p - 1 and (n+1)p maximize f_X).

The binomial distribution has two parameters n, p with n a positive integer and 0 ≤ p ≤ 1.

Example: sampling. Let p be the fraction of households in the US with income less than $15000 per year. Select n households at random from the population. Define X as the number of households among the n selected with income less than $15000. The distribution of X is binomial if the selections of households are independent. This is true if the selection is done with replacement, and approximately true if the population is sufficiently large.
Assume n = 100 and 16 households have an income less than $15000. Now 16 is an estimate of E(X) = np, and this suggests that it is reasonable to guess that p̂ = 16/n = .16, i.e. that 16% of US households have an income less than $15000 per year.
Hypergeometric distribution

In the example we assumed (counterfactually) that selection was with replacement. Now consider a population of size N from which we select a sample of size n without replacement. In the population, M households have an income of less than $15000. X is the number of households among the n selected with income less than $15000. X = x iff

we select x households from the M with an income of less than $15000: this can be done in C(M, x) ways;
we select the remaining n - x households from the N - M with an income greater than or equal to $15000: this can be done in C(N - M, n - x) ways.
The total number of selections (without replacement) of n households from the population of N households is C(N, n). Combining these results we have

Pr(X = x) = C(M, x) C(N - M, n - x) / C(N, n).
The distribution P_X has a density with respect to the counting measure:

f_X(x) = C(M, x) C(N - M, n - x) / C(N, n), x = 0, ..., n
       = 0, otherwise.

This is the Hypergeometric distribution. It can be shown (see Casella and Berger) that

E(X) = n M/N
Var(X) = n (M/N)(1 - M/N)(N - n)/(N - 1).

Compare these results to those for the binomial distribution with p = M/N: the mean is the same, while the variance is smaller by the factor (N - n)/(N - 1), which is close to 1 if the population is large relative to the sample.
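The moment formulas can be checked numerically from the pmf. A sketch, not part of the notes (the population and sample sizes are arbitrary):

```python
from math import comb

def hypergeom_pmf(x, N, M, n):
    """Hypergeometric density: C(M, x) C(N-M, n-x) / C(N, n)."""
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

N, M, n = 50, 20, 10
support = range(max(0, n - (N - M)), min(n, M) + 1)
mean = sum(x * hypergeom_pmf(x, N, M, n) for x in support)
var = sum(x * x * hypergeom_pmf(x, N, M, n) for x in support) - mean**2
# mean == n*M/N == 4.0; var == n*(M/N)*(1-M/N)*(N-n)/(N-1)
```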
Geometric distribution

Consider a sequence of independent Bernoulli random experiments with probability of outcome 1 equal to p. Call outcome 1 a success and outcome 0 a failure. Define X as the number of experiments before the first success. X = x iff the outcomes of the first x + 1 Bernoulli experiments are 0, ..., 0, 1 with x leading 0's. Hence

Pr(X = x) = (1 - p)^x p.
P_X has a density with respect to the counting measure:

f_X(x) = (1 - p)^x p, x = 0, 1, ...
       = 0, otherwise.

The distribution P_X is called the Geometric distribution. It has one parameter p with 0 < p ≤ 1. The mgf is

M_X(t) = E(e^{tX}) = Σ_{x=0}^∞ e^{tx} (1 - p)^x p = p Σ_{x=0}^∞ ((1 - p)e^t)^x = p / (1 - (1 - p)e^t),

provided (1 - p)e^t < 1.
From the mgf we find

E(X) = dM_X(0)/dt = (1 - p)/p
E(X^2) = d^2M_X(0)/dt^2 = (1 - p)(2 - p)/p^2,

so that

Var(X) = E(X^2) - E(X)^2 = (1 - p)(2 - p)/p^2 - ((1 - p)/p)^2 = (1 - p)/p^2.

Sometimes we define X_1 as the number of Bernoulli experiments needed for the first success. Then

X_1 = X + 1

and e.g.

M_{X_1}(t) = E(e^{tX_1}) = e^t E(e^{tX}) = e^t M_X(t).
Example of the geometric distribution: consider a job seeker and let p be the probability of receiving a job offer in any week. The week in which the first offer is received has the distribution P_{X_1}. We have for x_2 ≥ x_1

Pr(X_1 > x_2 | X_1 > x_1) = Pr(X_1 > x_2) / Pr(X_1 > x_1)
= (Σ_{x=x_2+1}^∞ (1 - p)^{x-1} p) / (Σ_{x=x_1+1}^∞ (1 - p)^{x-1} p)
= (1 - p)^{x_2} / (1 - p)^{x_1}
= (1 - p)^{x_2 - x_1}
= Σ_{x=x_2-x_1+1}^∞ (1 - p)^{x-1} p = Pr(X_1 > x_2 - x_1).

Conclusion: if the job seeker has waited x_1 weeks, the probability that he/she has to wait more than another x_2 - x_1 weeks is the same as the probability of waiting more than x_2 - x_1 weeks from the beginning of the job search. The geometric distribution has no memory.
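The no-memory property can be verified exactly from the survival function Pr(X_1 > x) = (1 - p)^x. A short check (not in the notes; p and the waiting times are arbitrary):

```python
p = 0.2

def surv(x):
    """Pr(X1 > x) for X1 = number of trials needed for the first success."""
    return (1 - p) ** x

x1, x2 = 3, 8
cond = surv(x2) / surv(x1)   # Pr(X1 > x2 | X1 > x1)
fresh = surv(x2 - x1)        # Pr(X1 > x2 - x1), as if the wait started fresh
# cond == fresh: the geometric distribution has no memory
```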
Negative binomial distribution

The setup is as for the geometric distribution. Define X as the number of failures before the r-th success. X = x iff trial x + r is a success (event A) and the previous x + r - 1 trials contain r - 1 successes and x failures (event B). Because the events A and B depend on independent random variables, P(A ∩ B) = P(A)P(B). We have P(A) = p. A particular sequence with r - 1 successes and x failures has probability p^{r-1}(1 - p)^x. Because we can choose the r - 1 successes among the x + r - 1 trials in C(x + r - 1, r - 1) ways, this is the number of such sequences. Hence

P(B) = C(x + r - 1, r - 1) p^{r-1} (1 - p)^x.
Combining,

Pr(X = x) = p C(x + r - 1, r - 1) p^{r-1} (1 - p)^x.

P_X has a density with respect to the counting measure:

f_X(x) = C(x + r - 1, r - 1) p^r (1 - p)^x, x = 0, 1, ...
       = 0, otherwise.

This is the Negative binomial distribution. The parameters are r (a positive integer) and p with 0 < p ≤ 1.
Poisson distribution

The Poisson distribution applies to the number of occurrences of some event in a time interval of finite length, e.g. the number of job offers received by a job seeker in a month. Offers can arrive at any moment (in continuous time); compare with the geometric distribution. Define X(a, b) as the number of offers in [a, b). The symbol o(h) ("small o of h") denotes any function with lim_{h→0} o(h)/h = 0.

Assumptions:
(i) Pr(X(s, s + h) = 1) = λh + o(h)
(ii) Pr(X(s, s + h) ≥ 2) = o(h)
(iii) X(a, b) and X(c, d) are independent if [a, b) ∩ [c, d) = ∅.
Consider [0, t) and divide it into n intervals of length h = t/n. Then (neglecting probabilities that are of order o(h))

Pr(X(0, t) = k) = C(n, k) (λh)^k (1 - λh)^{n-k}
= C(n, k) (λt/n)^k (1 - λt/n)^{n-k}
= ((λt)^k / k!) (n(n-1)···(n-k+1)/n^k) (1 - λt/n)^{n-k}.

Now

lim_{n→∞} n(n-1)···(n-k+1)/n^k = 1
lim_{n→∞} (1 - λt/n)^{n-k} = lim_{n→∞} (1 - λt/n)^{-k} lim_{n→∞} (1 - λt/n)^n = e^{-λt}.

Conclusion: for n → ∞, and if we write X for X(0, t),

Pr(X = k) = e^{-λt} (λt)^k / k!.
The distribution P_X has a density with respect to the counting measure:

f_X(x) = e^{-θ} θ^x / x!, x = 0, 1, ...
       = 0, otherwise.

The distribution P_X is the Poisson distribution. It has one parameter θ > 0 (in the derivation above, θ = λt). Notation: X ~ Poisson(θ). The mgf is

M_X(t) = Σ_{x=0}^∞ e^{tx} e^{-θ} θ^x / x! = e^{-θ} Σ_{x=0}^∞ (e^t θ)^x / x! = e^{θ(e^t - 1)},

so that

E(X) = θ
E(X^2) = θ^2 + θ
Var(X) = θ.

Note E(X) = Var(X).
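The equality of mean and variance can be checked numerically from the pmf. A sketch, not in the notes (θ and the truncation point are arbitrary; the tail mass beyond x = 100 is negligible for this θ):

```python
from math import exp, factorial

def poisson_pmf(x, theta):
    """Poisson density: exp(-theta) theta^x / x!."""
    return exp(-theta) * theta**x / factorial(x)

theta = 3.5
xs = range(101)  # truncated support; remaining tail mass is negligible
mean = sum(x * poisson_pmf(x, theta) for x in xs)
var = sum(x * x * poisson_pmf(x, theta) for x in xs) - mean**2
# mean and var should both equal theta
```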
Continuous distributions

Uniform distribution

Random experiment: pick a number at random from [a, b]. Then

P_X([a, x]) = (x - a)/(b - a) = ∫_a^x 1/(b - a) ds.

Hence P_X has a density with respect to the Lebesgue measure:

f_X(x) = 1/(b - a), a ≤ x ≤ b
       = 0, otherwise.

P_X is the Uniform distribution on [a, b]. Notation: X ~ U[a, b]. We have

M_X(t) = (e^{bt} - e^{at})/((b - a)t)
E(X) = (a + b)/2
Var(X) = (b - a)^2/12.
[Graph of the uniform density]
Normal distribution

The distribution P_X has density with respect to the Lebesgue measure

f_X(x) = (1/(σ√(2π))) e^{-(x-µ)^2/(2σ^2)}, -∞ < x < ∞.

The mgf is

M_X(t) = E(e^{tX}) = e^{tµ} E(e^{t(X-µ)})
= e^{tµ} ∫ (1/(σ√(2π))) e^{t(x-µ)} e^{-(x-µ)^2/(2σ^2)} dx
= e^{tµ} ∫ (1/(σ√(2π))) e^{-((x-µ)^2 - 2σ^2 t(x-µ))/(2σ^2)} dx.

Now

(x-µ)^2 - 2σ^2 t(x-µ) = (x-µ)^2 - 2σ^2 t(x-µ) + σ^4 t^2 - σ^4 t^2 = (x - µ - σ^2 t)^2 - σ^4 t^2,

so that

M_X(t) = e^{tµ + σ^2t^2/2} ∫ (1/(σ√(2π))) e^{-(x - µ - σ^2t)^2/(2σ^2)} dx = e^{tµ + σ^2t^2/2}.
From the mgf,

E(X) = µ
E(X^2) = σ^2 + µ^2,

so that Var(X) = σ^2. The distribution P_X is the Normal distribution with mean µ and variance σ^2. Notation: X ~ N(µ, σ^2).
Define

Z = (X - µ)/σ.

Then

E(Z) = 0, Var(Z) = 1.

Hence Z has a normal distribution with µ = 0, σ^2 = 1. This is the standard normal distribution, with density

φ(x) = (1/√(2π)) e^{-x^2/2}, -∞ < x < ∞,

and cdf

Φ(x) = ∫_{-∞}^x φ(s) ds.

We can compute the probability of an interval [a, b] with the standard normal cdf:

Pr(a ≤ X ≤ b) = Pr((a - µ)/σ ≤ Z ≤ (b - µ)/σ) = Φ((b - µ)/σ) - Φ((a - µ)/σ).
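The interval probability can be computed with the error function from the standard library, using Φ(x) = (1 + erf(x/√2))/2. A sketch, not in the notes (the interval is the familiar 95% one for N(0, 1)):

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def normal_interval_prob(a, b, mu, sigma):
    """Pr(a <= X <= b) for X ~ N(mu, sigma^2) by standardizing to Z."""
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

p = normal_interval_prob(-1.96, 1.96, 0.0, 1.0)  # ≈ 0.95
```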
Shape of the normal density: the bell curve.
Why is the normal distribution so popular? Consider Galton's quincunx or dropping board. Define X_n as the position (relative to 0) after n rows of pins. If Z_n takes the values -1 and 1 and gives the direction taken at row n, then

X_n = Z_1 + ... + Z_n.
If n is large then X_n has approximately a normal distribution. Central limit theorem: a sum of many independent small effects gives a normal distribution.
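The quincunx is straightforward to simulate. A sketch, not part of the notes (the number of rows and balls is arbitrary); since the Z_n are ±1 with equal probability, X_n has mean 0 and variance n:

```python
import random

def quincunx_position(n_rows, rng):
    """Final position of a ball after n_rows pins, each deflecting it -1 or +1."""
    return sum(rng.choice((-1, 1)) for _ in range(n_rows))

rng = random.Random(42)
positions = [quincunx_position(100, rng) for _ in range(10_000)]
mean = sum(positions) / len(positions)
var = sum((x - mean) ** 2 for x in positions) / len(positions)
# mean ≈ 0, var ≈ 100; a histogram of positions is bell-shaped
```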
Exponential distribution

Consider the waiting time to an event that can occur at any time (compare with the geometric distribution). Define the hazard or failure rate by

Pr(event in [t, t + dt) | event after t) = Pr(t ≤ X < t + dt | X ≥ t) = f_X(t)dt / (1 - F_X(t)).

Assume that the hazard is constant:

f_X(t) / (1 - F_X(t)) = λ.

Then the solution to this differential equation, obtained by integration, is

f_X(t) = λe^{-λt}.
The distribution P_X has a density with respect to the Lebesgue measure:

f_X(x) = λe^{-λx}, x ≥ 0
       = 0, otherwise.

P_X is the Exponential distribution. There is one parameter λ > 0 and the notation is X ~ Exp(λ). The mgf is

M_X(t) = λ/(λ - t), t < λ,

and hence

E(X) = 1/λ
Var(X) = 1/λ^2.
Note that for t ≥ s

Pr(X > t | X > s) = Pr(X > t)/Pr(X > s) = e^{-λt}/e^{-λs} = e^{-λ(t-s)}.

If you have waited s, the probability of an additional wait of t - s is the same as if the wait had started at time 0. As with the geometric distribution, the exponential distribution has no memory. If X is the length of a human life, compare Pr(X > 40 | X > 30) and Pr(X > 70 | X > 60).

Connection with the Poisson distribution: if the event is recurrent and the waiting time between occurrences has an exponential distribution with parameter λ, then the number of occurrences in [0, t] has a Poisson distribution with parameter λt.
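The exponential-Poisson connection can be checked by simulation: generate Exp(λ) inter-arrival times and count how many events land in [0, t]. A sketch, not in the notes (λ, t and the number of replications are arbitrary):

```python
import random

rng = random.Random(7)
lam, t, reps = 2.0, 3.0, 20_000

def count_events(rng):
    """Number of events in [0, t] when inter-arrival times are Exp(lam)."""
    total, n = rng.expovariate(lam), 0
    while total <= t:
        n += 1
        total += rng.expovariate(lam)
    return n

counts = [count_events(rng) for _ in range(reps)]
mean = sum(counts) / reps
# counts should be approximately Poisson(lam*t), so mean ≈ lam*t = 6.0
```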
Gamma distribution

The Gamma distribution is the distribution of X = Y_1 + ... + Y_r, with the Y_k independent exponential random variables with parameter λ. X is the waiting time to the r-th occurrence of the event; compare with the negative binomial distribution. The distribution P_X has a density with respect to the Lebesgue measure:

f_X(x) = (λ/Γ(r)) (λx)^{r-1} e^{-λx}, x ≥ 0
       = 0, otherwise,

with Γ the gamma function. Γ(r) = (r - 1)! if r is a positive integer; otherwise it has to be computed numerically.
This is the Gamma distribution with parameters λ, r > 0; r need not be an integer. Notation: X ~ Γ(λ, r). The mgf is

M_X(t) = (λ/(λ - t))^r, t < λ,

so that

E(X) = r/λ
Var(X) = r/λ^2.
Lognormal distribution

Let Y ~ N(µ, σ^2) and define X = e^Y. The distribution P_X has density

f_X(x) = (1/(xσ√(2π))) e^{-(ln x - µ)^2/(2σ^2)}, x > 0
       = 0, otherwise.

Exercise: Derive this density. This is the Lognormal distribution with parameters µ and σ^2. The mean and variance can be derived from the mgf of the normal distribution:

E(X) = e^{µ + σ^2/2}
Var(X) = e^{2µ + 2σ^2} - e^{2µ + σ^2}.

What are E(ln X) and Var(ln X)?
Cauchy distribution

A random variable that has a distribution with density with respect to the Lebesgue measure

f_X(x) = 1/(πβ [1 + ((x - α)/β)^2]), -∞ < x < ∞,

has a Cauchy distribution with parameters α and β > 0. The density is symmetric around α, which is the median of X. E(X) does not exist and Var(X) = ∞. The mgf is ∞ for t ≠ 0, so it does not exist.
Chi-squared distribution

The chi-squared distribution is a special case of the Γ distribution: set r = k/2 and λ = 1/2. The density is

f_X(x) = (1/(Γ(k/2) 2^{k/2})) x^{k/2 - 1} e^{-x/2}, x ≥ 0
       = 0, otherwise.

The parameter k is called the degrees of freedom of the distribution. The chi-squared distribution is important because of the following result: if X has a standard normal distribution, then Y = X^2 has a chi-squared distribution with k = 1.
We derive the mgf:

M_Y(t) = E(e^{tX^2}) = ∫ (1/√(2π)) e^{tx^2 - x^2/2} dx = ∫ (1/√(2π)) e^{-(1 - 2t)x^2/2} dx
= 1/√(1 - 2t) = ((1/2)/((1/2) - t))^{1/2}, t < 1/2,

which is the mgf of the Γ distribution with r = 1/2 and λ = 1/2, i.e. the chi-squared distribution with k = 1.
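The result can be checked by simulation: squaring standard normal draws should give the mean 1 and variance 2 of the χ²(1) (= Γ(1/2, 1/2)) distribution. A sketch, not in the notes (sample size arbitrary):

```python
import random

rng = random.Random(11)
ys = [rng.gauss(0.0, 1.0) ** 2 for _ in range(200_000)]
mean = sum(ys) / len(ys)
var = sum((y - mean) ** 2 for y in ys) / len(ys)
# chi-squared with k = 1 degrees of freedom: mean = 1, variance = 2
```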
Exponential family of distributions

The exponential family of densities consists of the densities that can be expressed as

f_X(x) = h(x) c(θ) e^{Σ_{i=1}^k w_i(θ) t_i(x)}.

Note that c and w_i, i = 1, ..., k do not depend on x, and h and t_i, i = 1, ..., k do not depend on θ; θ is the vector of parameters of the distribution. Why is this useful? We will see that if we have data from an exponential family distribution, the information can be summarized by the t_i, i = 1, ..., k.

Examples

(i) Binomial distribution: for x = 0, ..., n

f_X(x) = C(n, x) p^x (1 - p)^{n-x} = C(n, x) (1 - p)^n e^{x ln(p/(1-p))}.

Hence

h(x) = C(n, x), t(x) = x, c(θ) = (1 - p)^n, w(θ) = ln(p/(1 - p)).
(ii) Normal distribution: for -∞ < x < ∞

f_X(x) = (1/(σ√(2π))) e^{-(x-µ)^2/(2σ^2)} = (1/(σ√(2π))) e^{-µ^2/(2σ^2)} e^{-x^2/(2σ^2) + (µ/σ^2)x}.

Hence

h(x) = 1, t_1(x) = x^2, t_2(x) = x
c(θ) = (1/(σ√(2π))) e^{-µ^2/(2σ^2)}, w_1(θ) = -1/(2σ^2), w_2(θ) = µ/σ^2.

Other exponential family distributions: Poisson, exponential, Gamma. The density of the uniform distribution is

f_X(x) = (1/(b - a)) I(a ≤ x ≤ b).

The function I(a ≤ x ≤ b) cannot be factorized into a function of x and a function of a, b. Hence the uniform distribution does not belong to the exponential family.
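The binomial factorization in example (i) can be verified numerically: the usual pmf and the h(x) c(θ) exp(w(θ) t(x)) form must agree on the whole support. A sketch, not in the notes (n and p arbitrary):

```python
from math import comb, exp, log

n, p = 12, 0.3

def pmf(x):
    """Standard binomial density."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def exp_family_form(x):
    """Same density written as h(x) c(theta) exp(w(theta) t(x))."""
    h = comb(n, x)
    c = (1 - p) ** n
    w = log(p / (1 - p))
    return h * c * exp(w * x)

# the two expressions agree for x = 0, ..., n
```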
Multivariate distributions: recapitulation

Consider a probability space (Ω, A, P) and define a vector of random variables or random vector X as a function X : Ω → R^K, i.e.

X(ω) = (X_1(ω), ..., X_K(ω))'.

The distribution of X is a probability measure P_X : B_K → [0, 1]. This is usually called the joint distribution of the random vector X. We consider the case that P_X has a density with respect to the counting measure (discrete distribution) or with respect to the Lebesgue measure (continuous distribution). The density f_X(x_1, ..., x_K) is called the joint density of X.
We have

Pr(X_1 ∈ B) = P_X(B × R × ... × R) = ∫_B ... ∫ f_X(x_1, ..., x_K) dx_1 ... dx_K = ∫_B f_{X_1}(x_1) dx_1,

with

f_{X_1}(x_1) = ∫ ... ∫ f_X(x_1, x_2, ..., x_K) dx_2 ... dx_K.

f_{X_1} is called the marginal density of X_1. The marginal density of X_k for any k is obtained in the same way. For discrete distributions, replace integration by summation.
Consider the subvectors (X_1, ..., X_{K_1}) and (X_{K_1+1}, ..., X_K). The distributions of these subvectors are independent if and only if

f_X(x_1, ..., x_K) = f_{X_1...X_{K_1}}(x_1, ..., x_{K_1}) f_{X_{K_1+1}...X_K}(x_{K_1+1}, ..., x_K),

i.e. the joint density is the product of the marginal densities.

The conditional distribution of X_1, ..., X_{K_1} given X_{K_1+1}, ..., X_K has density

f_{X_1...X_{K_1}|X_{K_1+1}...X_K}(x_1, ..., x_{K_1} | x_{K_1+1}, ..., x_K) = f_X(x_1, ..., x_K) / f_{X_{K_1+1}...X_K}(x_{K_1+1}, ..., x_K),

i.e. it is the ratio of the joint density and the marginal density of the variables on which we condition.
If X̃ is any subvector of X that does not have X_1 as a component, then the conditional mean of X_1 given X̃ = x̃ can be computed using the conditional density of X_1 given X̃:

E(X_1 | X̃ = x̃) = ∫_R x_1 f_{X_1|X̃}(x_1 | x̃) dx_1.

For a discrete distribution, replace integration by summation. The conditional variance of X_1 given X̃ is

Var(X_1 | X̃ = x̃) = E((X_1 - E(X_1 | X̃ = x̃))^2 | X̃ = x̃).

We have

Var(X_1 | X̃ = x̃) = E(X_1^2 - 2X_1 E(X_1 | X̃ = x̃) + E(X_1 | X̃ = x̃)^2 | X̃ = x̃)
= E(X_1^2 | X̃ = x̃) - 2E(X_1 | X̃ = x̃)^2 + E(X_1 | X̃ = x̃)^2
= E(X_1^2 | X̃ = x̃) - E(X_1 | X̃ = x̃)^2.

Compare this result to that for the unconditional variance.
Law of iterated expectations:

E(X_1) = E_X̃(E(X_1 | X̃)).

Remember that on the rhs we just integrate E(X_1 | X̃ = x̃) with respect to the distribution of X̃. For the variance, note

E_X̃[Var(X_1 | X̃)] = E_X̃[E(X_1^2 | X̃)] - E_X̃[E(X_1 | X̃)^2]

and, because E(X_1 | X̃) is a random variable that is a function of X̃,

Var[E(X_1 | X̃)] = E_X̃[E(X_1 | X̃)^2] - (E_X̃[E(X_1 | X̃)])^2.

If we add these equations we obtain

E[Var(X_1 | X̃)] + Var(E(X_1 | X̃)) = E(X_1^2) - (E(X_1))^2 = Var(X_1).
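The variance decomposition can be verified exactly on a small discrete joint distribution. A sketch, not in the notes (the joint pmf below is a hypothetical example):

```python
# Hypothetical joint pmf of (X1, X2) on {0,1} x {0,1}
pmf = {(0, 0): 0.1, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.4}

def var(pairs):
    """Variance and mean of a discrete distribution given as (value, prob) pairs."""
    m = sum(v * p for v, p in pairs)
    return sum((v - m) ** 2 * p for v, p in pairs), m

# Unconditional variance of X1
total_var, _ = var([(x1, p) for (x1, _), p in pmf.items()])

# E[Var(X1|X2)] and Var(E(X1|X2))
ev, means = 0.0, []
for x2 in (0, 1):
    px2 = sum(p for (_, b), p in pmf.items() if b == x2)
    cond = [(x1, p / px2) for (x1, b), p in pmf.items() if b == x2]
    v, m = var(cond)
    ev += v * px2
    means.append((m, px2))
vve, _ = var(means)
# total_var == ev + vve (law of total variance)
```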
Summary measures associated with multivariate distributions, i.e. the distribution of a random vector X
Obvious summary measures: the means and variances of the random variables in X (marginal means and variances). In random vectors we also consider the covariance of any two components of X, say X_1 and X_2:

Cov(X_1, X_2) = E[(X_1 - E(X_1))(X_2 - E(X_2))].

The covariance is informative on the relation between X_1 and X_2; e.g. for a discrete distribution

Cov(X_1, X_2) = Σ_{x_1} Σ_{x_2} (x_1 - E(X_1))(x_2 - E(X_2)) f_{X_1X_2}(x_1, x_2).

If outcomes with x_1 - E(X_1) > 0 and x_2 - E(X_2) > 0, or with x_1 - E(X_1) < 0 and x_2 - E(X_2) < 0 (deviations go in the same direction), are more likely than outcomes with x_1 - E(X_1) > 0 and x_2 - E(X_2) < 0, or with x_1 - E(X_1) < 0 and x_2 - E(X_2) > 0 (deviations go in opposite directions), then Cov(X_1, X_2) > 0.
In that case there is a positive association between X_1 and X_2. If the second type of outcomes is more likely, then Cov(X_1, X_2) < 0 and the association is negative. Note that for constants c, d

Cov(cX_1, dX_2) = cd Cov(X_1, X_2),

so that the size of Cov(X_1, X_2) is not a good measure of the strength of the association. To measure the strength we define the correlation coefficient of X_1, X_2 by

ρ_{X_1X_2} = Cov(X_1, X_2) / (√Var(X_1) √Var(X_2)).
To derive its properties we need the Cauchy-Schwarz inequality:

|E(X_1X_2)| ≤ √E(X_1^2) √E(X_2^2).

Proof: consider

0 ≤ E[(tX_1 + X_2)^2] = t^2 E(X_1^2) + 2t E(X_1X_2) + E(X_2^2).

The rhs is a quadratic in t with at most one zero, so its discriminant satisfies

4E(X_1X_2)^2 - 4E(X_1^2)E(X_2^2) ≤ 0.

Dividing by 4 and taking the square root gives the inequality. If

E[(tX_1 + X_2)^2] = 0,

then

Pr(tX_1 + X_2 = 0) = 1,

i.e. the joint distribution is concentrated on the line t x_1 + x_2 = 0.
Properties of the correlation coefficient:

ρ_{cX_1, dX_2} = ρ_{X_1X_2} for c, d > 0.

By Cauchy-Schwarz,

|Cov(X_1, X_2)| = |E[(X_1 - E(X_1))(X_2 - E(X_2))]| ≤ √E((X_1 - E(X_1))^2) √E((X_2 - E(X_2))^2),

so that |ρ_{X_1X_2}| ≤ 1.
Note that |ρ_{X_1X_2}| = 1 iff Pr(X_2 - E(X_2) = t(X_1 - E(X_1))) = 1 for some t ≠ 0. Hence Pr(X_2 = a + bX_1) = 1 with a = E(X_2) - tE(X_1) and b = t. Note that Pr(X_2 - E(X_2) = t(X_1 - E(X_1))) = 1 implies

Cov(X_1, X_2) = b Var(X_1),

so that sign(ρ_{X_1X_2}) = sign(Cov(X_1, X_2)) = sign(b).

Conclusion: |ρ_{X_1X_2}| = 1 iff Pr(X_2 = a + bX_1) = 1 for some b ≠ 0. If ρ_{X_1X_2} = 1 then b > 0, and if ρ_{X_1X_2} = -1 then b < 0. The correlation coefficient is a measure of the strength of the association, and the extreme values correspond to a linear relation.
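That exact linear relations give correlation ±1 is easy to verify numerically. A sketch, not in the notes (the coefficients and sample are arbitrary):

```python
import random

def corr(xs, ys):
    """Sample correlation coefficient: cov / (sd_x * sd_y)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    vx = sum((x - mx) ** 2 for x in xs) / n
    vy = sum((y - my) ** 2 for y in ys) / n
    return cov / (vx * vy) ** 0.5

rng = random.Random(5)
xs = [rng.gauss(0, 1) for _ in range(1000)]
ys = [2.0 + 3.0 * x for x in xs]   # exact linear relation with b > 0
zs = [1.0 - 2.0 * x for x in xs]   # exact linear relation with b < 0
# corr(xs, ys) == 1 and corr(xs, zs) == -1 (up to rounding)
```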
In the case of a multivariate distribution we organize the variances and covariances in a matrix, the variance(-covariance) matrix of X:

Var(X) = [ Var(X_1)        Cov(X_1, X_2)  ...  Cov(X_1, X_K)
           Cov(X_1, X_2)   Var(X_2)       ...  Cov(X_2, X_K)
           ...
           Cov(X_1, X_K)   Cov(X_2, X_K)  ...  Var(X_K)      ].

Note that this is a symmetric K × K matrix: Var(X) = Var(X)'. Often we use the notation Var(X) = Σ.
Remember that if X is a K-vector, then

(X - µ)(X - µ)' = (X_1 - µ_1, ..., X_K - µ_K)' (X_1 - µ_1, ..., X_K - µ_K)

is the K × K matrix with (k, l) element (X_k - µ_k)(X_l - µ_l): the diagonal contains (X_1 - µ_1)^2, ..., (X_K - µ_K)^2 and the off-diagonal elements are the cross-products. So, if we denote µ = E(X),

Σ = Var(X) = E((X - µ)(X - µ)').
Linear and quadratic functions of random vectors

If X is a random vector with K components and a is a K-vector of constants, we define the linear function of X

a'X = Σ_{k=1}^K a_k X_k.

Hence

E(a'X) = E(Σ_{k=1}^K a_k X_k) = Σ_{k=1}^K a_k E(X_k) = a'E(X).

Also

Var(a'X) = E[(a'X - E(a'X))^2] = E[(a'X - a'µ)(a'X - a'µ)'] = E[a'(X - µ)(X - µ)'a] = a'E[(X - µ)(X - µ)']a = a'Var(X)a.
Moment generating function of a joint distribution

If X is a random vector, the mgf of X is

M_X(t) = E(e^{t_1X_1 + ... + t_KX_K}),

if the mgf exists for -h < t_k < h, k = 1, ..., K, where t = (t_1, ..., t_K)'. Note that

∂^2 M_X(t)/∂t_1∂t_2 = E(X_1X_2 e^{t_1X_1 + ... + t_KX_K}),

so that

∂^2 M_X/∂t_1∂t_2 (0) = E(X_1X_2).
This can be used to compute the covariance, because

Cov(X_1, X_2) = E(X_1X_2) - E(X_1)E(X_2).

The mgf of the marginal distribution of X_1 is

M_{X_1}(t_1) = M_X(t_1, 0, ..., 0).
Special multivariate distributions

Multinomial distribution

Recall the binomial distribution: the number of 1's in n independent Bernoulli experiments. Instead of a Bernoulli experiment with two outcomes, consider a random experiment with K outcomes k = 1, ..., K. An example is to pick a student at random from a class and record his/her nationality. Label the nationalities k = 1, ..., K. If the fraction with nationality k is p_k, then, if the outcome of the random selection is Y, we have

Pr(Y = k) = p_k, k = 1, ..., K,

with Σ_{k=1}^K p_k = 1.
Repeat this experiment n times and let the repetitions be independent. Define X_k as the number of experiments with outcome k. Note Σ_{k=1}^K X_k = n, so that X_K is determined by X_1, ..., X_{K-1}. Consider a particular sequence of n outcomes, e.g. one that starts with outcomes 3, 4, 1, 1, ... and ends with K - 1, K, with probability p_3 p_4 p_1 p_1 ... p_{K-1} p_K. Any sequence in which outcome k occurs x_k times has probability

p_1^{x_1} p_2^{x_2} ... p_{K-1}^{x_{K-1}} p_K^{x_K},

with x_K = n - Σ_{k=1}^{K-1} x_k. To compute Pr(X_1 = x_1, ..., X_{K-1} = x_{K-1}) we count the number of such sequences.
This is equivalent to picking x_1 experiments with outcome 1, x_2 with outcome 2, etc. from the n experiments:

Start with picking the x_1 experiments with outcome 1 among the n experiments. This can be done in C(n, x_1) ways.
From the remaining n - x_1 experiments, pick the experiments with outcome 2. This can be done in C(n - x_1, x_2) ways.

The total number of ways to choose the experiments with outcomes 1 and 2 is

C(n, x_1) C(n - x_1, x_2) = n! / (x_1! x_2! (n - x_1 - x_2)!).

Using the same argument repeatedly, we find that the total number of ways to choose the experiments with outcomes 1, 2, ..., K is

n! / (x_1! ... x_{K-1}! (n - x_1 - ... - x_{K-1})!) = n! / (x_1! ... x_K!).
Hence

Pr(X_1 = x_1, ..., X_{K-1} = x_{K-1}) = (n! / (x_1! ... x_K!)) p_1^{x_1} p_2^{x_2} ... p_K^{x_K}.

The Multinomial joint density of X_1, ..., X_{K-1} is

f_X(x_1, ..., x_{K-1}) = (n! / Π_{k=1}^K x_k!) Π_{k=1}^K p_k^{x_k}, 0 ≤ x_k ≤ n, Σ_{k=1}^K x_k = n
                       = 0, otherwise.

Multinomial formula:

(a_1 + ... + a_K)^n = Σ_{x_1+...+x_K=n} (n! / (x_1! ... x_K!)) a_1^{x_1} ... a_K^{x_K}.
Using this formula, the mgf is

M_X(t) = E(e^{t_1X_1 + ... + t_{K-1}X_{K-1}})
= Σ_{x_1+...+x_K=n} (n! / (x_1! ... x_K!)) (e^{t_1}p_1)^{x_1} ... (e^{t_{K-1}}p_{K-1})^{x_{K-1}} p_K^{x_K}
= (Σ_{k=1}^{K-1} e^{t_k} p_k + p_K)^n.

From the mgf we find

E(X_k) = np_k
Var(X_k) = np_k(1 - p_k)
Cov(X_k, X_l) = -np_kp_l, k ≠ l.

Exercise: What is the marginal distribution of X_k? What is the conditional distribution of X_1, X_2 given X_3 = x_3, ..., X_{K-1} = x_{K-1}?
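The negative covariance between counts is intuitive (more outcomes of one kind leave fewer trials for the others) and can be checked by simulation. A sketch, not in the notes (K = 3 with arbitrary probabilities):

```python
import random

rng = random.Random(9)
probs = [0.2, 0.3, 0.5]
n, reps = 30, 20_000

def draw_counts(rng):
    """One multinomial draw: counts of the K = 3 outcomes in n trials."""
    counts = [0, 0, 0]
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for k, p in enumerate(probs):
            acc += p
            if u < acc:
                counts[k] += 1
                break
    return counts

samples = [draw_counts(rng) for _ in range(reps)]
m1 = sum(s[0] for s in samples) / reps
m2 = sum(s[1] for s in samples) / reps
cov12 = sum((s[0] - m1) * (s[1] - m2) for s in samples) / reps
# E(X1) = n*p1 = 6 and Cov(X1, X2) = -n*p1*p2 = -1.8
```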
Multivariate normal distribution

The K-dimensional random vector X has a K-dimensional Multivariate normal distribution if its distribution has a density with respect to the K-dimensional Lebesgue measure equal to

f_X(x) = (1 / (|Σ|^{1/2} (2π)^{K/2})) e^{-(1/2)(x - µ)'Σ^{-1}(x - µ)}, x ∈ R^K.

By completion of squares (see the 1-dimensional case) the mgf is

M_X(t) = e^{t'µ + (1/2)t'Σt}.

Exercise: Derive the mgf. Hence

E(X) = µ
Var(X) = Σ.

Exercise: Derive these results. The marginal distribution of X_k is normal with mean µ_k and variance σ_k^2, the k-th element of the main diagonal of Σ. Exercise: Prove this using the mgf.
Special case K = 2: the bivariate normal distribution. Let the random vector be (Y, X)'. The conditional distribution of Y given X = x is normal with

E(Y | X = x) = µ_Y + (σ_{XY}/σ_X^2)(x - µ_X)
Var(Y | X = x) = σ_Y^2 (1 - σ_{XY}^2/(σ_X^2 σ_Y^2)) = σ_Y^2 (1 - ρ_{XY}^2),

with σ_{XY} = Cov(X, Y). The conditional mean is linear in x. Compare with the result that Pr(Y = a + bX) = 1 if and only if |ρ_{XY}| = 1.
Regression fallacy or regression to the mean

Francis Galton observed that tall fathers have on average shorter sons, and short fathers have on average taller sons (in Victorian England mothers and daughters did not count). If this process were to continue, one would expect that in the long run the extremes would disappear and all fathers and sons would have the average height.

Using the same reasoning: short sons have on average taller fathers (with a height closer to the mean) and tall sons have on average shorter fathers (again with a height closer to the mean). By this argument there is a tendency to move away from the mean!

Similar observations can be made about many phenomena: rookie players who do exceptionally well in the first year tend to have a slump in the second; bringing in new management when a company underperforms seems to improve performance, etc.
Analysis. Let

X = height of father
Y = height of son.

A reasonable assumption is that X, Y have a bivariate normal distribution with

E(X) = E(Y) = µ
Var(X) = Var(Y) = σ^2
0 < ρ_{XY} < 1.
Hence

E(Y | X = x) = µ + ρ(x - µ).

If x > µ,

0 < E(Y | X = x) - µ < x - µ,

i.e. the average height of sons of fathers with more than average height is closer to the mean. If x < µ,

0 > E(Y | X = x) - µ > x - µ,

i.e. the average height of sons of fathers with less than average height is closer to the mean. However, the heights of fathers and sons have the same (normal) distribution, i.e. there is no change over the generations.
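Regression to the mean is easy to see in a simulation of the bivariate normal model above. A sketch, not in the notes (the numbers µ = 175 cm, σ = 7 cm, ρ = 0.5 are hypothetical):

```python
import random

rng = random.Random(13)
mu, sigma, rho = 175.0, 7.0, 0.5  # hypothetical values
pairs = []
for _ in range(50_000):
    x = rng.gauss(mu, sigma)  # father's height
    # son's height given the father's: N(mu + rho*(x - mu), sigma^2*(1 - rho^2))
    y = rng.gauss(mu + rho * (x - mu), sigma * (1 - rho**2) ** 0.5)
    pairs.append((x, y))

tall = [(x, y) for x, y in pairs if x > mu + sigma]  # tall fathers
avg_tall_father = sum(x for x, _ in tall) / len(tall)
avg_tall_son = sum(y for _, y in tall) / len(tall)
# the sons of tall fathers are on average shorter than their fathers,
# but still taller than the population mean
```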
The distribution of linear and quadratic functions of normal random vectors

Let X be a K-dimensional random vector with X ~ N(µ, Σ). Consider the random variables

(i) Y_1 = a'X, with a a K-vector of constants (Y_1 is scalar);
(ii) Y_2 = AX + b, with A an M × K matrix and b an M-vector of constants;
(iii) Y_3 = X'CX, with C a symmetric K × K matrix of constants.
From the mgfs of Y_1 and Y_2 we find

(i) Y_1 ~ N(a'µ, a'Σa). Exercise: Derive this.
(ii) Y_2 ~ N(Aµ + b, AΣA'). Exercise: Derive this.

We verify:

E(Y_2) = AE(X) + b = Aµ + b
Var(Y_2) = E[(Y_2 - Aµ - b)(Y_2 - Aµ - b)'] = E[(AX - Aµ)(AX - Aµ)'] = E[A(X - µ)(X - µ)'A'] = A E[(X - µ)(X - µ)'] A' = AΣA'.

(iii) Special case: X ~ N(0, I) and C idempotent, i.e. C^2 = C, the matrix generalization of unity.
79 Let P be the K x K matrix of eigenvectors of C, chosen such that P'P = I, i.e. P is orthonormal. Define the diagonal matrix of eigenvalues
Λ = diag(λ_1, ..., λ_K).
We have
CP = PΛ and P'CP = Λ,
so that
C = PΛP'
because by P'P = I we have P' = P^{-1}. Hence
PΛP' = C = C² = PΛ²P'
so that Λ² = Λ.
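The conclusion Λ² = Λ (eigenvalues all 0 or 1, so tr(C) equals the rank) can be illustrated numerically. A sketch using a projection matrix C = Z(Z'Z)^{-1}Z' as the idempotent example; Z and the dimensions K = 6, L = 3 are assumptions chosen for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Symmetric idempotent matrix: the projection onto the column space of
# an (assumed, arbitrary) K x L matrix Z with full column rank
K, L = 6, 3
Z = rng.standard_normal((K, L))
C = Z @ np.linalg.inv(Z.T @ Z) @ Z.T

# C^2 = C, every eigenvalue is (numerically) 0 or 1, and tr(C) = L
eigvals = np.linalg.eigvalsh(C)
idem_err = np.abs(C @ C - C).max()
trace_C = np.trace(C)
```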
80 This implies that each λ_k is either 0 or 1. Let L be the number of eigenvalues equal to 1 and consider
Z = P'X,
so that Z ~ N(0, I). Hence
Y_3 = X'PΛP'X = Z'ΛZ = Σ_{k=1}^{K} λ_k Z_k² ~ χ²(L).
Finally,
tr(C) = tr(PΛP') = tr(ΛP'P) = tr(Λ) = L.
Let X_1 and X_2 be subvectors of X of dimensions K_1 and K_2 with K_1 + K_2 = K. Then the variance matrix of X is
Σ = [ Σ_11   Σ_12
      Σ_12'  Σ_22 ]
with Var(X_1) = Σ_11, Var(X_2) = Σ_22 and Σ_12 = E[(X_1 - µ_1)(X_2 - µ_2)']. We have that X_1 and X_2 are independent if and only if Σ_12 = 0. To see this, note that if Σ_12 = 0,
Σ^{-1} = [ Σ_11  0
           0     Σ_22 ]^{-1} = [ Σ_11^{-1}  0
                                 0          Σ_22^{-1} ]
Hence
(x - µ)'Σ^{-1}(x - µ) = (x_1 - µ_1)'Σ_11^{-1}(x_1 - µ_1) + (x_2 - µ_2)'Σ_22^{-1}(x_2 - µ_2).
Substitution in the density of the multivariate normal distribution shows that this density factorizes into a function of x_1 and a function of x_2, which establishes that these random vectors are independent.
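The distributional result Y_3 = X'CX ~ χ²(L) for idempotent C can be checked through the first two moments: a χ²(L) variable has mean L and variance 2L. A simulation sketch, again using an assumed projection matrix of rank L = 3:

```python
import numpy as np

rng = np.random.default_rng(3)

# Symmetric idempotent C of rank L (projection matrix, assumed example)
K, L = 6, 3
Z = rng.standard_normal((K, L))
C = Z @ np.linalg.inv(Z.T @ Z) @ Z.T

# For X ~ N(0, I_K), Y3 = X'CX should be chi-squared with tr(C) = L d.f.
X = rng.standard_normal((300_000, K))
Y3 = np.einsum('ni,ij,nj->n', X, C, X)  # quadratic form row by row

# A chi-squared(L) variable has mean L and variance 2L
mean_Y3, var_Y3 = Y3.mean(), Y3.var()
```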
81 Conclusion: in the normal distribution, X_1 and X_2 are independent if and only if Cov(X_1, X_2) = 0.
Define Y_4 = X'BX with B symmetric and idempotent. Then if X ~ N(0, I):
(i) Y_1 and Y_3 are independent if and only if Ca = 0.
(ii) Y_3 and Y_4 are stochastically independent if and only if BC = CB = 0.
Proof:
(i) Y_3 = X'CX = X'C²X = X'C'CX, which is a function of CX. Hence Y_1 and Y_3 are independent if and only if
Cov(CX, a'X) = E(CXX'a) = Ca = 0.
(ii) Y_3 = X'C'CX is a function of CX and Y_4 = X'B'BX is a function of BX, so Y_3 and Y_4 are independent if and only if
Cov(BX, CX) = E(BXX'C') = BC' = BC = 0.
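Result (ii) can be illustrated numerically: two quadratic forms built from projections onto orthogonal subspaces satisfy BC = 0 and are independent. A sketch; the dimensions and the construction of B and C from an orthonormal basis are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(4)

# Two symmetric idempotent matrices with BC = CB = 0: projections onto
# orthogonal subspaces, built from a random orthonormal basis Q
K = 6
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))
C = Q[:, :2] @ Q[:, :2].T    # projection onto the first 2 basis vectors
B = Q[:, 2:5] @ Q[:, 2:5].T  # projection onto the next 3

# For X ~ N(0, I), Y3 = X'CX and Y4 = X'BX should be independent
X = rng.standard_normal((300_000, K))
Y3 = np.einsum('ni,ij,nj->n', X, C, X)
Y4 = np.einsum('ni,ij,nj->n', X, B, X)

bc_err = np.abs(B @ C).max()            # verifies BC = 0
corr = np.corrcoef(Y3, Y4)[0, 1]        # should be near zero
```

The sample correlation between the two quadratic forms vanishes (up to sampling noise), consistent with their independence.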