Introduction to Probability Theory


Introduction to Probability Theory
Ping Yu
Department of Economics, University of Hong Kong

1 Foundations
2 Random Variables
3 Expectation
4 Multivariate Random Variables
5 Conditional Distributions and Expectation
6 Transformations
7 The Normal and Related Distributions

Foundations

Founder of Modern Probability Theory

Andrey N. Kolmogorov (1903-1987), Russian.

Sample Space, Event and Probability

The set $\Omega$ of all possible outcomes of an experiment is called the sample space for the experiment.
- Take the simple example of tossing a coin. There are two outcomes, heads and tails, so we can write $\Omega = \{H, T\}$.
- If two coins are tossed in sequence, we can write the four outcomes as $\Omega = \{HH, HT, TH, TT\}$.
An event A is any collection of possible outcomes of an experiment. An event is a subset of $\Omega$, including $\Omega$ itself and the null set $\emptyset$.
- Continuing the two-coin example, one event is $A = \{HH, HT\}$, the event that the first coin is heads.
We say that A and B are disjoint or mutually exclusive if $A \cap B = \emptyset$.
- For example, the sets $\{HH, HT\}$ and $\{TH\}$ are disjoint.
If the sets $A_1, A_2, \ldots$ are pairwise disjoint and $\bigcup_{i=1}^{\infty} A_i = \Omega$, then the collection $A_1, A_2, \ldots$ is called a partition of $\Omega$.
A probability function P assigns probabilities (numbers between 0 and 1) to events A in $\Omega$.
- P satisfies $P(\Omega) = 1$, $P(A^c) = 1 - P(A)$, and $P(A) \leq P(B)$ if $A \subset B$. This implies $P(\emptyset) = 0$.
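
(Aside, not in the original slides.) A minimal Python sketch of the two-coin example, assuming equally likely outcomes; the names omega, prob and P below are illustrative only:

```python
from itertools import product

omega = list(product("HT", repeat=2))              # sample space {HH, HT, TH, TT}
prob = {o: 1 / len(omega) for o in omega}          # probability function: equally likely outcomes
P = lambda event: sum(prob[o] for o in event)      # P(A) = sum of outcome probabilities in A

A = [o for o in omega if o[0] == "H"]              # event: the first coin is heads
A_c = [o for o in omega if o not in A]             # complement of A

print(P(omega))               # P(Omega) = 1.0
print(P(A), 1 - P(A_c))       # P(A) = 0.5 = 1 - P(A^c)
```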

Conditional Probability and Independence of Two Events

If $P(B) > 0$, the conditional probability of the event A given the event B is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
- Rearranging the definition, we can write $P(A \cap B) = P(A \mid B)P(B)$.
We say that the events A and B are statistically independent when $P(A \cap B) = P(A)P(B)$, which implies $P(A \mid B) = P(A)$, i.e., the occurrence of B carries no information about the likelihood of event A.
We say that the collection of events $A_1, \ldots, A_k$ are mutually independent when for any subset $\{A_i : i \in I\}$,
$$P\left(\bigcap_{i \in I} A_i\right) = \prod_{i \in I} P(A_i).$$
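
(Aside, not in the original slides.) A minimal Python sketch of conditional probability and independence on the two-coin sample space, assuming equally likely outcomes; the names omega, P, A and B are illustrative only:

```python
from itertools import product

omega = set(product("HT", repeat=2))                    # {HH, HT, TH, TT}, equally likely
P = lambda event: len(event) / len(omega)               # uniform probability function

A = {o for o in omega if o[0] == "H"}                   # event: first coin is heads
B = {o for o in omega if o[1] == "H"}                   # event: second coin is heads

print(P(A & B) / P(B))                                  # P(A | B) = P(A ∩ B)/P(B) = 0.5 = P(A)
print(abs(P(A & B) - P(A) * P(B)) < 1e-12)              # True: A and B are independent
```

Here $P(A \mid B) = P(A)$, so learning B tells us nothing about A, exactly as the definition of independence requires.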

Bayes Rule or Theorem

Theorem (Bayes Rule or Theorem). For any set B and any partition $A_1, A_2, \ldots$ of the sample space, for each $i = 1, 2, \ldots$,
$$P(A_i \mid B) = \frac{P(B \mid A_i)P(A_i)}{\sum_{j=1}^{\infty} P(B \mid A_j)P(A_j)}.$$

Example (Monty Hall Problem). The Monty Hall problem is a brain teaser, first appearing on the American television game show Let's Make a Deal and named after its original host, Monty Hall. Problem: suppose you're given the choice of three doors: behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

Thomas Bayes (1701-1761), English

Thomas Bayes was an English statistician. He never published what would eventually become his most famous accomplishment; his notes were edited and published after his death by Richard Price.

Figure: Tree showing the probability of every possible outcome if the player initially picks door 1.

Monty Hall Problem: A Rigorous Analysis

Consider the events C1, C2 and C3, indicating that the car is behind door 1, 2 or 3, respectively. All three events have probability 1/3. The player initially choosing door 1 is described by the event X1. As the first choice of the player is independent of the position of the car, the conditional probabilities are $P(C_i \mid X_1) = 1/3$, $i = 1, 2, 3$. The host opening door 3 is described by H3. For this event it holds that $P(H_3 \mid C_1, X_1) = 1/2$, $P(H_3 \mid C_2, X_1) = 1$ and $P(H_3 \mid C_3, X_1) = 0$. If the player initially selects door 1 and the host opens door 3, the conditional probability of winning by switching is
$$P(C_2 \mid H_3, X_1) = \frac{P(H_3 \mid C_2, X_1)P(C_2 \cap X_1)}{P(H_3 \cap X_1)} = \frac{P(H_3 \mid C_2, X_1)P(C_2 \cap X_1)}{\sum_{i=1}^{3} P(H_3 \mid C_i, X_1)P(C_i \cap X_1)} = \frac{P(H_3 \mid C_2, X_1)}{\sum_{i=1}^{3} P(H_3 \mid C_i, X_1)} = \frac{1}{1/2 + 1 + 0} = \frac{2}{3}.$$
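
(Aside, not in the original slides.) The 2/3 answer can be sanity-checked by simulation. The sketch below, with the hypothetical helper monty_hall, plays the game many times and compares switching with staying:

```python
import random

def monty_hall(trials=100_000, switch=True, seed=0):
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)                          # door hiding the car
        pick = 0                                        # the player always picks door 1 (index 0)
        host = rng.choice([d for d in range(3) if d != pick and d != car])  # host opens a goat door
        if switch:
            pick = next(d for d in range(3) if d != pick and d != host)     # switch to the remaining door
        wins += (pick == car)
    return wins / trials

print(monty_hall(switch=True))   # ≈ 2/3
print(monty_hall(switch=False))  # ≈ 1/3
```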

Random Variables

Random Variables and CDFs

A random variable (r.v.) X is a function from a sample space $\Omega$ into the real line.
- The r.v. transforms the abstract elements in $\Omega$ to analyzable real values.
- Notation: we denote r.v.'s by uppercase letters such as X, and use lowercase letters such as x for potential values and realized values.
- Caution: be careful about the difference between a random variable and its realized value. The former is a function from outcomes to values, while the latter is a value associated with a specific outcome.
For a r.v. X we define its cumulative distribution function (cdf) as $F(x) = P(X \leq x)$.
- Notation: sometimes we write this as $F_X(x)$ to denote that it is the cdf of X.
The support of a r.v. X is the closure of the set of possible values that X can take, denoted supp(X).

Discrete Variables

The r.v. X is discrete if $F(x)$ is a step function.
- The support of a discrete r.v. consists of finitely or countably many values $x_1, \ldots, x_J$, where J can be $\infty$.
The probability function for X takes the form of the probability mass function (pmf)
$$P(X = x_j) = p_j, \quad j = 1, \ldots, J, \qquad (1)$$
where $0 \leq p_j \leq 1$ and $\sum_{j=1}^{J} p_j = 1$.
- $F(x) = \sum_{j=1}^{J} p_j 1(x_j \leq x)$, where $1(\cdot)$ is the indicator function, which equals one when the event in the parentheses is true and zero otherwise.
A famous discrete r.v. is the Bernoulli (or binary) r.v., where $J = 2$, $x_1 = 0$, $x_2 = 1$, $p_2 = p$ and $p_1 = 1 - p$. [Figure here]
- The Bernoulli distribution is often used to model sex, employment status, and other dichotomies.
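
(Aside, not in the original slides.) A small Python sketch of a Bernoulli pmf and its step-function cdf, with p = 0.5; pmf and cdf are illustrative names:

```python
p = 0.5
support = [0, 1]
pmf = {0: 1 - p, 1: p}                                  # P(X = x_j) = p_j

def cdf(x):
    # F(x) = sum_j p_j * 1(x_j <= x): a step function
    return sum(pmf[xj] for xj in support if xj <= x)

print([cdf(x) for x in (-0.5, 0, 0.7, 1, 2)])           # [0, 0.5, 0.5, 1.0, 1.0]
```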

Jacob Bernoulli (1655-1705), Swiss

Jacob Bernoulli was one of the many prominent mathematicians in the Bernoulli family.

Continuous Random Variables

The r.v. X is continuous if $F(x)$ is continuous in x.
- In this case $P(X = x) = 0$ for all $x \in \mathbb{R}$, so the representation (1) is unavailable. We instead represent the relative probabilities by the probability density function (pdf)
$$f(x) = \frac{d}{dx} F(x).$$
- A function $f(x)$ is a pdf iff $f(x) \geq 0$ for all $x \in \mathbb{R}$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
- The support of a continuous r.v. is $\{x \in \mathbb{R} \mid f(x) > 0\}$.
By the fundamental theorem of calculus,
$$F(x) = \int_{-\infty}^{x} f(u)\,du \quad \text{and} \quad P(a \leq X \leq b) = F(b) - F(a) = \int_{a}^{b} f(u)\,du.$$

Examples of Continuous R.V.s

A famous continuous r.v. is the normal r.v. [Figure here] The standard normal density is
$$\phi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right), \quad -\infty < x < \infty.$$
- Notation: write $X \sim N(0,1)$, and denote the standard normal cdf by $\Phi(x)$. $\Phi(x)$ has no closed-form expression. [Figure here]
- The normal distribution is often used to model human heights and weights, test scores, and county unemployment rates.
Another famous continuous r.v. is the standard uniform r.v., whose pdf is $f(x) = 1(0 \leq x \leq 1)$, i.e., X can occur only on [0,1] and occurs uniformly. We denote $X \sim U[0,1]$. [Figure here]
- A generalization is $X \sim U[a,b]$, $a < b$.
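
(Aside, not in the original slides.) A sketch of the standard normal pdf and cdf and the U[0,1] pdf using only Python's standard library; since $\Phi(x)$ has no closed form, the evaluation below goes through math.erf:

```python
import math

def phi(x):
    # standard normal density
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    # standard normal cdf, evaluated via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def uniform_pdf(x, a=0.0, b=1.0):
    # U[a, b] density
    return 1 / (b - a) if a <= x <= b else 0.0

print(phi(0))             # ≈ 0.3989
print(Phi(1.96))          # ≈ 0.975
print(uniform_pdf(0.3))   # 1.0
```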

Carl F. Gauss (1777-1855), Göttingen

The normal distribution was invented by Carl F. Gauss (1777-1855), so it is also known as the Gaussian distribution.

Figure: PMF, PDF and CDF (Bernoulli with p = 0.5, standard normal, and uniform distributions)

A function $F(x)$ is a cdf iff the following three properties hold: (i) $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$; (ii) $F(x)$ is nondecreasing in x; (iii) $F(x)$ is right-continuous, i.e., $F(x) = \lim_{t \downarrow x} F(t)$.

Expectation

Expectation

For any real function g, we define the mean or expectation $E[g(X)]$ as follows: if X is discrete,
$$E[g(X)] = \sum_{j=1}^{J} g(x_j) p_j,$$
and if X is continuous,
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x)\,dx.$$
- The mean is a weighted average of all possible values of X, with the weights determined by its pmf or pdf.
To cover both the discrete and continuous cases (and even more general cases), we often write
$$E[g(X)] = \int g(x)\,dF(x),$$
which is a Riemann-Stieltjes integral (why is this intuitively correct? see Chapter 1).
Since $E[a + bX] = a + bE[X]$, we say that expectation is a linear operator.

Moments

For $m > 0$, we define the m-th moment of X as $E[X^m]$, the m-th central moment as $E[(X - E[X])^m]$, the m-th absolute moment of X as $E|X|^m$, and the m-th absolute central moment as $E|X - E[X]|^m$.
Two special moments are the first moment, the mean $\mu = E[X]$, which is a measure of central tendency, and the second central moment, the variance $\sigma^2 = E[(X - \mu)^2] = E[X^2] - \mu^2$, which is a measure of variability.
- We call $\sigma = \sqrt{\sigma^2}$ the standard deviation of X.
- We also write $\sigma^2 = \mathrm{Var}(X)$, which allows the convenient expression $\mathrm{Var}(a + bX) = b^2 \mathrm{Var}(X)$.
The standard normal density has all moments finite.
- Since it is symmetric about zero, all odd moments are zero.
- We can also show that $E[X^2] = 1$ and $E[X^4] = 3$.
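
(Aside, not in the original slides.) A quick Monte Carlo check of these moment claims for the standard normal, assuming NumPy is available:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)          # draws from N(0, 1)

print(x.mean(), x.var())                    # ≈ 0 and ≈ 1 (mean and variance)
print((x**3).mean())                        # ≈ 0 (odd moments vanish by symmetry)
print((x**2).mean(), (x**4).mean())         # ≈ 1 and ≈ 3
```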

Multivariate Random Variables

Bivariate Random Variables

A pair of bivariate r.v.'s (X, Y) is a function from the sample space into $\mathbb{R}^2$.
The joint cdf of (X, Y) is
$$F(x, y) = P(X \leq x, Y \leq y).$$
If F is continuous, the joint pdf is
$$f(x, y) = \frac{\partial^2}{\partial x \partial y} F(x, y).$$
- For any set $A \subset \mathbb{R}^2$,
$$P((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy.$$
- The discrete case can be discussed in parallel.
For any function g(x, y),
$$E[g(X, Y)] = \iint g(x, y) f(x, y)\,dx\,dy.$$

Marginal Distributions

The marginal distribution of X is
$$F_X(x) = P(X \leq x) = \lim_{y \to \infty} F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(u, y)\,dy\,du,$$
so by the fundamental theorem of calculus, the marginal density of X is
$$f_X(x) = \frac{d}{dx} F_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy.$$
Similarly, the marginal density of Y is
$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx.$$

Independence of Two R.V.s

The r.v.'s X and Y are defined to be independent if $f(x, y) = f_X(x) f_Y(y)$.
- X and Y are independent iff there exist functions g(x) and h(y) such that $f(x, y) = g(x)h(y)$. (Exercise)
If X and Y are independent, then
$$E[g(X)h(Y)] = \iint g(x)h(y) f(x, y)\,dx\,dy \qquad (2)$$
$$= \iint g(x)h(y) f_X(x) f_Y(y)\,dx\,dy = \int g(x) f_X(x)\,dx \int h(y) f_Y(y)\,dy = E[g(X)]E[h(Y)].$$
- For example, if X and Y are independent, then $E[XY] = E[X]E[Y]$.

Covariance and Correlation

The covariance between X and Y is
$$\mathrm{Cov}(X, Y) = \sigma_{XY} = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y].$$
The correlation between X and Y is
$$\mathrm{Corr}(X, Y) = \rho_{XY} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}.$$
The Cauchy-Schwarz inequality implies that $|\rho_{XY}| \leq 1$.
The correlation is a measure of linear dependence, free of units of measurement. [Figure here]
If X and Y are independent, then $\sigma_{XY} = 0$ and $\rho_{XY} = 0$. The reverse, however, is not true. For example, if $E[X] = 0$ and $E[X^3] = 0$, then $\mathrm{Cov}(X, X^2) = 0$.
$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) + 2\mathrm{Cov}(X, Y)$.
- If X and Y are independent, then $\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$: the variance of the sum is the sum of the variances.
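
(Aside, not in the original slides.) A simulation check, assuming NumPy is available, that $\mathrm{Cov}(X, X^2) = 0$ for standard normal X even though X and $X^2$ are clearly dependent:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x**2                                    # a deterministic function of x

print(np.cov(x, y)[0, 1])                   # ≈ 0: Cov(X, X^2) = E[X^3] = 0
print(np.corrcoef(x, y)[0, 1])              # ≈ 0: uncorrelated, yet not independent
```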

More Than Two R.V.s

A $k \times 1$ random vector $X = (X_1, \ldots, X_k)'$ is a function from $\Omega$ to $\mathbb{R}^k$.
The vector X has the distribution and density functions
$$F(x) = P(X \leq x), \quad f(x) = \frac{\partial^k}{\partial x_1 \cdots \partial x_k} F(x),$$
where $x = (x_1, \ldots, x_k)'$ denotes a vector in $\mathbb{R}^k$.
For any function $g : \mathbb{R}^k \to \mathbb{R}^s$, define the expectation
$$E[g(X)] = \int_{\mathbb{R}^k} g(x) f(x)\,dx,$$
where the symbol dx denotes $dx_1 \cdots dx_k$.
- The $k \times 1$ multivariate mean is $\mu = E[X]$.
- The $k \times k$ covariance matrix is $\Sigma = E[(X - \mu)(X - \mu)'] = E[XX'] - \mu\mu'$.
If the elements of X are mutually independent, then $\Sigma$ is a diagonal matrix and
$$\mathrm{Var}\left(\sum_{i=1}^{k} X_i\right) = \sum_{i=1}^{k} \mathrm{Var}(X_i).$$
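
(Aside, not in the original slides.) A NumPy check, under the assumption of mutually independent components, that the covariance matrix is (approximately) diagonal and that the variance of the sum equals the sum of the variances:

```python
import numpy as np

rng = np.random.default_rng(0)
# three independent columns with standard deviations 1, 2 and 3
X = np.column_stack([rng.normal(0, s, 500_000) for s in (1.0, 2.0, 3.0)])

Sigma = np.cov(X, rowvar=False)             # sample covariance matrix
print(np.round(Sigma, 2))                   # ≈ diag(1, 4, 9)
print(X.sum(axis=1).var(), Sigma.trace())   # Var(sum) ≈ sum of variances ≈ 14
```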

Conditional Distributions and Expectation

Conditional Distributions and Expectation

The conditional density of Y given X = x is defined as
$$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}$$
if $f_X(x) > 0$.
The conditional mean or conditional expectation is the function
$$m(x) = E[Y \mid X = x] = \int y f_{Y|X}(y \mid x)\,dy.$$
The conditional mean m(x) is a function, meaning that when X equals x, the expected value of Y is m(x).
The conditional variance of Y given X = x is
$$\sigma^2(x) = \mathrm{Var}(Y \mid X = x) = E[(Y - m(x))^2 \mid X = x] = E[Y^2 \mid X = x] - m(x)^2.$$
If Y and X are independent, then $E[Y \mid X = x] = E[Y]$ and $\mathrm{Var}(Y \mid X = x) = \mathrm{Var}(Y)$. (why?)

Example

Suppose
$$f_{XY}(x, y) = \begin{cases} 8xy, & \text{if } 0 < x < y < 1, \\ 0, & \text{otherwise.} \end{cases}$$
Are X and Y independent? What are the marginal densities of X and Y? What is the conditional density of Y given X? What are $E[Y \mid X = x]$ and $\mathrm{Var}(Y \mid X = x)$?

Solution: It is tempting to claim that X and Y are independent, since it seems that $f(x, y) = g(x)h(y)$ with $g(x) = 8x$ and $h(y) = y$. However, such an argument neglects the support of (X, Y), which is shown in the left panel of the figure below. The marginal density of X is $f_X(x) = \int_x^1 8xy\,dy = 4x(1 - x^2)$, $0 < x < 1$, and the marginal density of Y is $f_Y(y) = \int_0^y 8xy\,dx = 4y^3$, $0 < y < 1$. Obviously,
$$f_{XY}(x, y) = 8xy \cdot 1(0 < x < y < 1) \neq 4x(1 - x^2) 1(0 < x < 1) \cdot 4y^3 1(0 < y < 1) = f_X(x) f_Y(y),$$
so X and Y are not independent.

Solution [continued]

The conditional density of Y given X = x is
$$f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = \begin{cases} \dfrac{8xy}{4x(1 - x^2)} = \dfrac{2y}{1 - x^2}, & \text{if } 0 < x < y < 1, \\ 0, & \text{otherwise.} \end{cases}$$
So when $0 < x < 1$,
$$E[Y \mid X = x] = \int_x^1 y \frac{2y}{1 - x^2}\,dy = \frac{2}{3} \cdot \frac{1 + x + x^2}{1 + x},$$
$$E[Y^2 \mid X = x] = \int_x^1 y^2 \frac{2y}{1 - x^2}\,dy = \frac{1}{2} \cdot \frac{1 + x + x^2 + x^3}{1 + x},$$
and
$$\mathrm{Var}(Y \mid X = x) = E[Y^2 \mid X = x] - E[Y \mid X = x]^2 = \frac{1}{2} \cdot \frac{1 + x + x^2 + x^3}{1 + x} - \left(\frac{2}{3} \cdot \frac{1 + x + x^2}{1 + x}\right)^2 = \frac{(1 - x)^2(1 + 4x + x^2)}{18(1 + x)^2}.$$
The right panel of the figure shows $E[Y \mid X = x]$ and $\mathrm{Var}(Y \mid X = x)$ as functions of x. $E[Y \mid X = x]$ is increasing with $E[Y \mid X = 0] = 2/3$ and $E[Y \mid X = 1] = 1$, and $\mathrm{Var}(Y \mid X = x)$ is decreasing with $\mathrm{Var}(Y \mid X = 0) = 1/18$ and $\mathrm{Var}(Y \mid X = 1) = 0$. Since $E[Y \mid X = x]$ and $\mathrm{Var}(Y \mid X = x)$ are not constant, X and Y are not independent.
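
(Aside, not in the original slides.) The closed-form answers above can be cross-checked numerically. The sketch below, assuming NumPy, approximates the integrals for $f_X(x)$ and $E[Y \mid X = x]$ by averaging over a fine grid on [0, 1]:

```python
import numpy as np

f = lambda x, y: 8 * x * y * ((0 < x) & (x < y) & (y < 1))   # joint density on the triangle

y = np.linspace(0, 1, 200_001)                               # grid on [0, 1] for crude quadrature

def f_X(x):
    return np.mean(f(x, y))                 # ≈ ∫_0^1 f(x, u) du = 4x(1 - x^2)

def cond_mean(x):
    return np.mean(y * f(x, y)) / f_X(x)    # ≈ E[Y | X = x] = (2/3)(1 + x + x^2)/(1 + x)

print(f_X(0.5), 4 * 0.5 * (1 - 0.5**2))                      # both ≈ 1.5
print(cond_mean(0.5), (2 / 3) * (1 + 0.5 + 0.25) / 1.5)      # both ≈ 0.778
```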

Figure: supp(X, Y), $E[Y \mid X = x]$ and $\mathrm{Var}(Y \mid X = x)$

Transformations

Transformations

Suppose that $X \in \mathbb{R}^k$ has continuous distribution function $F_X(x)$ and density $f_X(x)$. Let $Y = g(X)$, where $g(x) : \mathbb{R}^k \to \mathbb{R}^k$ is one-to-one, differentiable, and invertible. Let h(y) denote the inverse function of g(x). Recall that the Jacobian of h is defined as
$$J(y) = \det\left(\frac{\partial h(y)}{\partial y'}\right),$$
where for a matrix A, det(A) is the determinant of A.
Consider the univariate case k = 1. If g(x) is a strictly increasing function, then $g(X) \leq y$ iff $X \leq h(y)$, so the distribution function of Y is
$$F_Y(y) = P(g(X) \leq y) = P(X \leq h(y)) = F_X(h(y)).$$
Taking the derivative, the density of Y is
$$f_Y(y) = \frac{d}{dy} F_Y(y) = f_X(h(y)) \frac{d}{dy} h(y).$$

Transformations [continued]

If g(x) is a strictly decreasing function, then $g(X) \leq y$ iff $X \geq h(y)$, so
$$F_Y(y) = P(g(X) \leq y) = P(X \geq h(y)) = 1 - F_X(h(y)),$$
and the density of Y is
$$f_Y(y) = \frac{d}{dy} F_Y(y) = -f_X(h(y)) \frac{d}{dy} h(y).$$
We can write these two cases jointly as
$$f_Y(y) = f_X(h(y)) |J(y)|, \qquad (3)$$
where $|J(y)|$ is the absolute value of J(y) (why? because $f_Y(y) \geq 0$). This is known as the change-of-variables formula.
- The same formula (3) holds for $k > 1$.

Example

Suppose $X \sim U[0,1]$ and $Y = -\log(X)$. Here, $g(x) = -\log(x)$ and $h(y) = \exp(-y)$, so the Jacobian is $J(y) = -\exp(-y)$. As the range of X is [0,1], that of Y is $[0, \infty)$. Since $f_X(x) = 1$ for $0 \leq x \leq 1$, (3) shows that
$$f_Y(y) = \exp(-y), \quad 0 \leq y < \infty,$$
an exponential density.

Example

If Z is standard normal and $X = \mu + \sigma Z$, then $h(x) = \frac{x - \mu}{\sigma}$ and the Jacobian is $1/\sigma$. Since Z has the density
$$\phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right), \quad -\infty < z < \infty,$$
X has density
$$f(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), \quad -\infty < x < \infty,$$
which is the univariate normal density. The mean and variance of the distribution are $\mu$ and $\sigma^2$, and it is conventional to write $X \sim N(\mu, \sigma^2)$.
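
(Aside, not in the original slides.) A simulation check of the first example, assuming NumPy: draws of $Y = -\log(X)$ with $X \sim U[0,1]$ should behave like a unit exponential:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 1_000_000)            # X ~ U[0, 1]
y = -np.log(x)                              # Y = -log(X)

print(y.mean(), y.var())                    # both ≈ 1 for the unit exponential
print((y > 2).mean(), np.exp(-2))           # empirical vs exact P(Y > 2) ≈ 0.135
```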

The Normal and Related Distributions

The Normal Distribution

For $x \in \mathbb{R}^k$, the multivariate normal density is
$$f(x) = \frac{1}{(2\pi)^{k/2} \det(\Sigma)^{1/2}} \exp\left(-\frac{(x - \mu)' \Sigma^{-1} (x - \mu)}{2}\right), \quad x \in \mathbb{R}^k.$$
- The mean and covariance matrix of the distribution are $\mu$ and $\Sigma$.
- When X has the multivariate normal density, we say that X follows the multivariate normal distribution and write $X \sim N(\mu, \Sigma)$.
If $X \in \mathbb{R}^k$ is multivariate normal and the elements of X are mutually uncorrelated, then $\Sigma = \mathrm{diag}\{\sigma_j^2\}$ is a diagonal matrix. In this case the density function can be written as
$$f(x) = \frac{1}{(2\pi)^{k/2} \sigma_1 \cdots \sigma_k} \exp\left(-\frac{(x_1 - \mu_1)^2/\sigma_1^2 + \cdots + (x_k - \mu_k)^2/\sigma_k^2}{2}\right) = \prod_{j=1}^{k} \frac{1}{\sqrt{2\pi}\sigma_j} \exp\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right),$$
which is the product of marginal univariate normal densities. This shows that if X is multivariate normal with uncorrelated elements, then they are mutually independent.

The Chi-Square Distribution

If $X \sim N(\mu, \Sigma)$ is a k-dimensional multivariate normal vector, and $Y = a + BX$, where B is an $m \times k$ matrix with full row rank, then $Y \sim N(a + B\mu, B\Sigma B')$ is an m-dimensional multivariate normal vector.
- An affine transformation of a multivariate normal vector is normally distributed.
Let $X \sim N(0, I_r)$. Then $Q = X'X$ is distributed chi-square with r degrees of freedom (or df for short), written $Q \sim \chi^2_r$.
- Q is positive and asymmetric, with $E[Q] = r$ and $\mathrm{Var}(Q) = 2r$ (why?). [Figure here]
If $Z \sim N(0, A)$ with $A > 0$, $q \times q$, then $Z' A^{-1} Z \sim \chi^2_q$. (why?)
The t-distribution and F-distribution are also related to the normal distribution, but are used only when the sample size is small. In modern times, the sample size tends to be huge, so the chi-square distribution is the more suitable "large-sample" distribution.
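
(Aside, not in the original slides.) A quick simulation, assuming NumPy, of $Q = X'X$ with $X \sim N(0, I_r)$, checking $E[Q] = r$ and $\mathrm{Var}(Q) = 2r$:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 4
X = rng.standard_normal((1_000_000, r))     # each row is a draw of X ~ N(0, I_r)
Q = (X**2).sum(axis=1)                      # Q = X'X

print(Q.mean(), Q.var())                    # ≈ r = 4 and ≈ 2r = 8
```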

Figure: Chi-Square Density for r = 1, 2, 3, 4, 6, 9