Multivariate Distributions (Hogg Chapter Two)

Multivariate Distributions (Hogg Chapter Two) STAT 45-1: Mathematical Statistics I, Fall Semester 2015

Contents

1 Multivariate Distributions
  1.1 Random Vectors
    1.1.1 Two Discrete Random Variables
    1.1.2 Two Continuous Random Variables
  1.2 Marginalization
  1.3 Expectation Values
    1.3.1 Moment Generating Function
  1.4 Transformations
    1.4.1 Transformation of Discrete RVs
    1.4.2 Transformation of Continuous RVs
2 Conditional Distributions
  2.1 Conditional Probability
  2.2 Conditional Probability Distributions
    2.2.1 Example
  2.3 Conditional Expectations
3 Independence
  3.1 Dependent rv example #1
  3.2 Dependent rv example #2
  3.3 Factoring the joint pdf
  3.4 Expectation Values
  3.5 Covariance and Correlation
4 Generalization to Several RVs
  4.1 Linear Algebra: Reminders and Notation
  4.2 Multivariate Probability Distributions
    4.2.1 Marginal and Conditional Distributions
    4.2.2 Mutual vs Pairwise Independence
  4.3 The Variance-Covariance Matrix
  4.4 Transformations of Several RVs
  4.5 Example #1
  4.6 Example #2 (fewer Y's than X's)

Copyright 2015, John T. Whelan, and all that

Tuesday 8 September 2015. Read Section 2.1 of Hogg.

1 Multivariate Distributions

We introduced a random variable X as a function X(c) which assigned a real number to each outcome c in the sample space

C. There's no reason, of course, that we can't define multiple such functions, and we now turn to the formalism for dealing with multiple random variables at the same time.

1.1 Random Vectors

We can think of several random variables X_1, X_2, ..., X_n as making up the elements of a random vector

    X = (X_1, X_2, ..., X_n)^T

We can define an event X ∈ A corresponding to the random vector X lying in some region A ⊂ R^n, and use the probability function to define the probability P(X ∈ A) of this event. We'll focus on n = 2 initially, and define the joint cumulative distribution function of two random variables X_1 and X_2 as

    F_{X_1,X_2}(x_1, x_2) = P([X_1 ≤ x_1] ∩ [X_2 ≤ x_2])

As in the case of a single random variable, this can be used as a starting point for defining the probability of any event we like. For example, with a bit of algebra it's possible to show that

    P[(a_1 < X_1 ≤ b_1) ∩ (a_2 < X_2 ≤ b_2)] = F(b_1, b_2) − F(b_1, a_2) − F(a_1, b_2) + F(a_1, a_2)

where we have suppressed the subscript X_1, X_2 since there's only one cdf of interest at the moment. We won't really dwell on this, though, since it's a lot easier to work with joint probability mass and density functions.

1.1.1 Two Discrete Random Variables

If both random variables are discrete, i.e., they can take on either a finite set of values or at most countably many values, the joint cdf will once again be constant aside from discontinuities, and we can describe the situation using a joint probability mass function

    p_{X_1,X_2}(x_1, x_2) = P([X_1 = x_1] ∩ [X_2 = x_2])

We give an example of this, in which we for convenience refer to the random variables as X and Y. Recall the example of three flips of a fair coin, in which we defined X as the number of heads. Now let's define another random variable Y, which is the number of tails we see before the first head is flipped. (If all three flips are tails, then Y is defined to be 3.) We can work out the probabilities by first just enumerating all of the outcomes, which are assumed to have equal probability because the coin is fair:
    outcome c   P(c)   X value   Y value
    HHH         1/8       3         0
    HHT         1/8       2         0
    HTH         1/8       2         0
    HTT         1/8       1         0
    THH         1/8       2         1
    THT         1/8       1         1
    TTH         1/8       1         2
    TTT         1/8       0         3

We can look through and see that

    p_{X,Y}(x,y) = 1/8 for (x,y) = (3,0); 2/8 for (2,0); 1/8 for (1,0); 1/8 for (2,1); 1/8 for (1,1); 1/8 for (1,2); 1/8 for (0,3); and 0 otherwise

This is most easily summarized in a table:

    p_{X,Y}(x,y)   y=0   y=1   y=2   y=3
       x=0          0     0     0    1/8
       x=1         1/8   1/8   1/8    0
       x=2         2/8   1/8    0     0
       x=3         1/8    0     0     0

Note that it's a lot more convenient to work with the joint pmf than the joint cdf. It takes a fair bit of concentration to work out the joint cdf, but if you do, you get F_{X,Y}(x,y) = 0 wherever x < 0 or y < 0, and otherwise:

    F_{X,Y}(x,y)   0≤y<1  1≤y<2  2≤y<3  y≥3
       0≤x<1         0      0      0    1/8
       1≤x<2        1/8    2/8    3/8   4/8
       2≤x<3        3/8    5/8    6/8   7/8
       x≥3          4/8    6/8    7/8   8/8

This is not very enlightening, and not really much more so if you plot it:

[Figure: the joint pmf p_{X,Y}(x,y) and the joint cdf F_{X,Y}(x,y) displayed on the (x,y) plane.]
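The joint pmf table above can be checked by brute-force enumeration. The following sketch (plain Python, not from the notes) rebuilds it with exact fractions:

```python
from fractions import Fraction
from itertools import product
from collections import defaultdict

# three flips of a fair coin: X = number of heads,
# Y = number of tails before the first head (3 if no head at all)
pmf = defaultdict(Fraction)
for flips in product("HT", repeat=3):
    x = flips.count("H")
    y = flips.index("H") if "H" in flips else 3
    pmf[(x, y)] += Fraction(1, 8)

print(pmf[(2, 0)])   # 1/4, i.e. 2/8 (from HHT and HTH)
print(pmf[(0, 3)])   # 1/8 (from TTT)

# P(X + Y = 2) = p(2,0) + p(1,1)
print(pmf[(2, 0)] + pmf[(1, 1)])   # 3/8

# marginal pmf of Y (marginalization, Section 1.2)
p_Y = [sum(p for (x, y), p in pmf.items() if y == y0) for y0 in range(4)]
print(p_Y)   # [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
```

Using `Fraction` keeps every probability exact, so the enumeration reproduces the table entries with no rounding.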

So instead, we'll work with the joint pmf and use it to calculate probabilities like

    P(X + Y = 2) = p_{X,Y}(2,0) + p_{X,Y}(1,1) = 2/8 + 1/8 = 3/8

and

    P[(0 < X ≤ 2) ∩ (Y ≤ 1)] = p_{X,Y}(1,0) + p_{X,Y}(1,1) + p_{X,Y}(2,0) + p_{X,Y}(2,1) = 1/8 + 1/8 + 2/8 + 1/8 = 5/8

In general,

    P[(X,Y) ∈ A] = Σ_{(x,y)∈A} p_{X,Y}(x,y)

1.1.2 Two Continuous Random Variables

On the other hand, we may be dealing with continuous random variables, which means that the joint cdf F_{X,Y}(x,y) is continuous. Then we can proceed as before and define a probability density function by taking derivatives of the cdf. In this case, since F_{X,Y}(x,y) has multiple arguments, we take the partial derivative, so that the joint pdf is

    f_{X,Y}(x,y) = ∂²F_{X,Y}(x,y) / (∂x ∂y)

We won't worry too much about this derivative for now, since in practice we will start with the joint pdf rather than differentiating the joint cdf to get it. We can use the pdf to assign a probability for the random vector (X,Y) to be in some region A of the (x,y) plane:

    P[(X,Y) ∈ A] = ∬_{(x,y)∈A} f_{X,Y}(x,y) dx dy

For example, for a rectangular region we have

    P[(a < X < b) ∩ (c < Y < d)] = ∫_a^b ( ∫_c^d f_{X,Y}(x,y) dy ) dx

We can connect the joint pdf to the joint cdf by considering the event (−∞ < X ≤ x) ∩ (−∞ < Y ≤ y):

    F_{X,Y}(x,y) = P[(−∞ < X ≤ x) ∩ (−∞ < Y ≤ y)] = ∫_{−∞}^x ( ∫_{−∞}^y f_{X,Y}(t,u) du ) dt

where we have called the integration variables t and u rather than x and y because the latter were already in use. As another example, for the event X + Y < c, where the region of integration is the half-plane below the line y = c − x, we have

    P(X + Y < c) = ∫_{−∞}^∞ ( ∫_{−∞}^{c−x} f_{X,Y}(x,y) dy ) dx = ∫_{−∞}^∞ ( ∫_{−∞}^{c−y} f_{X,Y}(x,y) dx ) dy

For a more explicit demonstration of why this works, consult your notes from Multivariable Calculus (specifically Fubini's theorem) and/or Probability, and/or http://ccrg.rit.edu/~whelan/courses/13_1sp_116_345/notes5.pdf

As an example, consider the joint pdf

    f(x,y) = e^{−x−y},   0 < x < ∞, 0 < y < ∞;   0 otherwise

and the event X < 2Y. The region over which we need to integrate is x > 0, y > 0, x < 2y:

[Figure: the region x > 0, y > 0, x < 2y, above the line y = x/2 in the first quadrant.]

If we do the y integral first, the limits will be set by x/2 < y < ∞, and if we do the x integral first, they will be 0 < x < 2y. Doing the y integral first will give us a contribution from only one end of the integral, so let's do it that way:

    P(X < 2Y) = ∫_0^∞ ∫_{x/2}^∞ e^{−x−y} dy dx = ∫_0^∞ e^{−x} [−e^{−y}]_{x/2}^∞ dx = ∫_0^∞ e^{−x} e^{−x/2} dx = ∫_0^∞ e^{−3x/2} dx = [−(2/3) e^{−3x/2}]_0^∞ = 2/3

1.2 Marginalization

One of the events we can define given the probability distribution for two random variables X and Y is X = x for some value x. In the case of a pair of discrete random variables, this is P(X = x) = Σ_y p_{X,Y}(x,y). But of course, P(X = x) is just the pmf of X; we call this the marginal pmf p_X(x) and define

    p_X(x) = P(X = x) = Σ_y p_{X,Y}(x,y)
    p_Y(y) = P(Y = y) = Σ_x p_{X,Y}(x,y)

Returning to our coin-flip example, we can write the marginal pmfs for X and Y in the margins of the table:

    p_{X,Y}(x,y)   y=0   y=1   y=2   y=3 | p_X(x)
       x=0          0     0     0    1/8 |  1/8
       x=1         1/8   1/8   1/8    0  |  3/8
       x=2         2/8   1/8    0     0  |  3/8
       x=3         1/8    0     0     0  |  1/8
       p_Y(y)      4/8   2/8   1/8   1/8 |

For a pair of continuous random variables, we know that P(X = x) = 0, but we can find the marginal cdf

    F_X(x) = P(X ≤ x) = ∫_{−∞}^x ( ∫_{−∞}^∞ f_{X,Y}(t,y) dy ) dt

and then take the derivative to get the marginal pdf

    f_X(x) = F_X'(x) = ∫_{−∞}^∞ f_{X,Y}(x,y) dy

and likewise

    f_Y(y) = ∫_{−∞}^∞ f_{X,Y}(x,y) dx
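The worked example above, P(X < 2Y) = 2/3 for f(x,y) = e^{−x−y}, can be checked numerically. As in the notes, the y integral is done analytically, leaving the one-dimensional integral of e^{−3x/2}; the midpoint rule, truncation length, and grid size below are arbitrary choices for the sketch:

```python
import math

# P(X < 2Y) for f(x, y) = exp(-x - y) on x > 0, y > 0.
# Doing the y integral analytically (as in the notes) leaves
#     P(X < 2Y) = integral of exp(-3x/2) over 0 < x < infinity,
# approximated here by a midpoint rule on a truncated domain.
n, L = 100_000, 30.0   # grid size and truncation length (arbitrary)
h = L / n
p = sum(math.exp(-1.5 * (i + 0.5) * h) * h for i in range(n))
print(abs(p - 2 / 3) < 1e-4)   # True
```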

The act of summing or integrating over the arguments we don't care about, in order to get a marginal probability distribution, is called marginalizing.

Thursday 10 September 2015. Read Section 2.2 of Hogg.

1.3 Expectation Values

We can define the expectation value of a function of two discrete random variables in the straightforward way

    E[g(X_1,X_2)] = Σ_{x_1,x_2} g(x_1,x_2) p(x_1,x_2)

and for two continuous random variables

    E[g(X_1,X_2)] = ∬ g(x_1,x_2) f(x_1,x_2) dx_1 dx_2

In each case, we only consider the expectation value to be defined if the relevant sum or integral converges absolutely, i.e., if E(|g(X_1,X_2)|) < ∞. Note that the expectation value is still linear, i.e.,

    E[k_1 g_1(X_1,X_2) + k_2 g_2(X_1,X_2)] = k_1 E[g_1(X_1,X_2)] + k_2 E[g_2(X_1,X_2)]

1.3.1 Moment Generating Function

In the case of a pair of random variables, we can define the mgf as

    M(t_1,t_2) = E(e^{t_1 X_1 + t_2 X_2}) = E[exp(t^T X)]

where t^T is the transpose of the column vector t. We can get the mgf for each of the random variables from the joint mgf:

    M_{X_1}(t_1) = M_{X_1,X_2}(t_1, 0)   and   M_{X_2}(t_2) = M_{X_1,X_2}(0, t_2)

We'll actually mostly use the mgf as an easy way to identify the distribution, but it can also be used to generate moments in the usual way:

    E[X_1^{m_1} X_2^{m_2}] = ∂^{m_1+m_2} M(t_1,t_2) / (∂t_1^{m_1} ∂t_2^{m_2}) evaluated at (t_1,t_2) = (0,0)

1.4 Transformations

We turn now to the question of how to transform the joint distribution function under a change of variables. In order for the distribution of Y_1 = u_1(X_1,X_2) and Y_2 = u_2(X_1,X_2) to carry the same information as the distribution of X_1 and X_2, the transformation should be invertible over the space of possible X_1 and X_2 values, i.e., we should be able to write X_1 = w_1(Y_1,Y_2) and X_2 = w_2(Y_1,Y_2).

1.4.1 Transformation of Discrete RVs

For the case of a pair of discrete random variables, things are very straightforward, since

    p_{Y_1,Y_2}(y_1,y_2) = P([Y_1 = y_1] ∩ [Y_2 = y_2]) = P([u_1(X_1,X_2) = y_1] ∩ [u_2(X_1,X_2) = y_2])
                        = P([X_1 = w_1(y_1,y_2)] ∩ [X_2 = w_2(y_1,y_2)]) = p_{X_1,X_2}(w_1(y_1,y_2), w_2(y_1,y_2))

For example, suppose

    p_{X_1,X_2}(x_1,x_2) = (1/2)^{x_1+x_2+2},   x_1 = 0,1,2,…; x_2 = 0,1,2,…;   0 otherwise

If we define Y_1 = X_1 + X_2 and Y_2 = X_1 − X_2, then X_1 = (Y_1+Y_2)/2 and X_2 = (Y_1−Y_2)/2. The only tricky part is figuring out the allowed set of values for Y_1 and Y_2. We note that x_1 = (y_1+y_2)/2 ≥ 0 and x_2 = (y_1−y_2)/2 ≥ 0 imply that, for a given y_1, −y_1 ≤ y_2 ≤ y_1. That's not quite the whole story, though, since (y_1+y_2)/2 and (y_1−y_2)/2 also have to be integers, so if y_1 is odd, y_2 must be odd, and if y_1 is even, y_2 must be even. It's easiest to see what combinations are allowed by building a table of (y_1,y_2) for the first few values:

    (y_1,y_2)   x_2=0    x_2=1    x_2=2    x_2=3
     x_1=0      (0,0)   (1,−1)   (2,−2)   (3,−3)
     x_1=1      (1,1)   (2,0)    (3,−1)   (4,−2)
     x_1=2      (2,2)   (3,1)    (4,0)    (5,−1)
     x_1=3      (3,3)   (4,2)    (5,1)    (6,0)

So evidently y_1 can be any non-negative integer, and the possible values of y_2 count by twos from −y_1 to y_1:

    p_{Y_1,Y_2}(y_1,y_2) = (1/2)^{y_1+2},   y_1 = 0,1,2,…; y_2 = −y_1, −y_1+2, …, y_1−2, y_1;   0 otherwise

1.4.2 Transformation of Continuous RVs

For the case of the transformation of continuous random variables, we have to deal with the fact that f_{X_1,X_2}(x_1,x_2) and f_{Y_1,Y_2}(y_1,y_2) are probability densities, and the volume (area) element has to be transformed from one set of variables to the other. If we write f_{X_1,X_2}(x_1,x_2) = d²P/(dx_1 dx_2) and f_{Y_1,Y_2}(y_1,y_2) = d²P/(dy_1 dy_2), the transformation we'll need is

    d²P/(dy_1 dy_2) = |det ∂(x_1,x_2)/∂(y_1,y_2)| d²P/(dx_1 dx_2)

where we use the determinant of the Jacobian matrix

    ∂(y_1,y_2)/∂(x_1,x_2) = [ ∂y_1/∂x_1  ∂y_1/∂x_2 ; ∂y_2/∂x_1  ∂y_2/∂x_2 ]

which may be familiar from the transformation of the volume element

    dy_1 dy_2 = |det ∂(y_1,y_2)/∂(x_1,x_2)| dx_1 dx_2

if we change variables in a double integral. To get a concrete handle on this, consider an example. Let X and Y be continuous random variables with a joint pdf

    f_{X,Y}(x,y) = (4/π) e^{−x²−y²},   0 < x < ∞, 0 < y < ∞;   0 otherwise

If we want to calculate the probability that X² + Y² < a², we have to integrate over the part of this disc which lies in the first quadrant x > 0, y > 0 (where the pdf is non-zero):
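The discrete transformation above can be verified by pushing the pmf through the change of variables; note that the starting pmf (1/2)^{x_1+x_2+2} is a reconstruction from a partly illegible transcription, so treat it as an assumption of this sketch:

```python
from fractions import Fraction
from collections import defaultdict

# start from the (assumed) joint pmf p(x1, x2) = (1/2)^(x1 + x2 + 2)
# and push it through y1 = x1 + x2, y2 = x1 - x2
N = 20   # truncation of the infinite support (the tail is negligible here)
p_Y = defaultdict(Fraction)
for x1 in range(N):
    for x2 in range(N):
        p_Y[(x1 + x2, x1 - x2)] += Fraction(1, 2) ** (x1 + x2 + 2)

# each allowed (y1, y2) should carry probability (1/2)^(y1 + 2),
# with y2 ranging by twos from -y1 to y1
ok = all(p == Fraction(1, 2) ** (y1 + 2) for (y1, y2), p in p_Y.items())
print(ok)            # True
print(p_Y[(3, 1)])   # 1/32, i.e. (1/2)^5
```

Because the map (x_1,x_2) ↦ (y_1,y_2) is one-to-one, each transformed probability is a single term, which is what makes the discrete case so simple.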

[Figure: the quarter disc x² + y² < a² in the first quadrant of the (x,y) plane.]

The limits of the x integral are determined by 0 < x and x² + y² < a², i.e., x < √(a² − y²); the range of y values represented can be seen from the figure to be 0 < y < a, so we can write the probability as

    P(X² + Y² < a²) = ∫_0^a ∫_0^{√(a²−y²)} (4/π) e^{−x²−y²} dx dy

but we can't really do the integral in this form. However, if we define random variables R = √(X² + Y²) and Φ = tan⁻¹(Y/X),¹ so that X = R cos Φ and Y = R sin Φ, we can write the probability as

    P(X² + Y² < a²) = P(R < a) = ∫_0^{π/2} ∫_0^a f_{R,Φ}(r,φ) dr dφ

if we have the transformed pdf f_{R,Φ}(r,φ). On the other hand, we know that we can write the volume element as dx dy = r dr dφ. We can get this either from geometry in this case, or more generally by differentiating the transformation

    x = r cos φ,   y = r sin φ

to get

    dx = cos φ dr − r sin φ dφ
    dy = sin φ dr + r cos φ dφ

and taking the determinant of the Jacobian matrix:

    det ∂(x,y)/∂(r,φ) = det [ cos φ  −r sin φ ; sin φ  r cos φ ] = r cos²φ + r sin²φ = r

so the volume element transforms like

    dx dy = |det ∂(x,y)/∂(r,φ)| dr dφ = r dr dφ

Even if we knew nothing about the transformation of random variables, we could use this to change variables in the integral above to get

    ∫_0^a ∫_0^{√(a²−y²)} (4/π) e^{−x²−y²} dx dy = ∫_0^{π/2} ∫_0^a (4/π) e^{−r²} r dr dφ

¹ Note that we can only get away with using the arctangent tan⁻¹(y/x) as an expression for φ because x and y are both positive. In general, we need to be careful; (x,y) = (−1,−1) corresponds to φ = −3π/4 even though tan⁻¹([−1]/[−1]) = tan⁻¹(1) = π/4 if we use the principal branch of the arctangent. For a general point in the (x,y) plane, we'd need to use the two-argument arctangent

    atan2(y,x) = { tan⁻¹(y/x), x > 0;  tan⁻¹(y/x) + π, x < 0 and y ≥ 0;  tan⁻¹(y/x) − π, x < 0 and y < 0;  π/2, x = 0 and y > 0;  −π/2, x = 0 and y < 0 }

to get the correct φ ∈ [−π, π).

If we compare the integrands of the two sides of this change of variables, we can see that the transformed pdf must be

    f_{R,Φ}(r,φ) = (4/π) r e^{−r²},   0 < r < ∞, 0 < φ < π/2;   0 otherwise

Incidentally, we can calculate the probability as

    P(R < a) = ∫_0^{π/2} ∫_0^a (4/π) e^{−r²} r dr dφ = ∫_0^a 2 e^{−r²} r dr = [−e^{−r²}]_0^a = 1 − e^{−a²}

To return to the general case, we see there are basically two things to worry about: one is the Jacobian determinant relating the volume elements in the two sets of variables, and the other is transforming the ranges of variables used to describe the event, as well as the allowed range of the variables. In general terms, if S is the support of the random variables X_1 and X_2, i.e., the smallest region of R² such that P[(X_1,X_2) ∈ S] = 1, and T is the support of Y_1 and Y_2, we need a transformation of the pdf f_{X_1,X_2}(x_1,x_2) defined on S such that

    P[(X_1,X_2) ∈ A] = ∬_A f_{X_1,X_2}(x_1,x_2) dx_1 dx_2 = ∬_B f_{Y_1,Y_2}(y_1,y_2) dy_1 dy_2 = P[(Y_1,Y_2) ∈ B]

where B is the image of A under the transformation, i.e., (x_1,x_2) ∈ A is equivalent to (u_1(x_1,x_2), u_2(x_1,x_2)) ∈ B. Since a change of variables in the integral gives us

    ∬_A f_{X_1,X_2}(x_1,x_2) dx_1 dx_2 = ∬_B f_{X_1,X_2}(w_1(y_1,y_2), w_2(y_1,y_2)) |det ∂(x_1,x_2)/∂(y_1,y_2)| dy_1 dy_2

we must have, in general,

    f_{Y_1,Y_2}(y_1,y_2) = |det ∂(x_1,x_2)/∂(y_1,y_2)| f_{X_1,X_2}(w_1(y_1,y_2), w_2(y_1,y_2)),   (y_1,y_2) ∈ T

which is the more careful way of writing the easier-to-remember formula we started with:

    d²P/(dy_1 dy_2) = |det ∂(x_1,x_2)/∂(y_1,y_2)| d²P/(dx_1 dx_2)

Tuesday 15 September 2015. Read Section 2.3 of Hogg.

2 Conditional Distributions

2.1 Conditional Probability

Recall the definition of conditional probability: for events C_1 and C_2, P(C_2|C_1) is the probability of C_2 given C_1. If we recall that P(C) is the fraction of repeated experiments in which C is true, we can think of P(C_2|C_1) as follows: restrict attention to those experiments in which C_1 is true, and take the fraction in which C_2 is also true. This conceptual definition leads to the
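The polar-coordinate result P(R < a) = 1 − e^{−a²} can be checked numerically by integrating the transformed pdf; the test value a = 1.3 and the grid size are arbitrary choices for this sketch:

```python
import math

# integrate f_{R,Phi}(r, phi) = (4/pi) r exp(-r^2) over
# 0 < phi < pi/2, 0 < r < a; the phi integral just contributes
# a factor of pi/2, leaving a 1-d midpoint rule in r
a = 1.3          # arbitrary test value
n = 100_000      # midpoint-rule grid size
h = a / n
p = (math.pi / 2) * sum(
    (4 / math.pi) * ((i + 0.5) * h) * math.exp(-((i + 0.5) * h) ** 2) * h
    for i in range(n)
)
print(abs(p - (1 - math.exp(-a * a))) < 1e-6)   # True
```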

mathematical definition

    P(C_2|C_1) = P(C_1 ∩ C_2) / P(C_1)

A consequence of this definition is the multiplication rule for probabilities,

    P(C_1 ∩ C_2) = P(C_2|C_1) P(C_1)

This means that the probability of C_1 and C_2 is the probability of C_1 times the probability of C_2 given C_1, which makes logical sense. In fact, since one often has easier access to conditional probabilities in the first place, you could start with the definition of P(C_2|C_1) as the probability of C_2 assuming C_1, and then use the multiplication rule as one of the basic tenets of probability. An extreme expression of this philosophy says that all probabilities are conditional probabilities, since you have to assume something about a model to calculate them. (See E. T. Jaynes, Probability Theory: The Logic of Science, for this approach.) One simple consequence of the multiplication rule is that we can write P(C_1 ∩ C_2) two different ways:

    P(C_1|C_2) P(C_2) = P(C_1 ∩ C_2) = P(C_2|C_1) P(C_1)

dividing by P(C_2) gives us Bayes's theorem

    P(C_1|C_2) = P(C_2|C_1) P(C_1) / P(C_2)

which is useful if you want to calculate conditional probabilities with one condition when you know them with another condition.

2.2 Conditional Probability Distributions

Given a pair of discrete random variables X_1 and X_2 with joint pmf p_{X_1,X_2}(x_1,x_2), we can define in a straightforward way the conditional probability that X_2 takes on a value given a value for X_1:

    p_{X_2|X_1}(x_2|x_1) = P(X_2 = x_2 | X_1 = x_1) = p_{X_1,X_2}(x_1,x_2) / p_{X_1}(x_1)

where we've used the marginal pmf

    p_{X_1}(x_1) = Σ_{x_2} p_{X_1,X_2}(x_1,x_2)

We often write p_{2|1}(x_2|x_1) as a shorthand for p_{X_2|X_1}(x_2|x_1). Note that conditional probability distributions are normalized just like ordinary ones:

    Σ_{x_2} p_{2|1}(x_2|x_1) = Σ_{x_2} p(x_1,x_2) / p_1(x_1) = p_1(x_1) / p_1(x_1) = 1

If we have a pair of continuous random variables with joint pdf f_{X_1,X_2}(x_1,x_2), we'd like to similarly define a conditional pdf by conditioning on the event X_1 = x_1. But there's a problem: since X_1 is a continuous random variable, P(X_1 = x_1) = 0, which means we can't divide by it. So instead, we have to define it as a limit, conditioning on the event x_1 − ξ_1 < X_1 ≤ x_1:

    f_{2|1}(x_2|x_1) = lim_{ξ_1→0, ξ_2→0} P(x_2 − ξ_2 < X_2 ≤ x_2 | x_1 − ξ_1 < X_1 ≤ x_1) / ξ_2 = f(x_1,x_2) / f_1(x_1)

where

    f_1(x_1) = ∫_{−∞}^∞ f(x_1,x_2) dx_2

is the marginal pdf. Again, the conditional pdf is properly normalized:

    ∫_{−∞}^∞ f_{2|1}(x_2|x_1) dx_2 = ( ∫_{−∞}^∞ f(x_1,x_2) dx_2 ) / f_1(x_1) = f_1(x_1) / f_1(x_1) = 1

Note that f_{2|1}(x_2|x_1) is a density in x_2, not in x_1. This is also important in the continuous equivalent of Bayes's theorem:

    f_{1|2}(x_1|x_2) = f_{2|1}(x_2|x_1) f_1(x_1) / f_2(x_2)

2.2.1 Example

Consider continuous random variables X_1 and X_2 with joint pdf

    f(x_1,x_2) = 6 x_2,   0 < x_2 < x_1 < 1

The marginal pdf for x_1 is

    f_1(x_1) = ∫_0^{x_1} 6 x_2 dx_2 = 3 x_1²,   0 < x_1 < 1

so the conditional pdf is

    f_{2|1}(x_2|x_1) = f(x_1,x_2) / f_1(x_1) = 2 x_2 / x_1²,   0 < x_2 < x_1 < 1

which we can see is normalized:

    ∫_0^{x_1} (2 x_2 / x_1²) dx_2 = x_1² / x_1² = 1

2.3 Conditional Expectations

Since conditional pdfs or pmfs are just like regular probability distributions, you can also use them to define expectation values. For discrete random variables X_1 and X_2 we can define

    E(u(X_2) | X_1 = x_1) = E(u(X_2)|x_1) = Σ_{x_2} u(x_2) p_{2|1}(x_2|x_1)   if   Σ_{x_2} |u(x_2)| p_{2|1}(x_2|x_1) < ∞

and for continuous:

    E(u(X_2)|x_1) = ∫_{−∞}^∞ u(x_2) f_{2|1}(x_2|x_1) dx_2   if   ∫_{−∞}^∞ |u(x_2)| f_{2|1}(x_2|x_1) dx_2 < ∞

This is still a linear operation, so

    E(k_1 u_1(X_2) + k_2 u_2(X_2) | x_1) = k_1 E(u_1(X_2)|x_1) + k_2 E(u_2(X_2)|x_1)

We can define a conditional variance by analogy to the usual variance:

    Var(X_2|x_1) = E{ [X_2 − E(X_2|x_1)]² | x_1 }

and since the conditional expectation value is linear, we have the usual shortcut

    Var(X_2|x_1) = E(X_2²|x_1) − [E(X_2|x_1)]²

Returning to our example, in which f_{2|1}(x_2|x_1) = 2 x_2/x_1² for 0 < x_2 < x_1 < 1, we have

    E(X_2|x_1) = ∫_0^{x_1} x_2 (2 x_2/x_1²) dx_2 = (2/3) x_1

and

    E(X_2²|x_1) = ∫_0^{x_1} x_2² (2 x_2/x_1²) dx_2 = x_1²/2

so

    Var(X_2|x_1) = x_1²/2 − [(2/3) x_1]² = x_1²/18

Note that E(X_2|x_1) is a function of x_1 and not a random variable. But we can insert the random variable X_1 into that function, and define a random variable E(X_2|X_1) which is equal to E(X_2|x_1) when X_1 = x_1. This random variable can also be written

    E(X_2|X_1) = ∫_{−∞}^∞ x_2 f_{2|1}(x_2|X_1) dx_2

Note that

    E[E(X_2|X_1)] = ∫_{−∞}^∞ E(X_2|x_1) f_1(x_1) dx_1 = ∬ x_2 f_{2|1}(x_2|x_1) f_1(x_1) dx_2 dx_1 = ∬ x_2 f(x_1,x_2) dx_2 dx_1 = E[X_2]

So E(X_2|X_1) is an estimator of E(X_2). It can be shown that

    Var[E(X_2|X_1)] ≤ Var(X_2)

so E(X_2|X_1) is potentially a better estimator of the mean E(X_2) than X_2 itself is. (This isn't exactly a practical procedure, though, since to evaluate the function E(X_2|x_1) you need the conditional probability density f_{2|1}(x_2|x_1) for all possible x_2.) In our specific example, since E(X_2|x_1) = (2/3) x_1, E(X_2|X_1) = (2/3) X_1. We can work out

    E[E(X_2|X_1)] = E[(2/3) X_1] = ∫_0^1 (2/3) x_1 f_1(x_1) dx_1 = ∫_0^1 (2/3) x_1 (3 x_1²) dx_1 = 1/2

and

    E[{E(X_2|X_1)}²] = E[(4/9) X_1²] = ∫_0^1 (4/9) x_1² f_1(x_1) dx_1 = ∫_0^1 (4/9) x_1² (3 x_1²) dx_1 = 4/15

so that

    Var[E(X_2|X_1)] = 4/15 − 1/4 = 16/60 − 15/60 = 1/60

To get E(X_2) and Var(X_2) we need the marginal pdf

    f_2(x_2) = ∫_{x_2}^1 6 x_2 dx_1 = 6 x_2 (1 − x_2),   0 < x_2 < 1

from which we calculate

    E(X_2) = ∫_0^1 x_2 · 6 x_2 (1 − x_2) dx_2 = 6 (1/3 − 1/4) = 1/2

and

    E(X_2²) = ∫_0^1 x_2² · 6 x_2 (1 − x_2) dx_2 = 6 (1/4 − 1/5) = 3/10

so

    Var(X_2) = 3/10 − (1/2)² = 1/20

from which we can verify that in this case

    E[E(X_2|X_1)] = 1/2 = E(X_2)

and

    Var[E(X_2|X_1)] = 1/60 ≤ 1/20 = Var(X_2)
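The numbers in this example can be checked by direct numerical integration; this sketch uses midpoint rules with arbitrary grid sizes to verify E(X_2) = 1/2, Var(X_2) = 1/20, and Var[E(X_2|X_1)] = 1/60:

```python
# f(x1, x2) = 6 x2 on the triangle 0 < x2 < x1 < 1
n = 1000              # midpoint-rule grid size (arbitrary)
h = 1.0 / n
mids = [(i + 0.5) * h for i in range(n)]

# E(X2) and E(X2^2) from the joint pdf
E2 = E2sq = 0.0
for x1 in mids:
    for x2 in mids:
        if x2 < x1:
            w = 6 * x2 * h * h
            E2 += x2 * w
            E2sq += x2 * x2 * w
var_X2 = E2sq - E2 ** 2

# Var[E(X2|X1)] using E(X2|X1) = (2/3) X1 and the marginal f1(x1) = 3 x1^2
m1 = sum((2 / 3) * x1 * 3 * x1 ** 2 * h for x1 in mids)
m2 = sum(((2 / 3) * x1) ** 2 * 3 * x1 ** 2 * h for x1 in mids)
var_cond_mean = m2 - m1 ** 2

print(abs(E2 - 1 / 2) < 0.01)                 # True
print(abs(var_X2 - 1 / 20) < 0.005)           # True
print(abs(var_cond_mean - 1 / 60) < 0.001)    # True
```

The tolerances are loose because the triangular support cuts across the grid cells; refining the grid shrinks the error.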

Thursday 17 September 2015. Read Sections 2.4-2.5 of Hogg.

3 Independence

Recall the conditional distribution for discrete rvs X_1 and X_2,

    p_{2|1}(x_2|x_1) = P(X_2 = x_2 | X_1 = x_1) = P([X_1 = x_1] ∩ [X_2 = x_2]) / P(X_1 = x_1) = p(x_1,x_2) / p_1(x_1)

or for continuous rvs X_1 and X_2,

    f_{2|1}(x_2|x_1) = f(x_1,x_2) / f_1(x_1)

Consider the example

    f(x_1,x_2) = 6 x_1 x_2²,   0 < x_1 < 1, 0 < x_2 < 1

The marginal pdf for X_1 is

    f_1(x_1) = ∫_0^1 6 x_1 x_2² dx_2 = 2 x_1

which makes the conditional pdf

    f_{2|1}(x_2|x_1) = 6 x_1 x_2² / (2 x_1) = 3 x_2²,   0 < x_1 < 1, 0 < x_2 < 1

Note that in this case f_{2|1}(x_2|x_1) doesn't actually depend on x_1, as long as x_1 is in the support of the random variable X_1. This situation is called independence. In fact, it's easy to show that in this situation f_{2|1}(x_2|x_1) = f_2(x_2), i.e., the conditional pdf for X_2 given any possible value of X_1 is the marginal pdf for X_2:

    f_2(x_2) = ∫_{−∞}^∞ f(x_1,x_2) dx_1 = ∫_{−∞}^∞ f_{2|1}(x_2|x_1) f_1(x_1) dx_1 = f_{2|1}(x_2|x_1) ∫_{−∞}^∞ f_1(x_1) dx_1 = f_{2|1}(x_2|x_1)   (X_1 & X_2 indep)

where we can pull f_{2|1}(x_2|x_1) out of the x_1 integral because it doesn't actually depend on x_1, and we use the fact that the marginal pdf f_1(x_1) is normalized. We thus have the definition

    (X_1 & X_2 independent) ⟺ ( f_{2|1}(x_2|x_1) = f_2(x_2) for all (x_1,x_2) ∈ S )

This is not the most symmetric definition, and it's not immediately obvious that f_{2|1}(x_2|x_1) = f_2(x_2) implies f_{1|2}(x_1|x_2) = f_1(x_1). But it does, because of the following result:

    (X_1 & X_2 independent) iff f(x_1,x_2) = f_1(x_1) f_2(x_2) for all (x_1,x_2)

(We don't need to specify (x_1,x_2) ∈ S because we can think of f_1(x_1) and f_2(x_2) as being equal to zero if their arguments are outside their respective support spaces.) It's easy enough to demonstrate this. If we assume f(x_1,x_2) = f_1(x_1) f_2(x_2), then f_{2|1}(x_2|x_1) = f_1(x_1) f_2(x_2) / f_1(x_1) = f_2(x_2) as long as f_1(x_1) ≠ 0. Conversely, if we assume f_{2|1}(x_2|x_1) = f_2(x_2), then f(x_1,x_2) = f_{2|1}(x_2|x_1) f_1(x_1) = f_1(x_1) f_2(x_2). If X_1 and X_2 are not independent, we call them dependent random variables. We can consider a couple of examples of dependent rvs:

3.1 Dependent rv example #1

First, let

    f(x_1,x_2) = x_1 + x_2,   0 < x_1 < 1, 0 < x_2 < 1;   0 otherwise

Then

    f_1(x_1) = ∫_0^1 (x_1 + x_2) dx_2 = x_1 + 1/2,   0 < x_1 < 1

and

    f_{2|1}(x_2|x_1) = (x_1 + x_2) / (x_1 + 1/2),   0 < x_1 < 1, 0 < x_2 < 1

which does depend on x_1, so X_1 and X_2 are dependent.

3.2 Dependent rv example #2

Second, return to our example from Tuesday, where

    f(x_1,x_2) = 6 x_2,   0 < x_2 < x_1 < 1

and we saw

    f_{2|1}(x_2|x_1) = 2 x_2 / x_1²,   0 < x_2 < x_1 < 1

again, this depends on x_1, so X_1 and X_2 are dependent.

3.3 Factoring the joint pdf

We don't have to calculate the conditional pdf to tell whether random variables are dependent or independent. We can show that

    (X_1 & X_2 independent) iff f(x_1,x_2) = g(x_1) h(x_2) for all (x_1,x_2)

for some functions g and h. The "only if" part is trivial; choose g(x_1) = f_1(x_1) and h(x_2) = f_2(x_2). We can show the "if" part by assuming a factored form and working out

    f_1(x_1) = ∫_{−∞}^∞ g(x_1) h(x_2) dx_2 = g(x_1) ∫_{−∞}^∞ h(x_2) dx_2

The integral ∫ h(x_2) dx_2 is just a constant, which we can call c, so we have g(x_1) = f_1(x_1)/c and f(x_1,x_2) = f_1(x_1) h(x_2)/c. Then we take

    f_2(x_2) = (h(x_2)/c) ∫_{−∞}^∞ f_1(x_1) dx_1 = h(x_2)/c

which means that indeed f(x_1,x_2) = f_1(x_1) f_2(x_2). Our two examples show ways in which the joint pdf can fail to factor. In the first example, x_1 + x_2 can obviously not be written as a product g(x_1) h(x_2). In the second example, it's a little trickier, since it seems like we could write 6 x_2 = (1)(6 x_2). But the problem is the support. If we took, for example,

    g(x_1) = 1 for 0 < x_1 < 1;   0 otherwise

and

    h(x_2) = 6 x_2 for 0 < x_2 < 1;   0 otherwise

we'd end up with

    g(x_1) h(x_2) = 6 x_2 for 0 < x_1 < 1, 0 < x_2 < 1;   0 otherwise

which is not the same as the f(x_1,x_2) given above, whose support is the triangle 0 < x_2 < x_1 < 1. In general, for the factorization to work, the support of X_1 and X_2 has to be a product space, i.e., the intersection of a range of possible x_1 values with no reference to x_2 and a range of possible x_2 values with no reference to x_1. Some examples of product spaces are

    0 < x_1 < 1, 0 < x_2 < 1
    −1 < x_1 < 1, 0 < x_2 < 2
    0 < x_1 < ∞, 0 < x_2 < ∞
    −∞ < x_1 < ∞, 0 < x_2 < 1

some examples of non-product spaces are

    0 < x_2 < x_1 < 1
    0 < x_1 < x_2 < ∞
    x_1 + x_2 < 1
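The role of the support can be made concrete by comparing the two joint pdfs above with the products of their marginals at an arbitrary test point (0.8, 0.5); the triangle example fails to factor while the product-space example factors exactly:

```python
def tri_f(x1, x2):            # joint pdf 6 x2 on the triangle 0 < x2 < x1 < 1
    return 6 * x2 if 0 < x2 < x1 < 1 else 0.0

def tri_f1(x1): return 3 * x1 ** 2            # its marginal for x1
def tri_f2(x2): return 6 * x2 * (1 - x2)      # its marginal for x2

x1, x2 = 0.8, 0.5   # arbitrary test point inside the triangle
print(tri_f(x1, x2))                # 3.0
print(tri_f1(x1) * tri_f2(x2))      # ~2.88: product of marginals != joint

def sq_f(x1, x2):             # joint pdf 6 x1 x2^2 on the unit square
    return 6 * x1 * x2 ** 2 if 0 < x1 < 1 and 0 < x2 < 1 else 0.0

# on the product-space support the joint factors into 2 x1 times 3 x2^2
print(abs(sq_f(x1, x2) - (2 * x1) * (3 * x2 ** 2)) < 1e-12)   # True
```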

3.4 Expectation Values

Finally we consider an important result related to expectation values. Let X_1 and X_2 be independent random variables. Then the expectation value of the product of a function of each random variable is the product of their expectation values:

    E[u_1(X_1) u_2(X_2)] = ∬ u_1(x_1) u_2(x_2) f_1(x_1) f_2(x_2) dx_1 dx_2 = ( ∫ u_1(x_1) f_1(x_1) dx_1 ) ( ∫ u_2(x_2) f_2(x_2) dx_2 ) = E[u_1(X_1)] E[u_2(X_2)]   (X_1 & X_2 indep)

In particular, the joint mgf is such an expectation value, so

    M(t_1,t_2) = E(e^{t_1 X_1 + t_2 X_2}) = E(e^{t_1 X_1}) E(e^{t_2 X_2}) = M_1(t_1) M_2(t_2) = M(t_1,0) M(0,t_2)   (X_1 & X_2 indep)

It takes a little more work, but you can also prove the converse (see Hogg for details), so

    (X_1 & X_2 independent) iff M(t_1,t_2) = M(t_1,0) M(0,t_2)

You showed on the homework that in a particular case M(t_1,0) M(0,t_2) ≠ M(t_1,t_2); in that case the random variables were dependent because their support was not a product space.

3.5 Covariance and Correlation

Recall the definitions of the means

    µ_X = E(X)   and   µ_Y = E(Y)

and variances

    σ_X² = Var(X) = E([X − µ_X]²)
    σ_Y² = Var(Y) = E([Y − µ_Y]²)

We can define the covariance

    Cov(X,Y) = E([X − µ_X][Y − µ_Y])

Dimensionally, µ_X and σ_X have units of X, µ_Y and σ_Y have units of Y, and the covariance Cov(X,Y) has units of XY. It's useful to define a dimensionless quantity called the Correlation Coëfficient:

    ρ = Cov(X,Y) / (σ_X σ_Y)

On the homework you will show that −1 ≤ ρ ≤ 1. One important result about independent random variables is that they are uncorrelated. If X and Y are independent, then

    Cov(X,Y) = E([X − µ_X][Y − µ_Y]) = E(X − µ_X) E(Y − µ_Y) = (µ_X − µ_X)(µ_Y − µ_Y) = 0   (X & Y indep)

On the other hand, the converse is not true: it is still possible for the covariance of dependent variables to be zero.

Tuesday 22 September 2015. Read Section 2.6 of Hogg.

4 Generalization to Several RVs

Note: this week's material will not be included on Prelim Exam One.
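As a concrete exercise with the covariance definition from the previous section, we can compute Cov(X,Y) for the coin-flip variables of Section 1.1.1 from their joint pmf. The value −11/16 is computed here, not quoted from the notes; it is negative, which matches the intuition that more heads means fewer tails before the first head:

```python
from fractions import Fraction
from itertools import product

# rebuild the joint pmf of X = # heads, Y = # tails before the first head
pmf = {}
for flips in product("HT", repeat=3):
    x = flips.count("H")
    y = flips.index("H") if "H" in flips else 3
    pmf[(x, y)] = pmf.get((x, y), Fraction(0)) + Fraction(1, 8)

EX = sum(x * p for (x, y), p in pmf.items())
EY = sum(y * p for (x, y), p in pmf.items())
EXY = sum(x * y * p for (x, y), p in pmf.items())
cov = EXY - EX * EY   # Cov(X,Y) = E(XY) - E(X)E(Y)
print(EX, EY, cov)    # 3/2 7/8 -11/16
```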

4.1 Linear Algebra: Reminders and Notation

If $A$ is an $m\times n$ matrix:
$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & & \vdots \\ A_{m1} & A_{m2} & \cdots & A_{mn} \end{pmatrix} \tag{4.1}$$
and $B$ is an $n\times p$ matrix,
$$B = \begin{pmatrix} B_{11} & B_{12} & \cdots & B_{1p} \\ B_{21} & B_{22} & \cdots & B_{2p} \\ \vdots & \vdots & & \vdots \\ B_{n1} & B_{n2} & \cdots & B_{np} \end{pmatrix} \tag{4.2}$$
then their product $C = AB$ is an $m\times p$ matrix as shown in Figure 1, so that $C_{ik} = \sum_{j=1}^{n} A_{ij}B_{jk}$.

If $A$ is an $m\times n$ matrix, $B = A^T$ is an $n\times m$ matrix with elements $B_{ij} = A_{ji}$:
$$B = \begin{pmatrix} B_{11} & B_{12} & \cdots & B_{1m} \\ B_{21} & B_{22} & \cdots & B_{2m} \\ \vdots & \vdots & & \vdots \\ B_{n1} & B_{n2} & \cdots & B_{nm} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{21} & \cdots & A_{m1} \\ A_{12} & A_{22} & \cdots & A_{m2} \\ \vdots & \vdots & & \vdots \\ A_{1n} & A_{2n} & \cdots & A_{mn} \end{pmatrix} = A^T \tag{4.4}$$

If $v$ is an $n$-element column vector (which is an $n\times 1$ matrix) and $A$ is an $m\times n$ matrix, $w = Av$ is an $m$-element column vector (i.e., an $m\times 1$ matrix):
$$w = \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{pmatrix} = Av = \begin{pmatrix} A_{11}v_1 + A_{12}v_2 + \cdots + A_{1n}v_n \\ A_{21}v_1 + A_{22}v_2 + \cdots + A_{2n}v_n \\ \vdots \\ A_{m1}v_1 + A_{m2}v_2 + \cdots + A_{mn}v_n \end{pmatrix} \tag{4.5}$$
so that $w_i = \sum_{j=1}^{n} A_{ij}v_j$.

If $u$ is an $n$-element column vector, then $u^T$ is an $n$-element row vector (a $1\times n$ matrix):
$$u^T = \begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix} \tag{4.6}$$

If $u$ and $v$ are $n$-element column vectors, $u^Tv$ is a number, known as the inner product:
$$u^Tv = \begin{pmatrix} u_1 & u_2 & \cdots & u_n \end{pmatrix}\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{pmatrix} = u_1v_1 + u_2v_2 + \cdots + u_nv_n = \sum_{i=1}^{n} u_iv_i \tag{4.7}$$

If $v$ is an $m$-element column vector, and $w$ is an $n$-element

$$C = AB = \begin{pmatrix} A_{11}B_{11} + A_{12}B_{21} + \cdots + A_{1n}B_{n1} & \cdots & A_{11}B_{1p} + A_{12}B_{2p} + \cdots + A_{1n}B_{np} \\ A_{21}B_{11} + A_{22}B_{21} + \cdots + A_{2n}B_{n1} & \cdots & A_{21}B_{1p} + A_{22}B_{2p} + \cdots + A_{2n}B_{np} \\ \vdots & & \vdots \\ A_{m1}B_{11} + A_{m2}B_{21} + \cdots + A_{mn}B_{n1} & \cdots & A_{m1}B_{1p} + A_{m2}B_{2p} + \cdots + A_{mn}B_{np} \end{pmatrix} \tag{4.3}$$
Figure 1: Expansion of the product $C = AB$ to show $C_{ik} = \sum_{j=1}^{n} A_{ij}B_{jk}$

column vector, $A = vw^T$ is an $m\times n$ matrix:
$$A = vw^T = \begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_m \end{pmatrix}\begin{pmatrix} w_1 & w_2 & \cdots & w_n \end{pmatrix} = \begin{pmatrix} v_1w_1 & v_1w_2 & \cdots & v_1w_n \\ v_2w_1 & v_2w_2 & \cdots & v_2w_n \\ \vdots & \vdots & & \vdots \\ v_mw_1 & v_mw_2 & \cdots & v_mw_n \end{pmatrix} \tag{4.8}$$
so that $A_{ij} = v_iw_j$. This is known as the outer product.

If $M$ and $N$ are $n\times n$ matrices, the determinant of their product is $\det(MN) = \det(M)\det(N)$.

If $M$ is an $n\times n$ matrix (known as a square matrix), the inverse matrix $M^{-1}$ is defined by
$$M^{-1}M = 1_{n\times n} = MM^{-1} \qquad\text{where}\qquad 1_{n\times n} = \begin{pmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{pmatrix} \text{ is the identity matrix} \tag{4.9}$$
If $M^{-1}$ exists, we say $M$ is invertible.

If $M$ is a real, symmetric $n\times n$ matrix, so that $M^T = M$, i.e., $M_{ji} = M_{ij}$, there is a set of $n$ orthonormal eigenvectors $\{v_1, v_2, \ldots, v_n\}$ with real eigenvalues $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$, so that $Mv_i = \lambda_i v_i$. Orthonormal means
$$v_i^Tv_j = \delta_{ij} = \begin{cases} 0 & i\neq j \\ 1 & i=j \end{cases} \tag{4.10}$$
where we have introduced the Kronecker delta symbol $\delta_{ij}$. The eigenvalue decomposition means
$$M = \sum_{i=1}^{n} \lambda_i v_i v_i^T \tag{4.11}$$
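The eigenvalue decomposition is easy to verify numerically. The sketch below (the particular $2\times 2$ matrix is my own toy example, not from the notes) reconstructs a symmetric matrix from its eigenvalues and orthonormal eigenvectors via $M = \sum_i \lambda_i v_i v_i^T$, and builds its inverse the same way with $1/\lambda_i$ in place of $\lambda_i$:

```python
import math

# M = [[2, 1], [1, 2]] is symmetric, with eigenvalues 3 and 1 and
# orthonormal eigenvectors (1,1)/sqrt(2) and (1,-1)/sqrt(2).
s = 1.0 / math.sqrt(2.0)
eigs = [(3.0, (s, s)), (1.0, (s, -s))]

def outer(v, w, scale=1.0):
    # scale * v w^T as a nested list
    return [[scale * vi * wj for wj in w] for vi in v]

def madd(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

M    = madd(*[outer(v, v, lam)       for lam, v in eigs])   # sum λ_i v_i v_i^T
Minv = madd(*[outer(v, v, 1.0 / lam) for lam, v in eigs])   # sum (1/λ_i) v_i v_i^T
I    = matmul(M, Minv)
print(M)  # ≈ [[2, 1], [1, 2]]
print(I)  # ≈ 2x2 identity
```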

The determinant is det(m) n i1 λ i If none of the eigenvalues λ i } are zero, M is invertible, and the inverse matrix is M 1 n i1 1 λ i v i v T i (41) If all of the eigenvalues λ i } are positive, we say M is positive definite If none of the eigenvalues λ i } are negative, we say M is positive semi-definite Note that these conditions are equivalent to the more common definition: M is positive definite if v T Mv > for any non-zero n-element column vector v and positive semi-definite if v T Mv for any n-element column vector v 4 Multivariate Probability Distributions Many of the definitions which we considered in detail for the case of two random variables generalize in a straightforward way to the case of n random variables It s notationally convenient to use the concept of a random vector X as in (11), so for example, the joint cdf is F X (x) P ([X 1 x 1 ] [X x ] [X n x n ]) (413) In the discrete case, the joint pmf is p X (x) P ([X 1 x 1 ] [X x ] [X n x n ]) (414) while in the continuous case the joint pdf is f X (x) n x 1 x n F X (x) lim ξ1 ξ n P ([x 1 ξ 1 < X 1 x 1 ] [x n ξ n < X n x n ]) ξ 1 ξ n (415) The probability for the random vector X to lie in some region A R n is P (X A) f X (x) dx 1 dx n (416) 41 Marginal and Conditional Distributions A One place where the multivariate case can be more complicated than the bivariate one is marginalization Given a joint distribution n random variables, you can in principle marginalize over m of them, and be left with a marginal distribution for the remaining n m variables To give a concrete example, consider three random variables X i i 1,, 3} with joint pdf 6 < x 1 < x < x 3 < 1 f X (x) (417) otherwise We can marginalize over X 3, which gives us a pdf for X 1 and X (ie, still a joint pdf); f X1 X (x 1, x ) f X1 X X 3 (x 1, x, x 3 ) dx 3 6(1 x ) < x 1 < x < 1 otherwise 1 x 6 dx 3 by similar calculations, we can marginalize over X to get x3 (418) f X1 X 3 (x 1, x 3 ) f X1 X X 3 (x 1, x, x 3 ) dx 6 dx x 1 6(x 3 x 1 ) < x 
1 < x 3 < 1 otherwise (419) 18
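As a sanity check (the quadrature step and evaluation points are my own choices), the marginalization over $x_2$ can be done numerically and compared against the closed form $6(x_3-x_1)$:

```python
# Check f_{X1 X3}(x1, x3) = 6*(x3 - x1) on 0 < x1 < x3 < 1, obtained from
# the joint pdf f(x) = 6 on 0 < x1 < x2 < x3 < 1 by integrating out x2
# with a midpoint rule.

def joint(x1, x2, x3):
    return 6.0 if 0.0 < x1 < x2 < x3 < 1.0 else 0.0

def marginal_x1x3(x1, x3, n=20000):
    h = 1.0 / n
    return sum(joint(x1, (i + 0.5) * h, x3) for i in range(n)) * h

pairs = [(0.2, 0.7), (0.1, 0.9), (0.4, 0.5)]
results = [(marginal_x1x3(a, b), 6.0 * (b - a)) for a, b in pairs]
print(results)  # numeric and exact values agree at each point
```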

or over $X_1$ to get
$$f_{X_2X_3}(x_2,x_3) = \int_{-\infty}^{\infty} f_{X_1X_2X_3}(x_1,x_2,x_3)\,dx_1 = \int_{0}^{x_2} 6\,dx_1 = \begin{cases} 6x_2 & 0<x_2<x_3<1 \\ 0 & \text{otherwise} \end{cases} \tag{4.20}$$
If we want the marginal distribution for $X_1$, we marginalize over both $X_2$ and $X_3$:
$$f_{X_1}(x_1) = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f_{X_1X_2X_3}(x_1,x_2,x_3)\,dx_2\,dx_3 = \int_{-\infty}^{\infty} f_{X_1X_2}(x_1,x_2)\,dx_2 = \int_{x_1}^{1} 6(1-x_2)\,dx_2 = \Bigl[-3(1-x_2)^2\Bigr]_{x_2=x_1}^{x_2=1} = 3(1-x_1)^2, \quad 0<x_1<1 \tag{4.21}$$
Of course, if we marginalize in the other order, we get the same result:
$$f_{X_1}(x_1) = \int_{-\infty}^{\infty} f_{X_1X_3}(x_1,x_3)\,dx_3 = \int_{x_1}^{1} 6(x_3-x_1)\,dx_3 = \Bigl[3(x_3-x_1)^2\Bigr]_{x_3=x_1}^{x_3=1} = 3(1-x_1)^2, \quad 0<x_1<1 \tag{4.22}$$
Similarly, we can marginalize over $X_1$ and $X_3$ to get
$$f_{X_2}(x_2) = \int_{x_2}^{1} 6x_2\,dx_3 = 6x_2(1-x_2), \quad 0<x_2<1 \tag{4.23}$$
and
$$f_{X_3}(x_3) = \int_{0}^{x_3} 6x_2\,dx_2 = 3x_3^2, \quad 0<x_3<1 \tag{4.24}$$
The various joint and marginal distributions can be combined into conditional distributions such as
$$f_{13|2}(x_1,x_3|x_2) = \frac{f_{123}(x_1,x_2,x_3)}{f_2(x_2)} = \frac{1}{x_2(1-x_2)}, \quad 0<x_1<x_2<x_3<1 \tag{4.25}$$
and
$$f_{1|23}(x_1|x_2,x_3) = \frac{f_{123}(x_1,x_2,x_3)}{f_{23}(x_2,x_3)} = \frac{1}{x_2}, \quad 0<x_1<x_2<x_3<1 \tag{4.26}$$
and
$$f_{1|2}(x_1|x_2) = \frac{f_{12}(x_1,x_2)}{f_2(x_2)} = \frac{1}{x_2}, \quad 0<x_1<x_2<1 \tag{4.27}$$

4.2.2 Mutual vs Pairwise Independence

The notion of independence carries over to multivariate distributions as well. We say that a set of $n$ random variables $X_1,\ldots,X_n$ are mutually independent if we can write
$$f(x_1,x_2,\ldots,x_n) = f_1(x_1)\,f_2(x_2)\cdots f_n(x_n) \tag{4.28}$$
for all $x$.³ You can show straightforwardly that this implies $f_{ij}(x_i,x_j) = f_i(x_i)\,f_j(x_j)$ for each $i$ and $j$, i.e., any pair of the random variables is independent. This is known as pairwise independence. However, pairwise independence doesn't imply mutual independence of the whole set. There is a simple counterexample from S. N. Bernstein.⁴ Suppose you have a tetrahedron (fair four-sided die) on which one face is painted all black,

³Except possibly at some points whose total probability is zero.
⁴Hogg doesn't actually give the Bernstein reference, just says this is attributed to him. The original reference is to a book from 1946 which is in Russian. But there is a more recent summary in Stȩpniak, The College Mathematics Journal 38, 140-142 (2007), which is available at http://www.jstor.org/stable/27646452 if you're on campus and http://www.jstor.org.ezproxy.rit.edu/stable/27646452 if you're not.
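The conditional densities above are just ratios of the joint and marginal pdfs, which is easy to check in exact arithmetic (a sketch; the particular evaluation points are my own choices):

```python
from fractions import Fraction

# Check f_{1|2}(x1 | x2) = f_{X1 X2}(x1, x2) / f_{X2}(x2) = 1/x2
# on 0 < x1 < x2 < 1, using the closed-form marginals derived above.

def f12(x1, x2):            # 6*(1 - x2) on 0 < x1 < x2 < 1
    return 6 * (1 - x2) if 0 < x1 < x2 < 1 else Fraction(0)

def f2(x2):                 # 6*x2*(1 - x2) on 0 < x2 < 1
    return 6 * x2 * (1 - x2)

x2 = Fraction(2, 3)
checks = [f12(x1, x2) / f2(x2) == 1 / x2
          for x1 in (Fraction(1, 10), Fraction(1, 3), Fraction(3, 5))]
print(checks)  # [True, True, True]
```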

one all blue, one all red, and one painted with all three colors. Throw it and note the color or colors of the face it lands on. Define $X_1$ to be 1 if the face landed on is at least partly black, 0 if it isn't; $X_2$ to be 1 if the face is at least partly blue and 0 if not; and $X_3$ to be 1 if the face is part or all red, 0 if not. Any two of these random variables are independent. For example, of the four faces, one is all black, one is all blue, one is both black and blue, and one is neither black nor blue, so the joint pmf for $X_1$ and $X_2$ is

  p_12(x_1,x_2)   x_2 = 0   x_2 = 1   p_1(x_1)
  x_1 = 0           1/4       1/4       1/2
  x_1 = 1           1/4       1/4       1/2
  p_2(x_2)          1/2       1/2

On the other hand, it is clear that $p_{123}(x_1,x_2,x_3) \neq p_1(x_1)\,p_2(x_2)\,p_3(x_3)$, since the right-hand side would be 1/8 for each combination of $x_1\in\{0,1\}$, $x_2\in\{0,1\}$, $x_3\in\{0,1\}$, rather than the true pmf, which is
$$p_{123}(x_1,x_2,x_3) = \begin{cases} 1/4 & x_1=0,\ x_2=0,\ x_3=1 \\ 1/4 & x_1=0,\ x_2=1,\ x_3=0 \\ 1/4 & x_1=1,\ x_2=0,\ x_3=0 \\ 1/4 & x_1=1,\ x_2=1,\ x_3=1 \\ 0 & \text{otherwise} \end{cases} \tag{4.29}$$

Thursday 24 September 2015
Read Sections 2.7-2.8 of Hogg
Note: this week's material will not be included on Prelim Exam One

4.3 The Variance-Covariance Matrix

Recall that the covariance of random variables $X_i$ and $X_j$ with means $\mu_i = E(X_i)$ and $\mu_j = E(X_j)$ is
$$\operatorname{Cov}(X_i,X_j) = E\bigl([X_i-\mu_i][X_j-\mu_j]\bigr) \tag{4.30}$$
In particular, the covariance of one of the rvs with itself is its variance:
$$\operatorname{Cov}(X_i,X_i) = E\bigl([X_i-\mu_i]^2\bigr) = \operatorname{Var}(X_i) \tag{4.31}$$
We can define a matrix $\Sigma$ whose elements⁵
$$\sigma_{ij} = \operatorname{Cov}(X_i,X_j) \tag{4.32}$$
are the covariances between the various rvs, with the diagonal elements being equal to the variances. We can write this in matrix notation by first defining
$$\mu = E(X) \tag{4.33}$$
i.e., a column vector whose $i$th element is $\mu_i = E(X_i)$. Then $X-\mu$ is a column vector whose $i$th element is $X_i-\mu_i$. Looking at (4.30) and (4.32), we see that the elements of the matrix $\Sigma$ are the expectation values of the elements of the matrix whose $i,j$ element is $[X_i-\mu_i][X_j-\mu_j]$. This matrix is
$$[X-\mu][X-\mu]^T = \begin{pmatrix} X_1-\mu_1 \\ X_2-\mu_2 \\ \vdots \\ X_n-\mu_n \end{pmatrix}\begin{pmatrix} X_1-\mu_1 & X_2-\mu_2 & \cdots & X_n-\mu_n \end{pmatrix} \tag{4.34}$$

⁵This is a really unfortunate notational choice by Hogg, if you ask me, since the variance of a single random variable is written $\sigma^2$, not $\sigma$, and e.g., we're defining $\sigma_{ii} = (\sigma_i)^2$. Among other things, if the random vector $X$ has physical units, $\sigma_i$ and $\sigma_{ij}$ have different units.
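Bernstein's tetrahedron example can be verified by brute-force enumeration of the four faces (a sketch using exact rational arithmetic):

```python
from fractions import Fraction
from itertools import product

# The four equally likely faces as (black?, blue?, red?) indicators.
faces = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)]
p_face = Fraction(1, 4)

def pmf(x1, x2, x3):
    return p_face * sum(1 for f in faces if f == (x1, x2, x3))

def marginal(i, v):
    return sum(pmf(*x) for x in product((0, 1), repeat=3) if x[i] == v)

def pair(i, j, vi, vj):
    return sum(pmf(*x) for x in product((0, 1), repeat=3)
               if x[i] == vi and x[j] == vj)

# Pairwise independence: p_ij(a,b) = p_i(a) p_j(b) for every pair.
pairwise = all(
    pair(i, j, a, b) == marginal(i, a) * marginal(j, b)
    for i in range(3) for j in range(3) if i < j
    for a in (0, 1) for b in (0, 1)
)

# Mutual independence fails: e.g. p_123(1,1,1) = 1/4, not (1/2)^3 = 1/8.
mutual = all(
    pmf(a, b, c) == marginal(0, a) * marginal(1, b) * marginal(2, c)
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
)
print(pairwise, mutual)  # True False
```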

$$[X-\mu][X-\mu]^T = \begin{pmatrix} [X_1-\mu_1]^2 & [X_1-\mu_1][X_2-\mu_2] & \cdots & [X_1-\mu_1][X_n-\mu_n] \\ [X_1-\mu_1][X_2-\mu_2] & [X_2-\mu_2]^2 & \cdots & [X_2-\mu_2][X_n-\mu_n] \\ \vdots & \vdots & & \vdots \\ [X_1-\mu_1][X_n-\mu_n] & [X_2-\mu_2][X_n-\mu_n] & \cdots & [X_n-\mu_n]^2 \end{pmatrix} \tag{4.35}$$
Figure 2: Elements of the random matrix $[X-\mu][X-\mu]^T$

i.e., the outer product of the random vector $X-\mu$ with itself, whose elements are spelled out in (4.35) in Figure 2. (Note that this is a random matrix, a straightforward generalization of a random vector.) Thus the variance-covariance matrix, which we write as $\operatorname{Cov}(X)$, is
$$\operatorname{Cov}(X) = \Sigma = E\bigl([X-\mu][X-\mu]^T\bigr) \tag{4.36}$$
Note that $\operatorname{Cov}(X)$ must be a positive semi-definite matrix (and positive definite unless some linear combination of the $\{X_i\}$ has zero variance). This can be shown by considering
$$v^T\operatorname{Cov}(X)\,v = E\bigl(v^T[X-\mu][X-\mu]^Tv\bigr) \tag{4.37}$$
Now, the inner product $v^T[X-\mu]$ is a single random variable (or, if you like, a $1\times 1$ matrix), and so if we call that random variable $Y$, the fact that the inner product is symmetric tells us that
$$[X-\mu]^Tv = Y = v^T[X-\mu] \tag{4.38}$$
and so
$$v^T\operatorname{Cov}(X)\,v = E(Y^2) \geq 0 \tag{4.39}$$

4.4 Transformations of Several RVs

Consider the case where we have $n$ random variables $X_1, X_2, \ldots, X_n$ with a joint pdf $f_X(x)$ and wish to obtain the joint pdf $f_Y(y)$ for $n$ functions $Y_1 = u_1(X)$, $Y_2 = u_2(X)$, ..., $Y_n = u_n(X)$ of those random variables. (We can summarize this as $Y = u(X)$.) For simplicity, we assume the transformation is invertible, so that we can write $X_1 = w_1(Y)$, $X_2 = w_2(Y)$, ..., $X_n = w_n(Y)$, or equivalently $X = w(Y)$. (See Hogg for a discussion of the case where the transformation isn't single-valued.) We refer to the sample space for $X$ as $S$, and the transformation of that space (the sample space for $Y$) as $T$. Consider a subset $A\subset S$ whose transformation is $B\subset T$. Then $X\in A$ is the same event as $Y\in B$, so
$$\int\cdots\int_A f_X(x)\,dx_1\cdots dx_n = P(X\in A) = P(Y\in B) = \int\cdots\int_B f_Y(y)\,dy_1\cdots dy_n \tag{4.40}$$
We can change variables in the first form of the integral, with the result that $A$ is transformed into $B$, and the volume element becomes
$$dx_1\cdots dx_n = \left|\det\left(\frac{\partial x}{\partial y}\right)\right| dy_1\cdots dy_n \tag{4.41}$$
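The positive-semi-definiteness argument can be illustrated numerically: for any sample, the empirical covariance matrix is an average of outer products, so every quadratic form $v^T\Sigma v$ is an average of squares. The data and trial vectors below are synthetic (my own choices, purely for illustration):

```python
import random

# Build a sample with correlated components, form its empirical
# variance-covariance matrix, and check v^T Sigma v > 0 for many v.
random.seed(42)
n, dim = 2000, 3
sample = []
for _ in range(n):
    g1, g2, g3 = (random.gauss(0, 1) for _ in range(3))
    sample.append((g1, g1 + g2, g3))   # first two components correlated

mu = [sum(row[i] for row in sample) / n for i in range(dim)]
sigma = [[sum((row[i] - mu[i]) * (row[j] - mu[j]) for row in sample) / n
          for j in range(dim)] for i in range(dim)]

def quad_form(v):
    # v^T Sigma v, i.e. the (empirical) variance of Y = v^T (X - mu)
    return sum(v[i] * sigma[i][j] * v[j]
               for i in range(dim) for j in range(dim))

trial_vs = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
min_q = min(quad_form(v) for v in trial_vs)
print(sigma[0][1], min_q)  # Cov(X1, X2) near its true value 1; min form > 0
```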

where $\frac{\partial x}{\partial y}$ is the Jacobian matrix
$$\frac{\partial x}{\partial y} = \begin{pmatrix} \frac{\partial x_1}{\partial y_1} & \frac{\partial x_1}{\partial y_2} & \cdots & \frac{\partial x_1}{\partial y_n} \\ \frac{\partial x_2}{\partial y_1} & \frac{\partial x_2}{\partial y_2} & \cdots & \frac{\partial x_2}{\partial y_n} \\ \vdots & \vdots & & \vdots \\ \frac{\partial x_n}{\partial y_1} & \frac{\partial x_n}{\partial y_2} & \cdots & \frac{\partial x_n}{\partial y_n} \end{pmatrix} \tag{4.42}$$
so we can write
$$\int\cdots\int_A f_X(x)\,d^nx = \int\cdots\int_B f_X(w(y))\left|\det\frac{\partial x}{\partial y}\right| d^ny = \int\cdots\int_B f_Y(y)\,d^ny \tag{4.43}$$
In order for this equality to hold for any region $A$, the integrands must be equal everywhere, i.e.,
$$f_Y(y) = f_X(w(y))\left|\det\frac{\partial x}{\partial y}\right| \tag{4.44}$$

4.5 Example #1

Consider the joint pdf⁶
$$f_X(x_1,x_2,x_3) = \begin{cases} 120\,x_1x_2 & 0<x_1,\ 0<x_2,\ 0<x_3,\ x_1+x_2+x_3<1 \\ 0 & \text{otherwise} \end{cases} \tag{4.45}$$
and define the transformation
$$Y_1 = \frac{X_1}{X_1+X_2+X_3} \tag{4.46a}$$
$$Y_2 = \frac{X_2}{X_1+X_2+X_3} \tag{4.46b}$$
$$Y_3 = X_1+X_2+X_3 \tag{4.46c}$$
which has the inverse transformation
$$X_1 = Y_1Y_3 \tag{4.47a}$$
$$X_2 = Y_2Y_3 \tag{4.47b}$$
$$X_3 = (1-Y_1-Y_2)Y_3 \tag{4.47c}$$
The Jacobian matrix of the transformation is
$$\frac{\partial x}{\partial y} = \begin{pmatrix} y_3 & 0 & y_1 \\ 0 & y_3 & y_2 \\ -y_3 & -y_3 & 1-y_1-y_2 \end{pmatrix} \tag{4.48}$$
with determinant
$$\det\frac{\partial x}{\partial y} = y_3^2(1-y_1-y_2) - \bigl(-y_3^2y_1 - y_3^2y_2\bigr) = y_3^2 \tag{4.49}$$
The sample space for $X$ is
$$S = \{0<x_1,\ 0<x_2,\ 0<x_3,\ x_1+x_2+x_3<1\} \tag{4.50}$$
so we need to find the corresponding support space for $Y$:
$$T = \{0<y_1y_3,\ 0<y_2y_3,\ 0<(1-y_1-y_2)y_3,\ y_3<1\} \tag{4.51}$$
To see how we can satisfy this, first note that the first three conditions imply $0<y_3$. This is because, in order to satisfy

⁶Incidentally, this is a real example, a Dirichlet distribution with parameters $\{2,2,1,1\}$.

the first two, we have to have $y_1$, $y_2$ and $y_3$ all positive or all negative. But if they are all negative, then $(1-y_1-y_2)y_3$ cannot be positive. Since we thus know $0<y_3$, the first two conditions imply $0<y_1$ and $0<y_2$. The third condition then implies $0<1-y_1-y_2$, i.e., $y_1+y_2<1$. The necessary and sufficient conditions to describe the transformed sample space are thus
$$T = \{0<y_1,\ 0<y_2,\ y_1+y_2<1,\ 0<y_3<1\} \tag{4.52}$$
Putting it all together, we have a pdf of
$$f_Y(y_1,y_2,y_3) = f_X\bigl(y_1y_3,\ y_2y_3,\ [1-y_1-y_2]y_3\bigr)\,y_3^2 = \begin{cases} 120\,y_1y_2\,y_3^4 & 0<y_1,\ 0<y_2,\ y_1+y_2<1,\ 0<y_3<1 \\ 0 & \text{otherwise} \end{cases} \tag{4.53}$$

4.6 Example #2 (fewer Ys than Xs)

It's also possible to consider transformations between different numbers of random variables. I.e., we can start with $f_{X_1\cdots X_n}(x_1,\ldots,x_n)$ and use it to obtain $f_{Y_1\cdots Y_m}(y_1,\ldots,y_m)$ given transformations $Y_j = u_j(X_1,\ldots,X_n)$, even when $m\neq n$. If $m>n$ it's a bit tricky, since the distribution for the $\{Y_j\}$ is degenerate. But if $m<n$, it's relatively straightforward. You considered the case where $n=2$ and $m=1$ by a different approach, using the joint pdf $f_{X_1X_2}(x_1,x_2)$ to obtain $F_Y(y) = P(u(X_1,X_2)\leq y)$ and differentiating, but we consider a more general technique here. We can make up an additional $n-m$ random variables $\{Y_{m+1},\ldots,Y_n\}$, use the expanded transformation to obtain $f_{Y_1\cdots Y_n}(y_1,\ldots,y_n)$, and then integrate out (marginalize over) these $n-m$ variables to obtain the marginal pdf $f_{Y_1\cdots Y_m}(y_1,\ldots,y_m)$. Any set of additional functions $\{u_{m+1}(x_1,\ldots,x_n),\ldots,u_n(x_1,\ldots,x_n)\}$ will work, as long as the full transformation we end up with is invertible.

As an example, return to the pdf (4.45), but consider the transformation
$$Y_1 = \frac{X_1}{X_1+X_2+X_3} \tag{4.54a}$$
$$Y_2 = \frac{X_2}{X_1+X_2+X_3} \tag{4.54b}$$
and obtain the joint pdf $f_{Y_1Y_2}(y_1,y_2)$. Well, in this case we already know a convenient choice, $Y_3 = X_1+X_2+X_3$. We have thus already done most of the problem, and just need to marginalize (4.53) over $Y_3$ to obtain the pdf⁷
$$f_{Y_1Y_2}(y_1,y_2) = \int_{-\infty}^{\infty} f_{Y_1Y_2Y_3}(y_1,y_2,y_3)\,dy_3 = \int_{0}^{1} 120\,y_1y_2\,y_3^4\,dy_3 = \begin{cases} 24\,y_1y_2 & 0<y_1,\ 0<y_2,\ y_1+y_2<1 \\ 0 & \text{otherwise} \end{cases} \tag{4.55}$$

⁷This turns out to be a Dirichlet distribution as well, this time with parameters $\{2,2,1\}$.

Tuesday 29 September 2015
Review for Prelim Exam One
The exam covers material from the first four weeks of the term, i.e., Hogg sections 1.5-1.10 and 2.1-2.5, and problem sets 1-4.

Thursday 1 October 2015
First Prelim Exam
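The Jacobian determinant in Example #1 can be double-checked numerically with central finite differences (a sketch; the probe point is an arbitrary choice of mine inside the support):

```python
# Verify det(dx/dy) = y3**2 for the inverse map of Example #1:
# x1 = y1*y3, x2 = y2*y3, x3 = (1 - y1 - y2)*y3.

def w(y):  # inverse transformation x = w(y)
    y1, y2, y3 = y
    return [y1 * y3, y2 * y3, (1.0 - y1 - y2) * y3]

def det3(m):  # determinant of a 3x3 matrix by cofactor expansion
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def jacobian(f, y, h=1e-6):
    # central-difference approximation to the matrix dx_i/dy_j
    J = []
    for i in range(3):
        row = []
        for j in range(3):
            yp, ym = list(y), list(y)
            yp[j] += h
            ym[j] -= h
            row.append((f(yp)[i] - f(ym)[i]) / (2 * h))
        J.append(row)
    return J

y = [0.2, 0.3, 0.4]
J = jacobian(w, y)
print(det3(J), y[2] ** 2)  # both ≈ 0.16
```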