Some Special Distributions (Hogg Chapter Three)

STAT 405: Mathematical Statistics I
Fall Semester 2013

Contents

1 Binomial & Related Distributions
  1.1 The Binomial Distribution
    1.1.1 Some advice on calculating (n choose k)
    1.1.2 Properties of the Binomial Distribution
  1.2 The Multinomial Distribution
  1.3 More on the Binomial Distribution
  1.4 The Negative Binomial Distribution
  1.5 The Hypergeometric Distribution

2 The Poisson Distribution
  2.1 Aside: Simulating a Poisson Process
  2.2 Poisson Process and Rate
  2.3 Connection to Binomial Distribution and Derivation of pmf

3 Gamma and Related Distributions
  3.1 Gamma (Γ) Distribution
  3.2 Exponential Distribution
  3.3 Chi-Square (χ²) Distribution
  3.4 Beta (β) Distribution

4 The Normal Distribution

5 Multivariate Normal Distribution
  5.1 Linear Algebra: Reminders and Notation
  5.2 Special Case: Independent Gaussian Random Variables
  5.3 Multivariate Distributions
  5.4 General Multivariate Normal Distribution
  5.5 Consequences of the Multivariate Normal Distribution
    5.5.1 Connection to χ² Distribution
  5.6 Marginal Distribution
  5.7 Conditional Distribution

6 The t- and F Distributions
  6.1 The t distribution
  6.2 The F distribution
  6.3 Student's Theorem

Copyright 2013, John T. Whelan, and all that

Tuesday 8 October 2013
Read Section 3.1 of Hogg

1 Binomial & Related Distributions

1.1 The Binomial Distribution

Recall our first example of a discrete random variable, the number of heads on three flips of a fair coin. In that case, we were able to count the number of outcomes in each event, e.g., there were three ways to get two heads and a tail: {HHT, HTH, THH}, and each of the eight possible outcomes had the same probability, 1/8 = 0.125. If we consider a case where the coin is weighted so that each flip has a 60% chance of being a head and only a 40% chance of being a tail, then the probability of each of the ways to get two heads and a tail is:

  P(HHT) = (0.6)(0.6)(0.4) = 0.144
  P(HTH) = (0.6)(0.4)(0.6) = 0.144
  P(THH) = (0.4)(0.6)(0.6) = 0.144

In each case, the probability is 0.144, so if X is the number of heads in three coin flips, P(X = 2) = 3 × 0.144 = 0.432. We can work out all of the probabilities in this way:

  #H   outcomes            #   prob(each outcome)        tot prob
  0    {TTT}               1   (0.4)^3 = 0.064            0.064
  1    {TTH, THT, HTT}     3   (0.6)(0.4)^2 = 0.096       0.288
  2    {THH, HTH, HHT}     3   (0.6)^2 (0.4) = 0.144      0.432
  3    {HHH}               1   (0.6)^3 = 0.216            0.216

This is an example of a binomial random variable. We have a set of n identical, independent trials, experiments which could each turn out one of two ways. We call one of those possible results "success" (e.g., heads) and the other "failure" (e.g., tails). The probability of success on a given trial is some number p ∈ [0, 1], so the non-negative integer n and real number p are parameters of the distribution. The random variable X is the number of successes in the n trials. Evidently, the possible values for X are 0, 1, 2, ..., n. The probability of x successes is

  p(x) = P(X = x) = (# of outcomes)(prob of each outcome)

We can generalize the discussion above to see that the probability for any particular sequence containing x successes, and therefore n − x failures, is p^x (1−p)^(n−x). In the simple example we just listed all of the outcomes in a particular event, but that won't be practical in general. Instead we fall back on a result from combinatorics. The number of ways of choosing x out of the n trials, which we call "n choose x", is

  (n choose x) = [n(n−1)(n−2)···(n−x+1)] / [x(x−1)···(2)(1)] = n! / [(n−x)! x!]

As a reminder, consider the number of distinct poker hands, consisting of five cards chosen from the 52-card deck. If we consider first the cards dealt out in order on the table in front of us, there are 52 possibilities for the first card, 51 for the second, 50 for the third, 49 for the fourth and 48 for the fifth. So the number of ways five cards could be dealt to us in order is 52·51·50·49·48 = 52!/47!. But that is overcounting the number of distinct poker hands, since we can pick up the five cards and rearrange them any way we like. Since we have 5 choices for the first card, 4 for the second, 3 for the third, 2 for the fourth and 1 for the fifth, any five-card poker hand can be rearranged 5·4·3·2·1 = 5! different ways. So in the list of 52!/47! ordered poker hands, each unordered hand

is represented 5! times. Thus the number of unordered poker hands is 52!/(47! 5!) = (52 choose 5). Returning to the binomial distribution, this makes the pmf

  p(x) = (n choose x) p^x (1−p)^(n−x),   x = 0, 1, 2, ..., n

1.1.1 Some advice on calculating (n choose k)

There is some art to calculating the binomial coëfficient

  (n choose k) = n! / [(n−k)! k!]

For one thing, you basically never want to actually calculate the factorials. For example, consider (100 choose 2). It's easy to calculate if we write

  (100 choose 2) = (100 · 99)/(2 · 1) = 4950

On the other hand, if we write

  (100 choose 2) = 100! / (98! 2!)

we find that 100! and 98! are enormous numbers which will overflow calculators, single precision programs, etc. A few identities which make these calculations easier:

  (n choose k) = n! / [k! (n−k)!] = (n choose n−k)
  (n choose 0) = (n choose n) = n! / (n! 0!) = 1
  (n choose 1) = (n choose n−1) = n! / [(n−1)! 1!] = n

Note that this uses the definition that 0! = 1. If k is small, you can write out the fraction, cancelling out n − k factors to leave k factors in the numerator and k factors (including 1) in the denominator. If n − k is small, you cancel out k factors to leave n − k. If k is really small you can use Pascal's triangle

  n = 0:          1
  n = 1:         1 1
  n = 2:        1 2 1
  n = 3:       1 3 3 1
  n = 4:      1 4 6 4 1

where element k of row n is (n choose k) and is created for n > 0 by the recursion relation

  (n choose k) = (n−1 choose k−1) + (n−1 choose k)

I mention this mostly as a reminder that the binomial coëfficient is what appears in the binomial expansion

  (a + b)^n = Σ_{k=0}^{n} (n choose k) a^k b^(n−k)

This can be used to show the binomial pmf is properly normalized:

  Σ_{x=0}^{n} p(x) = Σ_{x=0}^{n} (n choose x) p^x (1−p)^(n−x) = (p + [1−p])^n = 1^n = 1
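As an aside not in Hogg, here is a minimal python sketch of the two approaches just described: computing a binomial coëfficient directly (modern languages with exact integer arithmetic handle this fine) and working with logarithms of factorials when n and k are large. The helper name log_binom and the specific numbers are just for illustration.

  from math import comb, lgamma, exp

  # exact integer arithmetic; python integers do not overflow
  print(comb(100, 2))              # 4950

  # for very large n and k, work with ln of the factorials instead
  def log_binom(n, k):
      # ln C(n,k) = ln Gamma(n+1) - ln Gamma(k+1) - ln Gamma(n-k+1)
      return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

  print(exp(log_binom(100, 2)))    # approximately 4950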

As a curiosity, note that sometimes a general formula [such as the recursion relation above] might tell us to calculate some sort of impossible binomial coëfficient, like (2 choose 5) or (4 choose 6). Logically, it would seem reasonable to declare these to be zero, since there are no ways, e.g., to pick six different items from a list of four. In fact, there's a sense in which the definition with factorials extends to this. For example,

  (3 choose 4) = 3! / [4! (−1)!]

Now, the factorial of a negative number is not really defined, but if we try to extend the definition, it turns out to be, in some sense, infinite. We will, in the near future, define something known as the gamma function, usually defined by the integral

  Γ(α) = ∫_0^∞ t^(α−1) e^(−t) dt

It's easy to do the integral for α = 1 and show that Γ(1) = 1. Using either integration by parts or parametric differentiation, we can show that Γ(α + 1) = α Γ(α), from which you can show that for non-negative integer n, Γ(n + 1) = n!. If we try to use the recursion relation to find Γ(0), which would be (−1)!, we'd get

  Γ(1) = 0 · Γ(0)

which is not possible if Γ(0) is finite. So it makes some sense to replace 1/(−1)! with zero. Proceeding in this way you can show that Γ(α) also blows up if α is any negative integer. We can also use the Gamma function to extend the definition of the factorial to non-integer values, but that's a matter for another time.

Finally, there's the question of how to calculate binomial probabilities when x and n − x are really large. At this point, you're going to be using a computer anyway. There are a few tricks for dealing with the large factorials, primarily by calculating ln p(x) rather than p(x), but many statistical software packages do a lot of the work for you. For instance, in R, you can calculate the binomial pmf and/or cdf with commands like

  n <- 50
  p <- 0.5
  x <- 0:n
  px <- dbinom(x, n, p)
  Fx <- pbinom(x, n, p)

or with python

  import numpy
  from scipy.stats import binom
  n = 50
  p = 0.5
  x = numpy.arange(n + 1)
  px = binom.pmf(x, n, p)
  Fx = binom.cdf(x, n, p)

1.1.2 Properties of the Binomial Distribution

Returning to the binomial distribution, we can find its moment generating function and use it to find the mean and variance relatively quickly:

  M(t) = E(e^(tX)) = Σ_{x=0}^{n} e^(tx) p(x) = Σ_{x=0}^{n} (n choose x) (p e^t)^x (1−p)^(n−x) = (p e^t + [1−p])^n

It is often easier to work with the logarithm of the mgf, known as the cumulant generating function, which you studied in a Hogg exercise:

  ψ(t) = ln M(t) = n ln(p e^t + [1−p])

We use the chain rule to take its derivative:

  ψ′(t) = n p e^t / (p e^t + [1−p])

from which we get the mean

  E(X) = ψ′(0) = np / (p + [1−p]) = np

and differentiating again gives

  ψ″(t) = n p e^t / (p e^t + [1−p]) − n p e^t · p e^t / (p e^t + [1−p])²

from which we get the variance

  Var(X) = ψ″(0) = np − (np)p = np(1−p)

These should be familiar results; the mean is sort of obvious if you think about it, but deriving them directly requires tricky manipulations of the sums. With the mgf it's child's play.

1.2 The Multinomial Distribution

The binomial distribution can be thought of as a special case of the multinomial distribution, which consists of a set of n identical and independent experiments, each of which has k possible outcomes, with probabilities p_1, p_2, ..., p_k, with p_1 + p_2 + ··· + p_k = 1. For the binomial case, k = 2, p_1 becomes p, and p_2 is 1 − p. Define random variables X_1, X_2, ..., X_{k−1} which are the numbers of experiments out of the n total with each outcome. We could also define X_k, but it is completely determined by the others as X_k = n − X_1 − X_2 − ··· − X_{k−1}. The probability of any particular sequence of outcomes of which x_1 are of the first sort, x_2 of the second sort, etc., with x_1 + x_2 + ··· + x_k = n, is

  p_1^(x_1) p_2^(x_2) ··· p_k^(x_k)

How many different such outcomes are there? Well, there are

  (n choose x_1) = n! / [x_1! (n − x_1)!]

ways to pick the x_1 experiments with the first outcome. Once we've done that, there are n − x_1 possibilities from which to choose the x_2 outcomes of the second sort, so there are

  (n − x_1 choose x_2) = (n − x_1)! / [x_2! (n − x_1 − x_2)!]

ways to do that, or a total of

  (n choose x_1)(n − x_1 choose x_2) = [n! / (x_1!(n − x_1)!)] · [(n − x_1)! / (x_2!(n − x_1 − x_2)!)] = n! / [x_1! x_2! (n − x_1 − x_2)!]

ways to pick the experiments that end up with the first two sorts of outcomes. Continuing in this way, we find the total number of outcomes of this sort to be

  (n choose x_1)(n − x_1 choose x_2)(n − x_1 − x_2 choose x_3) ··· (n − x_1 − ··· − x_{k−2} choose x_{k−1}) = n! / [x_1! x_2! ··· x_k!]

So the joint pmf for these multinomial random variables is

  p(x_1, x_2, ..., x_{k−1}) = [n! / (x_1! x_2! ··· x_k!)] p_1^(x_1) p_2^(x_2) ··· p_k^(x_k),
      x_1 = 0, 1, ..., n;  x_2 = 0, 1, ..., n − x_1;  ...;  x_k = n − x_1 − x_2 − ··· − x_{k−1}
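As a quick numerical check (not from Hogg) of the multinomial pmf, we can compare the formula above to scipy's implementation; the values n = 10 and p = (0.5, 0.3, 0.2) are made up for illustration.

  from math import factorial
  from scipy.stats import multinomial

  n = 10
  p = [0.5, 0.3, 0.2]
  x = [5, 3, 2]                      # one possible outcome with x1 + x2 + x3 = n

  # pmf from the formula n!/(x1! x2! x3!) p1^x1 p2^x2 p3^x3
  coef = factorial(n) // (factorial(x[0]) * factorial(x[1]) * factorial(x[2]))
  print(coef * p[0]**x[0] * p[1]**x[1] * p[2]**x[2])

  # the same number from scipy
  print(multinomial.pmf(x, n=n, p=p))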

Thursday 10 October 2013
Read Section 3.2 of Hogg

1.3 More on the Binomial Distribution

Recall that if we add two independent random variables X_1 and X_2 with mgfs M_1(t) = E(e^(tX_1)) and M_2(t) = E(e^(tX_2)), we can work out the distribution of their sum Y = X_1 + X_2 by finding its mgf:

  M_Y(t) = E(e^(tY)) = E(e^(t[X_1+X_2])) = E(e^(tX_1) e^(tX_2)) = E(e^(tX_1)) E(e^(tX_2)) = M_1(t) M_2(t)

where we have used the fact that the expectation value of a product of functions of independent random variables is the product of their expectation values. Now suppose we add independent binomial random variables: X_1 has n_1 trials with a probability of p (the notation introduced by Hogg calls this a b(n_1, p)) and X_2 has n_2 trials, also with a probability of p (which Hogg would call b(n_2, p)). Their mgfs are thus

  M_1(t) = (p e^t + [1−p])^(n_1)   and   M_2(t) = (p e^t + [1−p])^(n_2)

If we call their sum Y = X_1 + X_2, its mgf is

  M_Y(t) = M_1(t) M_2(t) = (p e^t + [1−p])^(n_1+n_2)

but this is just the mgf of a b(n_1 + n_2, p) distribution. This important result also makes sense from the definition of the binomial distribution. Adding the number of successes in n_1 trials and in n_2 more trials is the same as counting the number of successes in n_1 + n_2 trials, as long as all the trials are independent, and the probability for success on each one is p.

1.4 The Negative Binomial Distribution

The binomial distribution is appropriate for describing a situation where the number of trials is set in advance. Sometimes, however, one decides to do as many trials as are needed to obtain a certain number of successes. The random variable in that situation would then be the number of trials that were required, or equivalently, the number of failures that occurred. This is known as a negative binomial random variable: given independent trials, each with probability p of success, X is the number of failures before the rth success. This probability is a little more involved to work out. To get X = x, we have to have r − 1 successes and x failures in the first x + r − 1 trials, and then a success in the last trial, so the pmf is

  p(x) = (x + r − 1 choose r − 1) p^(r−1) (1−p)^x · p = (x + r − 1 choose r − 1) p^r (1−p)^x

We can have any non-negative integer number of failures, so the pmf is defined for x = 0, 1, 2, .... You will explore this distribution on the homework, and show that the mgf is

  M(t) = p^r [1 − (1−p) e^t]^(−r)

The factor in brackets is a binomial raised to a negative power, which is where the name "negative binomial distribution" comes from. (A short numerical check of the pmf appears after the aside below.) Note that in the special case r = 1, this is the number of failures before the first success, which is the geometric distribution which you've considered on a prior homework.

As an aside, the distinction between an experiment where you've decided in advance to do n trials, and one where you've decided to stop after r successes, leads to a complication in classical statistical inference called the stopping problem. The problem arises because in classical, or frequentist, statistical inference, you have to analyze the results of your experiment in terms of what would happen if you repeated the same experiment many times, and think about how likely results like the ones you saw would be given various statistical models and parameter values. To answer such questions, you have to know what experiment you were planning to do, not just what observations you actually made. One of the advantages of Bayesian statistical inference, which asks the more direct question of how likely various models and parameter values are, given what you actually observed, is that you typically don't have to say what you'd do if you repeated the experiment.
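Here is the short numerical check of the negative binomial pmf promised above; it is not from Hogg, and the values of p, r, and x are arbitrary. Note that scipy's nbinom uses the same convention as these notes, counting failures before the rth success.

  from math import comb
  from scipy.stats import nbinom

  p, r = 0.3, 4      # success probability and required number of successes
  x = 6              # number of failures before the r-th success

  # direct formula: C(x+r-1, r-1) p^r (1-p)^x
  print(comb(x + r - 1, r - 1) * p**r * (1 - p)**x)

  # scipy's negative binomial pmf, parametrized by (r, p)
  print(nbinom.pmf(x, r, p))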

1.5 The Hypergeometric Distribution

Both the binomial and negative binomial distributions result from repeated independent trials with the same probability p of success on each trial. This is a situation that's sometimes called sampling with replacement, since it's what would happen if you put, e.g., 35 red balls and 65 white balls in a bowl, drew one at random, noted its color, put it back, mixed up the balls, and repeated. Each time there would be a probability of 35/100 of drawing a red ball. If, on the other hand, you picked the ball out and didn't put it back, the probability of drawing a red ball on the next try would change. If the first ball was red, you'd have only a 34/99 chance of the next one being red (since there are now 34 red balls and still 65 white balls in the bowl), while if the first one was white, the probability for the second one to be red would go up to 35/99. This is called sampling without replacement. In practice, making some sort of huge tree diagram for all of the conditional probabilities, draw after draw, is impractical, so instead, if you want to know the probability to draw, say, three red balls out of ten, you count up the number of ways to do it. If you pick 10 balls out of a bowl containing 100, there are 100!/(90! 10!) = (100 choose 10) ways to do that. To count the number of those ways that have 3 red balls and 7 white balls, you need to count the number of ways to pick 3 of the 35 red balls, which is (35 choose 3) = 35!/(32! 3!), times the number of ways to pick 7 of the 65 white balls, which is (65 choose 7) = 65!/(58! 7!), so the probability of drawing exactly 3 red balls out of 10 when sampling without replacement from a bowl with 35 red balls out of 100 is

  [(35 choose 3)(65 choose 7)] / (100 choose 10)

In general, if there are N balls, of which D are red, and we draw n of them without replacement, the number of red balls in our sample will be a hypergeometric random variable X whose pmf is

  p(x) = (D choose x)(N − D choose n − x) / (N choose n)

As you might guess from the name, the mgf for a hypergeometric distribution is a hypergeometric function, which doesn't simplify things much. When D and N − D are both large compared to n, the hypergeometric distribution reduces approximately to the binomial distribution. Jaynes [E. T. Jaynes, Probability Theory: The Logic of Science (Cambridge, 2003)] uses the hypergeometric distribution in many of his examples, to avoid making the simplifying assumption of sampling with replacement.
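A quick check (not from the notes) of the red-ball example above, using scipy's hypergeometric distribution; scipy's convention hypergeom.pmf takes, in order, the observed count, the population size, the number of "successes" in the population, and the number of draws.

  from math import comb
  from scipy.stats import hypergeom

  N, D, n = 100, 35, 10   # total balls, red balls, number drawn
  x = 3                   # red balls in the sample

  # direct formula: C(D,x) C(N-D, n-x) / C(N, n)
  print(comb(D, x) * comb(N - D, n - x) / comb(N, n))

  # scipy's hypergeometric pmf
  print(hypergeom.pmf(x, N, D, n))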

2 The Poisson Distribution

Consider the numerous statistical statements you get like:

1. On average 83 people per day are killed in traffic accidents in the US.
2. On average there are 367 gamma-ray bursts detected per year by orbiting satellites.
3. During a rainstorm, an average of 994 raindrops falls on a square foot of ground each minute.

(I fabricated the last few significant figures in each of these numbers to produce a concrete example.)

In each of these cases, the number of events, X, occurring in one representative interval is a discrete random variable with a probability mass function. The average number of events occurring is a parameter of this distribution, which Hogg writes as m. From the information given above, we only know the mean value of the distribution, E(X) = m. The extra piece of information that defines a Poisson random variable is that each of the discrete events to be counted is independent of the others.

2.1 Aside: Simulating a Poisson Process

To underline what this independence of events means, and further to illustrate the kind of situation described by a Poisson distribution, consider the following thought experiment. Suppose there's a field divided into a million square-foot patches, and I'm told that there is an average of 4.5 beetles per square-foot patch, with the location of each beetle independent of the others. If I wanted to simulate that situation, I could distribute 4,500,000 beetles randomly among the one million patches. I'd do this by placing the beetles one at a time, randomly choosing a patch among the million available for each beetle, without regard to where any of the other beetles were placed. (A short simulation sketch appears below, after the discussion of rates.) The number of beetles in any one patch would then be pretty close to a Poisson random variable. It wouldn't be exactly a Poisson rv, because the requirement of exactly 4,500,000 beetles total would break the independence of the different patches, but in the limit of infinitely many patches, and a correspondingly large total number of beetles, it would become an arbitrarily good approximation. In fact, the numbers of beetles in each of the one million patches would make a good statistical ensemble for approximately describing the probability distribution of the Poisson random variable.

2.2 Poisson Process and Rate

The scenarios described by a Poisson random variable are often associated with some sort of a rate: 83 deaths per day, 367 GRBs per year, 994 raindrops per square foot per minute, 4.5 beetles per square foot. We talk about a Poisson process with an associated rate α (e.g., a number of events per unit time), and if we count all the events in a certain interval of size T (e.g., a specified duration of time), it is a Poisson random variable with parameter m = αT. So for example, if the clicks on a Geiger counter are described by a Poisson process with a rate of 4.5 clicks per minute, the number of clicks in an arbitrarily chosen one-minute interval of time will be a Poisson random variable with m = 4.5. If the interval of time is 30 seconds (half a minute), the Poisson parameter will be m = (4.5/minute)(0.5 minute) = 2.25. If we count the clicks in four minutes, the number of clicks will be a Poisson random variable with m = (4.5/minute)(4 minutes) = 18. Note that m will always be a number, and will not contain a "per minute" or "per square foot", or anything. That will be part of the rate α.

2.3 Connection to Binomial Distribution and Derivation of pmf

The key information that lets us derive the pmf of a Poisson distribution is what happens if you break the interval up into smaller pieces. If there's a Poisson process at play, the number of events in each subinterval will also be a Poisson random variable, and they will all be independent of each other. So if we divide the day into 100 equal pieces of 14 minutes 24 seconds each, the number of traffic deaths in each is an independent Poisson random variable with a mean value of 0.83. Now, in practice this assumption will often not quite be true: the rate of traffic deaths is higher at some times of the day than others, some patches of ground may be more prone to be rained on because of wind patterns, etc., but we can imagine an idealized situation in which this subdivision works.
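To make the beetle thought experiment from Section 2.1 concrete, here is a minimal simulation sketch in python (not from the notes). It places 4,500,000 beetles uniformly at random among 1,000,000 patches and compares the fraction of patches with k beetles to the Poisson pmf with m = 4.5.

  import numpy as np
  from scipy.stats import poisson

  rng = np.random.default_rng(0)
  n_patches = 1_000_000
  n_beetles = 4_500_000                 # an average of 4.5 beetles per patch

  # place each beetle in a uniformly chosen patch, independently of the others
  patches = rng.integers(0, n_patches, size=n_beetles)
  counts = np.bincount(patches, minlength=n_patches)

  # empirical fraction of patches with k beetles vs. the Poisson(4.5) pmf
  for k in range(8):
      print(k, (counts == k).mean(), poisson.pmf(k, 4.5))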

Okay, so how do we get the pmf? Take the interval in question (time interval, area of ground, or whatever) and divide it into n pieces. Each one of them will have an average of m/n events. If we make n really big, so that m/n ≪ 1, the probability of getting one event in that little piece will be small, and the probability that two or more of them happen to occur in the same piece is even smaller, and we can ignore it to a good approximation. We can always make the approximation better by making n bigger. That means that the number of events in that little piece, call it Y, has a pmf of

  p_Y(0) ≈ 1 − p
  p_Y(1) ≈ p
  p_Y(y > 1) ≈ 0

In order to have E(Y) = m/n, the probability of an event occurring in that little piece has to be p = m/n. But we have now described the conditions for a binomial distribution! Each of the n tiny sub-pieces of the interval is like a trial, a piece with an event is a success, and a piece with no event is a failure. So the pmf for this Poisson random variable must be the limit of a binomial distribution as the number of trials gets large:

  p(x) = lim_{n→∞} [n! / (x!(n−x)!)] (m/n)^x (1 − m/n)^(n−x)
       = (m^x / x!) lim_{n→∞} [n! / (n^x (n−x)!)] (1 − m/n)^n (1 − m/n)^(−x)

Now, the ratio of factorials is a product of x things:

  n! / (n−x)! = n(n−1)(n−2)···(n−x+1)

The last factor is of course also the product of x things, i.e., x identical copies of

  1 / (1 − m/n) = n / (n − m)

But that means the two of them together give you

  [n! / (n−x)!] (1/n^x) (1 − m/n)^(−x) = [n/(n−m)] [(n−1)/(n−m)] ··· [(n−x+1)/(n−m)]

which is the product of x fractions, each of which goes to 1 as n goes to infinity, so we can lose that factor in the limit and get

  p(x) = (m^x / x!) lim_{n→∞} (1 − m/n)^n = (m^x / x!) e^(−m)

where we've used the exponential function

  e^α = lim_{n→∞} (1 + α/n)^n

If we recall that the exponential function also has the Maclaurin series

  e^α = Σ_{k=0}^{∞} α^k / k!

we can see that the pmf is normalized:

  Σ_{x=0}^{∞} p(x) = Σ_{x=0}^{∞} (m^x / x!) e^(−m) = e^(−m) Σ_{x=0}^{∞} m^x / x! = e^(−m) e^m = 1
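A short numerical illustration (not from the notes) of the limit just derived: for fixed m and x, the binomial pmf with p = m/n approaches the Poisson pmf as n grows. The values m = 3 and x = 5 are arbitrary.

  from scipy.stats import binom, poisson

  m, x = 3.0, 5
  for n in [10, 100, 1000, 10000]:
      print(n, binom.pmf(x, n, m / n))
  print("Poisson limit:", poisson.pmf(x, m))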

We can also use this series to get the mgf:

  M(t) = Σ_{x=0}^{∞} e^(tx) (m^x / x!) e^(−m) = e^(−m) Σ_{x=0}^{∞} (m e^t)^x / x! = e^(−m) e^(m e^t) = e^(m[e^t − 1])

To find the mean and variance, it's useful once again to use the cumulant generating function ψ(t) = ln M(t):

  ψ(t) = m(e^t − 1)

Differentiating gives us

  ψ′(t) = m e^t   and   ψ″(t) = m e^t

and so the mean is

  E(X) = ψ′(0) = m

which was the definition we started with, and the variance is

  Var(X) = ψ″(0) = m

Note that these are also the limits of the mean and variance of the binomial distribution. We can also use the mgf to show what happens when we add two independent Poisson random variables with means m_1 and m_2; the mgf of their sum is

  M(t) = e^(m_1[e^t − 1]) e^(m_2[e^t − 1]) = e^((m_1+m_2)[e^t − 1])

which is the mgf of a Poisson random variable with mean m_1 + m_2, so the sum of two Poisson rvs is itself a Poisson rv.

Tuesday 15 October 2013: No Class (Monday Schedule)

Thursday 17 October 2013
Read Section 3.3 of Hogg

3 Gamma and Related Distributions

3.1 Gamma (Γ) Distribution

We now consider our first family of continuous distributions, the Gamma distribution. A Gamma random variable has a pdf with two parameters α > 0 and β > 0:

  f(x) = [1 / (Γ(α) β^α)] x^(α−1) e^(−x/β),   0 < x < ∞

We sometimes say that such a random variable has a Γ(α, β) distribution. (Note that in an unfortunate collision of notation, Γ(α) is a function which returns a number for each value of α, while Γ(α, β) is a label we use to refer to the distribution corresponding to a choice of α and β.) As we'll see, the β parameter just changes the scale of the distribution function, but α actually changes the shape. We will show how special cases of the Gamma distribution describe situations of interest, notably the exponential distribution with rate λ, which is Γ(1, 1/λ), and the chi-squared distribution with r degrees of freedom (also known as χ²(r)), which is Γ(r/2, 2). For now, though, let's look at the general Gamma distribution. First of all, consider the constant 1/(Γ(α) β^α). The Γ(α) in the denominator is the Gamma function

  Γ(α) = ∫_0^∞ u^(α−1) e^(−u) du

which we introduced last week as a generalization of the factorial function, so that if n is a non-negative integer, Γ(n + 1) = n!.

That constant is just what we need to make sure the distribution function is normalized:

  ∫_0^∞ f(x) dx = [1 / (Γ(α) β^α)] ∫_0^∞ x^(α−1) e^(−x/β) dx
               = [1 / Γ(α)] ∫_0^∞ (x/β)^(α−1) e^(−x/β) (dx/β)
               = [1 / Γ(α)] ∫_0^∞ u^(α−1) e^(−u) du = Γ(α)/Γ(α) = 1

where we have made the change of variables u = x/β. The cdf for the Gamma distribution is

  F(x) = ∫_0^x f(y) dy = [1 / Γ(α)] ∫_0^x (y/β)^(α−1) e^(−y/β) (dy/β) = [1 / Γ(α)] ∫_0^(x/β) u^(α−1) e^(−u) du

which is usually written in terms of either the incomplete gamma function

  γ(y; α) = ∫_0^y u^(α−1) e^(−u) du

defined so that lim_{y→∞} γ(y; α) = Γ(α), or the standard Gamma cdf

  F(y; α) = [1 / Γ(α)] ∫_0^y u^(α−1) e^(−u) du

defined so that lim_{y→∞} F(y; α) = 1. One or both of these functions may be available in your favorite software package, or tabulated in a book. For instance, F(y; α) is tabulated in the back of Devore, Probability and Statistics for Engineering and the Sciences, the book you presumably used for intro Probability and Statistics. Hogg does not tabulate these, but it does include some values of the chi-squared cdf, which is a special case of the Gamma cdf.

Note that if we define a random variable Y = X/β, where X is a Γ(α, β) random variable, its pdf is

  f_Y(y) = dP/dy = (dP/dx)/(dy/dx) = f_X(βy) · β = [1 / Γ(α)] y^(α−1) e^(−y)

which is a Γ(α, 1) distribution. This is what we mean when we say β is a scale parameter. Now we turn to the mgf of the Gamma distribution, which is

  M(t) = E(e^(tX)) = [1 / (Γ(α) β^α)] ∫_0^∞ x^(α−1) e^(−x/β) e^(tx) dx = [1 / (Γ(α) β^α)] ∫_0^∞ x^(α−1) exp(−[1/β − t] x) dx

If we require t < 1/β and make the substitution u = (1/β − t) x, so that x = u/(1/β − t) = βu/(1 − βt), this becomes

  M(t) = [1 / (Γ(α) β^α)] [β/(1 − βt)]^α ∫_0^∞ u^(α−1) e^(−u) du = (1 − βt)^(−α)

If we again construct the cumulant generating function

  ψ(t) = ln M(t) = −α ln(1 − βt)

we find the derivative

  ψ′(t) = αβ / (1 − βt)

so the mean is

  E(X) = ψ′(0) = αβ

and the second derivative

  ψ″(t) = αβ² / (1 − βt)²

so the variance is

  Var(X) = ψ″(0) = αβ²

which should be familiar results from elementary probability and statistics. Even more useful than the mean and variance, the mgf can be used to show what happens when we add two independent Gamma random variables with the same scale parameter, say X_1 which is Γ(α_1, β) and X_2 which is Γ(α_2, β). The mgf of their sum is

  M(t) = M_1(t) M_2(t) = (1 − βt)^(−α_1) (1 − βt)^(−α_2) = (1 − βt)^(−[α_1+α_2])

which we see is the mgf of a Γ(α_1 + α_2, β) random variable. So if you add Gamma rvs with the same scale parameter, their sum is another Gamma rv whose shape parameter is the sum of the individual shape parameters.

3.2 Exponential Distribution

Suppose we have a Poisson process with rate λ, so that the probability mass function of the number of events occurring in any period of time of duration t is

  P(n events in t) = (λt)^n e^(−λt) / n!

We can show that, if you start counting at some moment, the time you have to wait until k more events have occurred is a Γ(k, 1/λ) random variable. In Hogg they show this by summing the Poisson distribution from 0 to k − 1, but once we know what happens when you sum Gamma random variables, all we actually have to do is show that the waiting time for the first event is a Γ(1, 1/λ) random variable. If you start waiting for the second event once the first has happened, that waiting time is another independent Γ(1, 1/λ) random variable, and so forth. The waiting time for the kth event is thus the sum of k independent Γ(1, 1/λ) random variables, which by the addition property we've just seen is a Γ(k, 1/λ) random variable.

So we turn to the question of the waiting time for the first event, and return to the Poisson distribution. In particular, evaluating the pmf at n = 0 we get

  P(no events in t) = (λt)^0 e^(−λt) / 0! = e^(−λt)

Now pick some arbitrary moment and let X be the random variable describing how long we have to wait for the next event. If X ≤ t, that means there are one or more events occurring in the interval of length t starting at that moment, so the probability of this is

  P(X ≤ t) = 1 − P(no events in t) = 1 − e^(−λt)

But that is the definition of the cumulative distribution function, so

  F(x) = P(X ≤ x) = 1 − e^(−λx)

Note that it doesn't make sense for X to take on negative values, which is good, since F(0) = 0. This means that technically,

  F(x) = 0             for x < 0
  F(x) = 1 − e^(−λx)   for 0 ≤ x < ∞

We can differentiate with respect to x to get the pdf

  f(x) = 0            for x < 0
  f(x) = λ e^(−λx)    for 0 ≤ x < ∞

This is known as the exponential distribution, and if we recall that Γ(1) = 0! = 1, we see that this is indeed the Gamma distribution where α = 1 and β = 1/λ.
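Here is a small simulation sketch (not from the notes) of the waiting-time argument above: the time until the kth event of a Poisson process with rate λ is the sum of k independent exponential waiting times, and its mean and variance should match those of a Γ(k, 1/λ) distribution. The values λ = 2 and k = 3 are arbitrary; note that scipy's gamma takes the shape a = α and scale = β.

  import numpy as np
  from scipy.stats import gamma

  rng = np.random.default_rng(1)
  lam, k = 2.0, 3
  n_sim = 100_000

  # waiting time for the k-th event = sum of k independent exponential(1/lam) gaps
  waits = rng.exponential(scale=1/lam, size=(n_sim, k)).sum(axis=1)

  # compare with the Gamma(alpha = k, beta = 1/lam) mean and variance
  print(waits.mean(), gamma.mean(a=k, scale=1/lam))   # both near k/lam = 1.5
  print(waits.var(), gamma.var(a=k, scale=1/lam))     # both near k/lam^2 = 0.75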

3.3 Chi-Square (χ²) Distribution

We turn now to another special case of the Gamma distribution. If we set α to r/2, where r is some positive integer, and β to 2, we get the pdf

  f(x) = [1 / (Γ(r/2) 2^(r/2))] x^(r/2 − 1) e^(−x/2),   0 < x < ∞

which is known as a chi-square distribution with r degrees of freedom, or χ²(r). Note that if we add independent chi-square random variables, e.g., X_1 which is χ²(r_1) = Γ(r_1/2, 2) and X_2 which is χ²(r_2) = Γ(r_2/2, 2), their sum X_1 + X_2 is Γ([r_1 + r_2]/2, 2) = χ²(r_1 + r_2). So in particular the sum of r independent χ²(1) random variables is a χ²(r) random variable. Next week, we'll see that a χ²(1) random variable is the square of a standard normal random variable, so a χ²(r) is the sum of the squares of r independent standard normal random variables.

Note that Table II in Appendix C of Hogg has some percentiles (values of x for which P(X < x) is some given value) for chi-square distributions with various numbers of degrees of freedom. This means that if you have a Gamma random variable which you can scale to be a chi-square random variable (which is basically possible if α is any integer or half-integer), you can get percentiles of that random variable.

3.4 Beta (β) Distribution

Please read about the β distribution in section 3.3 of Hogg.

Tuesday 22 October 2013
Read Section 3.4 of Hogg

4 The Normal Distribution

A random variable X follows a normal distribution, also known as a Gaussian distribution, with parameters µ and σ² > 0, known as N(µ, σ²), if its pdf is

  f(x) = [1 / (σ √(2π))] exp( −(x − µ)² / (2σ²) )

To show that this is normalized, we take the integral

  ∫_{−∞}^{∞} f(x) dx = [1 / (σ √(2π))] ∫_{−∞}^{∞} e^(−(x−µ)²/(2σ²)) dx = [1 / √(2π)] ∫_{−∞}^{∞} e^(−z²/2) dz

where we have made the substitution z = (x − µ)/σ. The integral

  I = ∫_{−∞}^{∞} e^(−z²/2) dz

is a bit tricky; there's no ordinary function whose derivative is e^(−z²/2), so we can't just do an indefinite integral and evaluate at the endpoints. But we can do the definite integral by writing

  I² = [∫_{−∞}^{∞} e^(−x²/2) dx] [∫_{−∞}^{∞} e^(−y²/2) dy] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} e^(−(x²+y²)/2) dx dy

If we interpret this as a double integral in Cartesian coördinates, we can change to polar coördinates r and φ, and write

  I² = ∫_0^{2π} ∫_0^∞ e^(−r²/2) r dr dφ = 2π ∫_0^∞ e^(−r²/2) r dr = 2π [−e^(−r²/2)]_0^∞ = 2π

so

  I = ∫_{−∞}^{∞} e^(−z²/2) dz = √(2π)

and the pdf does integrate to one. To get the mgf, we have to take the integral

  M(t) = [1 / (σ √(2π))] ∫_{−∞}^{∞} exp( −(x − µ)²/(2σ²) + tx ) dx

If we complete the square in the exponent, we get

  −(x − µ)²/(2σ²) + tx = −(x − [µ + tσ²])²/(2σ²) + µt + t²σ²/2

so

  M(t) = [1 / (σ √(2π))] e^(µt + t²σ²/2) ∫_{−∞}^{∞} exp( −(x − [µ + tσ²])²/(2σ²) ) dx
       = [1 / √(2π)] e^(µt + t²σ²/2) ∫_{−∞}^{∞} e^(−z²/2) dz = e^(µt + t²σ²/2)

This means that the cumulant generating function is

  ψ(t) = ln M(t) = µt + σ²t²/2

Taking derivatives gives

  ψ′(t) = µ + σ²t   and   ψ″(t) = σ²

so the mean is

  E(X) = ψ′(0) = µ

and the variance is

  Var(X) = ψ″(0) = σ²

which means that the parameters µ and σ are the mean and standard deviation of the distribution, as their names suggest. Note that this is in some sense the simplest possible distribution with a given mean and variance. For a general random variable, since ψ(0) = 0, ψ′(0) = E(X) and ψ″(0) = Var(X), the first few terms of the Maclaurin series for ψ(t) must be

  ψ(t) = t E(X) + (t²/2) Var(X) + O(t³)

Given a random variable X which follows a N(µ, σ²) distribution, we can define Z = (X − µ)/σ. Its pdf will be

  f_Z(z) = σ f_X(µ + zσ) = [1 / √(2π)] e^(−z²/2)

which is a N(0, 1) distribution, also known as a standard normal distribution. The cdf of a N(µ, σ²) random variable will be

  F(x) = [1 / (σ √(2π))] ∫_{−∞}^{x} e^(−(u−µ)²/(2σ²)) du = [1 / √(2π)] ∫_{−∞}^{(x−µ)/σ} e^(−t²/2) dt

Again, e^(−t²/2) is not the derivative of any known function, but it's useful enough that we define a function

  Φ(z) = [1 / √(2π)] ∫_{−∞}^{z} e^(−t²/2) dt

which is tabulated in lots of places. In terms of this, the cdf for a N(µ, σ²) rv is

  P(X ≤ x) = Φ((x − µ)/σ)
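For completeness, a two-line check (not from the notes) that scipy's normal cdf agrees with Φ((x − µ)/σ); the numbers µ = 10, σ = 2, x = 13 are arbitrary.

  from scipy.stats import norm

  mu, sigma, x = 10.0, 2.0, 13.0
  print(norm.cdf((x - mu) / sigma))        # Phi((x - mu)/sigma), standard normal cdf
  print(norm.cdf(x, loc=mu, scale=sigma))  # equivalently, with loc/scale arguments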

If we add two independent Gaussian random variables, X_1 and X_2, following N(µ_1, σ_1²) and N(µ_2, σ_2²) distributions, respectively, their sum has the mgf

  M(t) = M_1(t) M_2(t) = exp(tµ_1 + t²σ_1²/2) exp(tµ_2 + t²σ_2²/2) = exp( t[µ_1 + µ_2] + t²[σ_1² + σ_2²]/2 )

which is the mgf of a N(µ_1 + µ_2, σ_1² + σ_2²) distribution. In general, if we add independent random variables, their sum has a mean which is the sum of the means and a variance which is the sum of the variances, but what's notable here is that the sum obeys a normal distribution, so it's characterized only by its mean and variance.

Finally, suppose we have r independent Gaussian random variables {X_i} with means µ_i and variances σ_i². Consider the combination

  Y = Σ_{i=1}^{r} [(X_i − µ_i)/σ_i]²

We can show that this obeys a χ²(r) distribution:

  f_Y(y) = [1 / (Γ(r/2) 2^(r/2))] y^(r/2 − 1) e^(−y/2)

As we saw last time, the sum of r independent χ²(1) random variables is a χ²(r) random variable, so all we need to do is show that if X is N(µ, σ²), so that Z = (X − µ)/σ is N(0, 1), then Y = [(X − µ)/σ]² = Z² is χ²(1). Now, since

  f_Z(z) = [1 / √(2π)] e^(−z²/2),   −∞ < z < ∞

we can't quite use the usual formalism for transformation of pdfs, since the transformation Y = Z² is not invertible. But since f_Z(−z) = f_Z(z), it's not hard to see that if we define a rv W = |Z|, it must have a pdf

  f_W(w) = f_Z(w) + f_Z(−w) = 2 f_Z(w) = [2 / √(2π)] e^(−w²/2),   0 < w < ∞

and then we can use the transformation y = w², w = y^(1/2), to work out

  f_Y(y) = dP/dy = (dP/dw)(dw/dy) = f_W(y^(1/2)) (1/2) y^(−1/2) = [1 / √(2π)] y^(−1/2) e^(−y/2),   0 < y < ∞

If we recall the χ²(1) pdf from last time, it was

  f(y) = [1 / (Γ(1/2) 2^(1/2))] y^(1/2 − 1) e^(−y/2),   0 < y < ∞

so the pdf of Y = Z² is the χ²(1) pdf, if the value of Γ(1/2) is √π. Now, if we think about it, that has to be the case, in order for the two pdfs to be normalized, but we can work out the value directly. Recall the Gamma function

  Γ(α) = ∫_0^∞ t^(α−1) e^(−t) dt

which is a finite positive number for any α > 0. For positive integer n, we know that Γ(n + 1) = n!. Thus

  Γ(1/2) = ∫_0^∞ t^(−1/2) e^(−t) dt

We can show that the integral is well-behaved at the lower limit of t = 0, and evaluate it, by changing variables to u = √(2t), so that t = u²/2 and du = dt/√(2t); thus

  Γ(1/2) = √2 ∫_0^∞ e^(−u²/2) du = √2 · (1/2)√(2π) = √π

as expected. We've used the symmetry of the integrand to say that

  ∫_0^∞ e^(−u²/2) du = (1/2) ∫_{−∞}^{∞} e^(−u²/2) du = √(2π)/2
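A quick numerical sanity check (not from the notes) of the two facts just derived: Γ(1/2) = √π, and the square of a standard normal sample looks like a χ²(1) sample. The sample size and random seed are arbitrary.

  import numpy as np
  from math import gamma, pi, sqrt
  from scipy.stats import chi2, kstest

  # Gamma(1/2) should equal sqrt(pi)
  print(gamma(0.5), sqrt(pi))

  # squares of standard normal draws should be consistent with chi-square(1)
  rng = np.random.default_rng(2)
  z = rng.standard_normal(100_000)
  print(kstest(z**2, chi2(df=1).cdf))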

Thursday 24 October 2013
Read Section 3.5 of Hogg

5 Multivariate Normal Distribution

5.1 Linear Algebra: Reminders and Notation

If A is an m × n matrix with elements A_ij (with i = 1, ..., m labelling the row and j = 1, ..., n the column), and B is an n × p matrix with elements B_jk, then their product C = AB is an m × p matrix, as shown in Figure 1, so that

  C_ik = Σ_{j=1}^{n} A_ij B_jk

If A is an m × n matrix, B = Aᵀ (the transpose) is an n × m matrix with elements B_ij = A_ji.

If v is an n-element column vector (which is an n × 1 matrix) and A is an m × n matrix, w = Av is an m-element column vector (i.e., an m × 1 matrix), with elements

  w_i = Σ_{j=1}^{n} A_ij v_j

If u is an n-element column vector, then uᵀ is an n-element row vector (a 1 × n matrix), uᵀ = (u_1  u_2  ···  u_n). If u and v are n-element column vectors, uᵀv is a number, known as the inner product:

  uᵀv = u_1 v_1 + u_2 v_2 + ··· + u_n v_n = Σ_{i=1}^{n} u_i v_i

Figure 1: Expansion of the product C = AB, showing C_ik = Σ_{j=1}^{n} A_ij B_jk = A_i1 B_1k + A_i2 B_2k + ··· + A_in B_nk. (The explicit m × p array of sums is not reproduced here.)

If v is an m-element column vector, and w is an n-element column vector, A = vwᵀ is an m × n matrix (the outer product) with elements

  A_ij = v_i w_j

If M and N are n × n matrices, the determinant obeys det(MN) = det(M) det(N). If M is an n × n matrix (known as a square matrix), the inverse matrix M⁻¹ is defined by

  M⁻¹ M = 1_{n×n} = M M⁻¹

where 1_{n×n} is the n × n identity matrix (ones on the diagonal, zeroes everywhere else). If M⁻¹ exists, we say M is invertible.

If M is a real, symmetric n × n matrix, so that Mᵀ = M, i.e., M_ji = M_ij, there is a set of n orthonormal eigenvectors {v_1, v_2, ..., v_n} with real eigenvalues {λ_1, λ_2, ..., λ_n}, so that M v_i = λ_i v_i. Orthonormal means

  v_iᵀ v_j = δ_ij = 1 if i = j,  0 if i ≠ j

where we have introduced the Kronecker delta symbol δ_ij. The eigenvalue decomposition means

  M = Σ_{i=1}^{n} λ_i v_i v_iᵀ

18 The determinant is detm n i λ i If none of the eigenvalues {λ i } are zero, M is invertible, and the inverse matrix is M n i λ i v i v T i 5 If all of the eigenvalues {λ i } are positive, we say M is positive definite If none of the eigenvalues {λ i } are negative, we say M is positive semi-definite 5 Special Case: Independent Gaussian Random Variables Before considering the general multivariate normal distribution, consider the case of n independent normally-distributed random variables {X i } with means {µ i } and variances {σ i } The pdf for X i, the ith random variable, is and its mgf is f i x i exp x i µ i σ i π σ i M i t i exp t i µ i + t i σi If we consider the random variables X i to be the elements of a random vector X X X X n 55 its expectation value is µ µ µ n EX µ and its variance-covariance matrix is σ σ CovX Σ σn which is diagonal because the different X i s are independent of each other and therefore have zero covariance We can thus write the joint mgf for these random variables as n n [ Mt M i t i exp t i µ i + ] t i σi i i exp t T µ + 58 tt Σt We can also write the joint pdf fx n f i x i i n i πσ i exp n x i µ i 59 in matrix form if we consider a few operations on the matrix Σ First, since it s a diagonal matrix, its determinant is just the product of its diagonal entries: det Σ i σ i n σi 5 i 8

and, for that matter,

  det(2πΣ) = Π_{i=1}^{n} 2πσ_i²

Also, we can invert the matrix to get

  Σ⁻¹ = diag(1/σ_1², 1/σ_2², ..., 1/σ_n²)

so

  Σ_{i=1}^{n} (x_i − µ_i)²/σ_i² = (x − µ)ᵀ Σ⁻¹ (x − µ)

which makes the pdf for the random vector X

  f(x) = [1 / √(det(2πΣ))] exp( −(x − µ)ᵀ Σ⁻¹ (x − µ)/2 )

The generalization from n independent normal random variables to an n-dimensional multivariate normal distribution is to use the same matrix form for M(t) and just to replace Σ, which was a diagonal matrix with positive diagonal entries, with a general symmetric positive semi-definite matrix. So one change is to allow Σ to have off-diagonal entries, and another is to allow it to have zero eigenvalues. If Σ is positive definite, i.e., its eigenvalues are all positive, we can use the matrix expression for the pdf f(x) as well. As we'll see, if Σ has some zero eigenvalues, we won't be able to define a pdf for the random vector X.

5.3 Multivariate Distributions

Recall that a set of n random variables X_1, X_2, ..., X_n can be combined into a random vector X with expectation value

  E(X) = µ = (µ_1, µ_2, ..., µ_n)ᵀ

and variance-covariance matrix

  Σ = Cov(X) = E( [X − µ][X − µ]ᵀ )

whose diagonal elements are the variances Var(X_i) and whose off-diagonal elements are the covariances Cov(X_i, X_j). The variance-covariance matrix must be positive semi-definite, i.e., have no negative eigenvalues. To see why that is the case, let {λ_i} be the eigenvalues and {v_i} be the orthonormal eigenvectors, so that Σ = Σ_{i=1}^{n} λ_i v_i v_iᵀ. For each i define a random variable X̃_i = v_iᵀ X. It has mean E(X̃_i) = v_iᵀ µ and variance

  Var(X̃_i) = E( [v_iᵀ(X − µ)]² ) = E( v_iᵀ(X − µ)(X − µ)ᵀ v_i ) = v_iᵀ E( (X − µ)(X − µ)ᵀ ) v_i = v_iᵀ Σ v_i = λ_i v_iᵀ v_i = λ_i

Since the variance of a random variable must be non-negative, Σ cannot have any negative eigenvalues. Incidentally, we can see that the different random variables {X̃_i} are uncorrelated, since

  Cov(X̃_i, X̃_j) = E( v_iᵀ(X − µ)(X − µ)ᵀ v_j ) = v_iᵀ E( (X − µ)(X − µ)ᵀ ) v_j = v_iᵀ Σ v_j = λ_j v_iᵀ v_j = λ_i δ_ij

Remember that Cov(X̃_i, X̃_j) = 0 does not necessarily imply that X̃_i and X̃_j are independent. It will, however, turn out to be the case for normally distributed random variables. Note that if we assemble the {X̃_i} into a column vector,

  X̃ = (X̃_1, X̃_2, ..., X̃_n)ᵀ = Γ X

where the matrix Γ is made up out of the components of the orthonormal eigenvectors {v_i}: its rows are v_1ᵀ, v_2ᵀ, ..., v_nᵀ. This matrix is not symmetric, but it is orthogonal, meaning that ΓᵀΓ = 1. We can see this from

  (ΓΓᵀ)_ij = v_iᵀ v_j = δ_ij

so that ΓΓᵀ = 1 (and likewise ΓᵀΓ = 1). This matrix Γ can be thought of as a transformation from the original basis to the eigenbasis for Σ. One effect of this is that it diagonalizes Σ:

  ΓΣΓᵀ = diag(λ_1, λ_2, ..., λ_n) = Λ

Finally, recall that the moment generating function is defined as

  M(t) = E(e^(tᵀX))

and that if we define ψ(t) = ln M(t), then

  ∂ψ/∂t_i |_{t=0} = µ_i

and

  ∂²ψ/(∂t_i ∂t_j) |_{t=0} = Cov(X_i, X_j)

This means that if we do a Maclaurin expansion of ψ(t) we get, in general,

  ψ(t) = tᵀµ + tᵀΣt/2 + ···

where the terms indicated by ··· have three or more powers of t.

5.4 General Multivariate Normal Distribution

We define a multivariate normal random vector X as a random vector having the moment generating function

  M(t) = exp( tᵀµ + tᵀΣt/2 )

We refer to the distribution as N_n(µ, Σ). Note that this is equivalent to starting with the Maclaurin series for ψ(t) = ln M(t) and cutting it off after the quadratic term. We start with the mgf rather than the pdf because it applies whether the variance-covariance matrix Σ is positive definite or only positive semi-definite, i.e., whether or not it has one or more zero eigenvalues.

To see what happens if one or more of the eigenvalues is zero, we use the orthonormal eigenvectors {v_i} of Σ to combine the random variables in X into n uncorrelated random variables {X̃_i}, where X̃_i = v_iᵀX, which have means E(X̃_i) = v_iᵀµ and variances Var(X̃_i) = λ_i. If we combine the {X̃_i} into a random vector

  X̃ = Γ X

where Γ is the orthogonal matrix made up out of eigenvector components (its rows are the v_iᵀ), X̃ has mean E(X̃) = Γµ and variance-covariance matrix

  Cov(X̃) = Λ = ΓΣΓᵀ = diag(λ_1, λ_2, ..., λ_n)

The random vector X̃ also follows a multivariate normal distribution, in this case N_n(Γµ, Λ). To show this, we'll show the more general result, which appears as a theorem in Hogg: if X is a N_n(µ, Σ) multivariate normal random vector, A is an m × n constant matrix and b is an m-element column vector, then the random vector Y = AX + b also obeys a multivariate normal distribution. Note that this works whether m is equal to, less than, or greater than n! We prove this using the mgf. The mgf for Y is

  M_Y(t) = E[exp(tᵀY)] = E[exp(tᵀAX + tᵀb)] = e^(tᵀb) E[exp([Aᵀt]ᵀX)]

Now here is the key step, which Hogg doesn't elaborate on: t is an m-element column vector, A is an m × n matrix, so its transpose Aᵀ is an n × m matrix, and the combination Aᵀt is an n-element column vector, whose transpose is the n-element row vector

  [Aᵀt]ᵀ = tᵀA

Therefore, the expectation value in the last line above is just the mgf for the original multivariate normal random vector X

evaluated at the argument Aᵀt:

  E[exp([Aᵀt]ᵀX)] = M_X(Aᵀt) = exp( [Aᵀt]ᵀµ + [Aᵀt]ᵀΣ[Aᵀt]/2 ) = exp( tᵀAµ + tᵀAΣAᵀt/2 )

This makes the mgf for Y equal to e^(tᵀb) times this, or

  M_Y(t) = exp( tᵀ[Aµ + b] + tᵀAΣAᵀt/2 )

which is the mgf for a normal random vector with mean Aµ + b and variance-covariance matrix AΣAᵀ, i.e., one that obeys a N_m(Aµ + b, AΣAᵀ) distribution.

So, we return to the random vector X̃ = ΓX, which we now see is a multivariate normal random vector with mean Γµ and diagonal variance-covariance matrix Λ. Its mgf is

  M_X̃(t) = exp( tᵀΓµ + tᵀΛt/2 ) = exp( Σ_{i=1}^{n} [t_i v_iᵀµ + λ_i t_i²/2] ) = Π_{i=1}^{n} exp( t_i v_iᵀµ + λ_i t_i²/2 ) = Π_{i=1}^{n} M_X̃i(t_i)

which is the mgf of n independent random variables. There are two possibilities: either Σ (and thus Λ) is positive definite, which means all of the {λ_i} are positive, or one or more of the {λ_i} are zero. In the first case, we have the special case we considered before, n independent normally-distributed random variables, so the joint pdf is

  f_X̃(ξ) = Π_{i=1}^{n} f_X̃i(ξ_i) = [1 / √(det(2πΛ))] exp( −(ξ − Γµ)ᵀ Λ⁻¹ (ξ − Γµ)/2 )

We can then do a multivariate transformation to get the pdf for X = Γ⁻¹X̃ = ΓᵀX̃. The Jacobian of the transformation is Γᵀ, whose determinant is either 1 or −1, because

  1 = det(1) = det(ΓΓᵀ) = (det Γ)²

This is true for any orthogonal matrix. This means |det Γ| = 1, and the pdf for X is simply

  f_X(x) = f_X̃(Γx)

If we also note that det Λ = Π_{i=1}^{n} λ_i = det Σ, we see that

  f_X(x) = [1 / √(det(2πΛ))] exp( −(Γx − Γµ)ᵀ Λ⁻¹ (Γx − Γµ)/2 ) = [1 / √(det(2πΣ))] exp( −(x − µ)ᵀ ΓᵀΛ⁻¹Γ (x − µ)/2 )

But the combination ΓᵀΛ⁻¹Γ is just the inverse of Σ, because

  Λ = ΓΣΓᵀ   implies   Σ⁻¹ = (ΓᵀΛΓ)⁻¹ = ΓᵀΛ⁻¹Γ

so we find that, for arbitrary positive definite symmetric Σ,

  f_X(x) = [1 / √(det(2πΣ))] exp( −(x − µ)ᵀ Σ⁻¹ (x − µ)/2 )

which is exactly the generalization we expected from the n-independent-random-variable case.
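A short check (not from the notes) that the matrix formula just derived matches scipy's multivariate normal pdf; the two-dimensional µ, Σ, and evaluation point are made up for illustration.

  import numpy as np
  from scipy.stats import multivariate_normal

  mu = np.array([1.0, -2.0])
  Sigma = np.array([[2.0, 0.6],
                    [0.6, 1.0]])
  x = np.array([0.5, -1.0])

  # pdf from the formula exp(-(x-mu)^T Sigma^{-1} (x-mu)/2) / sqrt(det(2 pi Sigma))
  d = x - mu
  quad = d @ np.linalg.solve(Sigma, d)
  print(np.exp(-0.5 * quad) / np.sqrt(np.linalg.det(2 * np.pi * Sigma)))

  # scipy's pdf; the two numbers should agree
  print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))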

On the other hand, if Σ has one or more zero eigenvalues, so that det Σ = 0 and Σ⁻¹ is not defined, that pdf won't make sense. In that case, consider the random variable X̃_i corresponding to the zero eigenvalue λ_i = 0. Its mgf is

  E(e^(t_i X̃_i)) = M_X̃i(t_i) = exp( t_i v_iᵀµ + λ_i t_i²/2 ) = exp( t_i v_iᵀµ )

but the only way that is possible is if X̃_i is always equal to v_iᵀµ, i.e., X̃_i is actually a discrete random variable with pmf

  P(X̃_i = ξ_i) = 1 if ξ_i = v_iᵀµ,  0 otherwise

This is the limit of a normal distribution as its variance goes to zero.

Returning to the case where Σ is positive definite, so that each X̃_i is an independent N(v_iᵀµ, λ_i) random variable, we can construct the corresponding standard normal random variable

  Z_i = λ_i^(−1/2) (X̃_i − v_iᵀµ) = λ_i^(−1/2) v_iᵀ (X − µ)

which we could combine into a N_n(0, 1) random vector Z̃ = (Z_1, Z_2, ..., Z_n)ᵀ. However, it's actually more convenient to combine them into a different N_n(0, 1) random vector

  Z = Σ_{i=1}^{n} v_i Z_i = ΓᵀZ̃ = Σ_{i=1}^{n} v_i λ_i^(−1/2) v_iᵀ (X − µ) = Σ^(−1/2) (X − µ)

with pdf

  f_Z(z) = (2π)^(−n/2) e^(−zᵀz/2)

Hogg uses this as the starting point for deriving the pdf for the multivariate normal random vector X, with the factor of

  det(Σ^(−1/2)) = 1 / √(det Σ)

coming from the Jacobian determinant associated with the transformation.

Tuesday 29 October 2013
Read Section 3.6 of Hogg

5.5 Consequences of the Multivariate Normal Distribution

Recall that if X is a multivariate normal random vector, obeying a N_n(µ, Σ) distribution, this is equivalent to saying X has the mgf

  M_X(t) = exp( tᵀµ + tᵀΣt/2 )

Its mean is E(X) = µ and its variance-covariance matrix is Cov(X) = Σ. If the variance-covariance matrix is invertible, then X has the probability density function

  f_X(x) = [1 / √(det(2πΣ))] exp( −(x − µ)ᵀ Σ⁻¹ (x − µ)/2 )

5.5.1 Connection to χ² Distribution

We also showed that we could make a random vector

  Z = Σ^(−1/2) (X − µ)

which obeyed a N_n(0, 1) distribution, i.e., where the {Z_i} were independent standard normal random variables. Previously, we showed that if we took the sum of the squares of n independent standard normal random variables, the resulting random variable obeyed a chi-square distribution with n degrees of freedom. So in this case we can construct

  Y = Σ_{i=1}^{n} Z_i² = ZᵀZ = [Σ^(−1/2)(X − µ)]ᵀ [Σ^(−1/2)(X − µ)] = (X − µ)ᵀ Σ⁻¹ (X − µ)

and it will be a χ²(n) random variable. This is Theorem 3.5.4 of Hogg. Note that if Σ is diagonal, so that {X_i} are n independent random variables, with X_i being N(µ_i, σ_i²), the chi-square random variable reduces to

  Y = Σ_{i=1}^{n} [(X_i − µ_i)/σ_i]²   if Σ is diagonal
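Here is a small simulation sketch (not from the notes) of Theorem 3.5.4: for samples from a multivariate normal, the quadratic form (X − µ)ᵀΣ⁻¹(X − µ) should follow a χ² distribution with n degrees of freedom. The three-dimensional µ and Σ are made up for illustration.

  import numpy as np
  from scipy.stats import chi2, kstest, multivariate_normal

  rng = np.random.default_rng(3)
  mu = np.array([1.0, -2.0, 0.5])
  A = rng.standard_normal((3, 3))
  Sigma = A @ A.T + np.eye(3)          # a positive definite covariance matrix

  X = multivariate_normal(mean=mu, cov=Sigma).rvs(size=50_000, random_state=rng)
  d = X - mu
  Y = np.einsum('ij,jk,ik->i', d, np.linalg.inv(Sigma), d)

  print(kstest(Y, chi2(df=3).cdf))     # should be consistent with chi-square(3)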

5.6 Marginal Distribution

One last question to consider is, what if we split the n random variables in the random vector X into two groups: the first m, which we collect into a random vector X_1, and the last p = n − m, which we collect into a random vector X_2, so that X = (X_1, X_2)ᵀ, where X_1 = (X_1, ..., X_m)ᵀ and X_2 = (X_{m+1}, ..., X_n)ᵀ. We'd like to know what the marginal distributions for X_1 and X_2 are, and also the conditional distributions for X_1 given X_2 and for X_2 given X_1. As a bit of bookkeeping, we partition the mean vector µ = (µ_1, µ_2)ᵀ and the variance-covariance matrix

  Σ = [ Σ_11  Σ_12 ]
      [ Σ_21  Σ_22 ]

in the same way. µ_1 is an m-element column vector, µ_2 is a p-element column vector, Σ_11 is an m × m symmetric matrix, Σ_22 is a p × p symmetric matrix, Σ_12 is an m × p matrix (i.e., m rows and p columns), and Σ_21 is a p × m matrix. Since Σ is symmetric, we must have

  Σᵀ = [ Σ_11ᵀ  Σ_21ᵀ ]  =  Σ
       [ Σ_12ᵀ  Σ_22ᵀ ]

so Σ_21 = Σ_12ᵀ. Now, it's pretty easy to get the marginal distributions using the mgf. Partitioning t in the usual way, we see

  tᵀµ = t_1ᵀµ_1 + t_2ᵀµ_2

and

  tᵀΣt = t_1ᵀΣ_11 t_1 + t_1ᵀΣ_12 t_2 + t_2ᵀΣ_21 t_1 + t_2ᵀΣ_22 t_2 = t_1ᵀΣ_11 t_1 + 2 t_1ᵀΣ_12 t_2 + t_2ᵀΣ_22 t_2

where in the last step we've used the fact that since t_2ᵀΣ_21 t_1 is a 1 × 1 matrix, i.e., a number, it is its own transpose:

  t_2ᵀΣ_21 t_1 = (t_2ᵀΣ_21 t_1)ᵀ = t_1ᵀΣ_21ᵀ t_2 = t_1ᵀΣ_12 t_2

Thus the joint mgf is

  M(t) = M(t_1, t_2) = exp( t_1ᵀµ_1 + t_2ᵀµ_2 + [t_1ᵀΣ_11 t_1 + 2 t_1ᵀΣ_12 t_2 + t_2ᵀΣ_22 t_2]/2 )

The mgf for X_1 is thus

  M_1(t_1) = M(t_1, 0) = exp( t_1ᵀµ_1 + t_1ᵀΣ_11 t_1/2 )

which means X_1 is a N_m(µ_1, Σ_11) multivariate normal random vector, and likewise the mgf for X_2 is

  M_2(t_2) = M(0, t_2) = exp( t_2ᵀµ_2 + t_2ᵀΣ_22 t_2/2 )

so X_2 is N_p(µ_2, Σ_22). Hogg also shows this as a special case of the more general result that AX + b is a normal random vector, where A is an m × n constant matrix and b is an m-element constant column vector. We see that

  M_1(t_1) M_2(t_2) = exp( t_1ᵀµ_1 + t_2ᵀµ_2 + t_1ᵀΣ_11 t_1/2 + t_2ᵀΣ_22 t_2/2 )

which is only equal to M(t_1, t_2) if the cross-part Σ_12 of the variance-covariance matrix is zero. So X_1 and X_2 are independent if and only if Σ_12 = 0_{m×p}.

5.7 Conditional Distribution

Finally, we'd like to consider the conditional distribution of X_1 given that X_2 = x_2. Hogg proves that the distribution is a multivariate normal, specifically

  N_m( µ_1 + Σ_12 Σ_22⁻¹ [x_2 − µ_2],  Σ_11 − Σ_12 Σ_22⁻¹ Σ_21 )

by using the mgf, but their proof assumes you already know the form of the distribution. The proof assumes that Σ_22 is positive definite, and in particular that Σ_22⁻¹ exists. We might like to try to work this out rather than starting with the answer, and so we could divide the pdf

  f(x) = f(x_1, x_2) = [1 / √(det(2πΣ))] exp( −(x − µ)ᵀ Σ⁻¹ (x − µ)/2 )

by the marginal pdf

  f_2(x_2) = [1 / √(det(2πΣ_22))] exp( −(x_2 − µ_2)ᵀ Σ_22⁻¹ (x_2 − µ_2)/2 )

to get the conditional pdf f_{1|2}(x_1|x_2) = f(x_1, x_2)/f_2(x_2). Working out the details of this is a little nasty, though, since it requires a block decomposition of Σ⁻¹, which exists, but is kind of complicated. For example,

  [Σ⁻¹]_11 = [Σ_11 − Σ_12 Σ_22⁻¹ Σ_21]⁻¹

It's not hard to see that what comes out will be a Gaussian in x_1 minus something, though.

Still, for simplicity we can limit the explicit demonstration to the case where m = 1 and p = 1, so n = 2. Then all of the blocks in the decomposition of the 2 × 2 matrices are just numbers, specifically

  µ = (µ_1, µ_2)ᵀ

and

  Σ = [ Var(X_1)        Cov(X_1, X_2) ]  =  [ σ_1²      ρσ_1σ_2 ]
      [ Cov(X_1, X_2)   Var(X_2)      ]     [ ρσ_1σ_2   σ_2²    ]

Familiar results from matrix algebra tell us that

  det Σ = σ_1² σ_2² (1 − ρ²)

and

  Σ⁻¹ = [1 / (σ_1² σ_2² (1 − ρ²))] [ σ_2²       −ρσ_1σ_2 ]
                                   [ −ρσ_1σ_2   σ_1²     ]

We know that the correlation coefficient ρ obeys −1 ≤ ρ ≤ 1; in order for Σ to be positive definite, we require |ρ| < 1. You will show on the homework that the joint pdf for this bivariate normal distribution is

  f(x_1, x_2) = [1 / (2π σ_1 σ_2 √(1 − ρ²))]
                × exp( −[ (x_1 − µ_1)²/σ_1² − 2ρ(x_1 − µ_1)(x_2 − µ_2)/(σ_1σ_2) + (x_2 − µ_2)²/σ_2² ] / [2(1 − ρ²)] )

and since the marginal distribution for X_2 is

  f_2(x_2) = [1 / (σ_2 √(2π))] exp( −(x_2 − µ_2)²/(2σ_2²) )

the conditional pdf will be

  f_{1|2}(x_1|x_2) = f(x_1, x_2) / f_2(x_2)
                   = [1 / (σ_1 √(2π(1 − ρ²)))] exp( −[x_1 − µ_1 − ρ(σ_1/σ_2)(x_2 − µ_2)]² / [2σ_1²(1 − ρ²)] )

which is a normal distribution with mean µ_1 + ρ(σ_1/σ_2)(x_2 − µ_2) and variance σ_1²(1 − ρ²).
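A simulation sketch (not from the notes) of the bivariate conditional result: drawing many (X_1, X_2) pairs and keeping only those with X_2 close to a fixed x_2, the retained X_1 values should have mean µ_1 + ρ(σ_1/σ_2)(x_2 − µ_2) and variance σ_1²(1 − ρ²). All numerical values are arbitrary.

  import numpy as np

  rng = np.random.default_rng(4)
  mu1, mu2 = 1.0, -1.0
  s1, s2, rho = 2.0, 1.5, 0.7

  Sigma = np.array([[s1**2, rho*s1*s2],
                    [rho*s1*s2, s2**2]])
  X = rng.multivariate_normal([mu1, mu2], Sigma, size=500_000)

  x2 = 0.5
  sel = np.abs(X[:, 1] - x2) < 0.05          # approximate conditioning on X2 = x2
  cond = X[sel, 0]
  print(cond.mean(), mu1 + rho * (s1 / s2) * (x2 - mu2))   # conditional mean
  print(cond.var(), s1**2 * (1 - rho**2))                  # conditional variance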

Thursday 31 October 2013
Read Section 4.1 of Hogg

6 The t- and F Distributions

Our final distributions are two distributions related to the normal and chi-square distributions, which are very useful in statistical inference.

6.1 The t distribution

If Z is a standard normal random variable [N(0, 1)] and V is a chi-square random variable with r degrees of freedom [χ²(r)], and Z and V are independent, the combination

  T = Z / √(V/r)

is a random variable following a t-distribution with r degrees of freedom; its pdf is

  f_T(t) ∝ (1 + t²/r)^(−(r+1)/2),   −∞ < t < ∞

where we have not written the explicit normalization constant, in the interest of notational simplicity. This pdf can be derived by the change-of-variables formalism. If r is very large, the t-distribution is approximately the same as the standard normal distribution. As you'll show on the homework, if r = 1, it's the Cauchy distribution. Note that the moment generating function for the t-distribution doesn't exist, since for n ≥ r, the integral

  ∫_{−∞}^{∞} |t|^n (1 + t²/r)^(−(r+1)/2) dt

diverges, as its integrand becomes, up to a constant, t^(n−r−1) for large t.

6.2 The F distribution

If U is a χ²(r_1) random variable and V is a χ²(r_2) random variable, and U and V are independent, the combination

  F = (U/r_1) / (V/r_2)

obeys an F distribution. Again its pdf can be worked out by the change-of-variables method, and is, up to the normalization constant,

  f_F(x) ∝ x^(r_1/2 − 1) (1 + [r_1/r_2] x)^(−(r_1+r_2)/2),   0 < x < ∞

6.3 Student's Theorem

We can illustrate the usefulness of the t-distribution, also known as Student's t-distribution, by considering a result by William S. Gosset, writing under the pseudonym of "Student" while working in the Guinness brewery in Dublin. Suppose we have n iid normal random variables, each with mean µ and variance σ², i.e., X is a random vector with mean

  E(X) = (µ, µ, ..., µ)ᵀ = µ e

where we have defined the n-element column vector e of all ones, and variance-covariance matrix

  Cov(X) = σ² 1_{n×n}

We know from elementary statistics that the sample mean

  X̄ = (1/n) Σ_{i=1}^{n} X_i

and sample variance

  S² = [1/(n−1)] Σ_{i=1}^{n} (X_i − X̄)²

can be used to estimate the mean µ and variance σ², respectively, i.e., E(X̄) = µ and E(S²) = σ², and also that Var(X̄) = σ²/n. Student's theorem says that the following things are also true:

1. X̄ has a N(µ, σ²/n) distribution.
2. X̄ and S² are independent random variables.
3. (n − 1)S²/σ² has a χ²(n − 1) distribution.
4. The random variable T = (X̄ − µ)/(S/√n) has a t-distribution with n − 1 degrees of freedom.


More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Sampling Distributions

Sampling Distributions In statistics, a random sample is a collection of independent and identically distributed (iid) random variables, and a sampling distribution is the distribution of a function of random sample. For example,

More information

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Definitions Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

1 Solution to Problem 2.1

1 Solution to Problem 2.1 Solution to Problem 2. I incorrectly worked this exercise instead of 2.2, so I decided to include the solution anyway. a) We have X Y /3, which is a - function. It maps the interval, ) where X lives) onto

More information

Stat410 Probability and Statistics II (F16)

Stat410 Probability and Statistics II (F16) Stat4 Probability and Statistics II (F6 Exponential, Poisson and Gamma Suppose on average every /λ hours, a Stochastic train arrives at the Random station. Further we assume the waiting time between two

More information

3 Continuous Random Variables

3 Continuous Random Variables Jinguo Lian Math437 Notes January 15, 016 3 Continuous Random Variables Remember that discrete random variables can take only a countable number of possible values. On the other hand, a continuous random

More information

Probability and Distributions

Probability and Distributions Probability and Distributions What is a statistical model? A statistical model is a set of assumptions by which the hypothetical population distribution of data is inferred. It is typically postulated

More information

Things to remember when learning probability distributions:

Things to remember when learning probability distributions: SPECIAL DISTRIBUTIONS Some distributions are special because they are useful They include: Poisson, exponential, Normal (Gaussian), Gamma, geometric, negative binomial, Binomial and hypergeometric distributions

More information

Lecture 11. Multivariate Normal theory

Lecture 11. Multivariate Normal theory 10. Lecture 11. Multivariate Normal theory Lecture 11. Multivariate Normal theory 1 (1 1) 11. Multivariate Normal theory 11.1. Properties of means and covariances of vectors Properties of means and covariances

More information

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ). .8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics

More information

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu

Course: ESO-209 Home Work: 1 Instructor: Debasis Kundu Home Work: 1 1. Describe the sample space when a coin is tossed (a) once, (b) three times, (c) n times, (d) an infinite number of times. 2. A coin is tossed until for the first time the same result appear

More information

Lecture 2: Repetition of probability theory and statistics

Lecture 2: Repetition of probability theory and statistics Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:

More information

Brief Review of Probability

Brief Review of Probability Maura Department of Economics and Finance Università Tor Vergata Outline 1 Distribution Functions Quantiles and Modes of a Distribution 2 Example 3 Example 4 Distributions Outline Distribution Functions

More information

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows. Chapter 5 Two Random Variables In a practical engineering problem, there is almost always causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage

More information

Formulas for probability theory and linear models SF2941

Formulas for probability theory and linear models SF2941 Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms

More information

18.440: Lecture 28 Lectures Review

18.440: Lecture 28 Lectures Review 18.440: Lecture 28 Lectures 18-27 Review Scott Sheffield MIT Outline Outline It s the coins, stupid Much of what we have done in this course can be motivated by the i.i.d. sequence X i where each X i is

More information

Statistical Methods in Particle Physics

Statistical Methods in Particle Physics Statistical Methods in Particle Physics. Probability Distributions Prof. Dr. Klaus Reygers (lectures) Dr. Sebastian Neubert (tutorials) Heidelberg University WS 07/8 Gaussian g(x; µ, )= p exp (x µ) https://en.wikipedia.org/wiki/normal_distribution

More information

Discrete Distributions

Discrete Distributions Discrete Distributions STA 281 Fall 2011 1 Introduction Previously we defined a random variable to be an experiment with numerical outcomes. Often different random variables are related in that they have

More information

1: PROBABILITY REVIEW

1: PROBABILITY REVIEW 1: PROBABILITY REVIEW Marek Rutkowski School of Mathematics and Statistics University of Sydney Semester 2, 2016 M. Rutkowski (USydney) Slides 1: Probability Review 1 / 56 Outline We will review the following

More information

Class 26: review for final exam 18.05, Spring 2014

Class 26: review for final exam 18.05, Spring 2014 Probability Class 26: review for final eam 8.05, Spring 204 Counting Sets Inclusion-eclusion principle Rule of product (multiplication rule) Permutation and combinations Basics Outcome, sample space, event

More information

Probability Background

Probability Background CS76 Spring 0 Advanced Machine Learning robability Background Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu robability Meure A sample space Ω is the set of all possible outcomes. Elements ω Ω are called sample

More information

The Multivariate Gaussian Distribution

The Multivariate Gaussian Distribution The Multivariate Gaussian Distribution Chuong B. Do October, 8 A vector-valued random variable X = T X X n is said to have a multivariate normal or Gaussian) distribution with mean µ R n and covariance

More information

Continuous random variables

Continuous random variables Continuous random variables Continuous r.v. s take an uncountably infinite number of possible values. Examples: Heights of people Weights of apples Diameters of bolts Life lengths of light-bulbs We cannot

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 2 MATH00040 SEMESTER / Probability

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 2 MATH00040 SEMESTER / Probability ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 2 MATH00040 SEMESTER 2 2017/2018 DR. ANTHONY BROWN 5.1. Introduction to Probability. 5. Probability You are probably familiar with the elementary

More information

Distributions of Functions of Random Variables. 5.1 Functions of One Random Variable

Distributions of Functions of Random Variables. 5.1 Functions of One Random Variable Distributions of Functions of Random Variables 5.1 Functions of One Random Variable 5.2 Transformations of Two Random Variables 5.3 Several Random Variables 5.4 The Moment-Generating Function Technique

More information

Lecture 5: Moment generating functions

Lecture 5: Moment generating functions Lecture 5: Moment generating functions Definition 2.3.6. The moment generating function (mgf) of a random variable X is { x e tx f M X (t) = E(e tx X (x) if X has a pmf ) = etx f X (x)dx if X has a pdf

More information

Continuous Random Variables

Continuous Random Variables Continuous Random Variables Recall: For discrete random variables, only a finite or countably infinite number of possible values with positive probability. Often, there is interest in random variables

More information

T has many other desirable properties, and we will return to this example

T has many other desirable properties, and we will return to this example 2. Introduction to statistics: first examples 2.1. Introduction. The basic problem of statistics is to draw conclusions about unknown distributions of random variables from observed values. These conclusions

More information

This does not cover everything on the final. Look at the posted practice problems for other topics.

This does not cover everything on the final. Look at the posted practice problems for other topics. Class 7: Review Problems for Final Exam 8.5 Spring 7 This does not cover everything on the final. Look at the posted practice problems for other topics. To save time in class: set up, but do not carry

More information

Algorithms for Uncertainty Quantification

Algorithms for Uncertainty Quantification Algorithms for Uncertainty Quantification Tobias Neckel, Ionuț-Gabriel Farcaș Lehrstuhl Informatik V Summer Semester 2017 Lecture 2: Repetition of probability theory and statistics Example: coin flip Example

More information

Multivariate Gaussian Distribution. Auxiliary notes for Time Series Analysis SF2943. Spring 2013

Multivariate Gaussian Distribution. Auxiliary notes for Time Series Analysis SF2943. Spring 2013 Multivariate Gaussian Distribution Auxiliary notes for Time Series Analysis SF2943 Spring 203 Timo Koski Department of Mathematics KTH Royal Institute of Technology, Stockholm 2 Chapter Gaussian Vectors.

More information

The Multivariate Gaussian Distribution [DRAFT]

The Multivariate Gaussian Distribution [DRAFT] The Multivariate Gaussian Distribution DRAFT David S. Rosenberg Abstract This is a collection of a few key and standard results about multivariate Gaussian distributions. I have not included many proofs,

More information

Midterm Exam 1 Solution

Midterm Exam 1 Solution EECS 126 Probability and Random Processes University of California, Berkeley: Fall 2015 Kannan Ramchandran September 22, 2015 Midterm Exam 1 Solution Last name First name SID Name of student on your left:

More information

STAT 450. Moment Generating Functions

STAT 450. Moment Generating Functions STAT 450 Moment Generating Functions There are many uses of generating functions in mathematics. We often study the properties of a sequence a n of numbers by creating the function a n s n n0 In statistics

More information

Continuous Distributions

Continuous Distributions A normal distribution and other density functions involving exponential forms play the most important role in probability and statistics. They are related in a certain way, as summarized in a diagram later

More information

Common probability distributionsi Math 217 Probability and Statistics Prof. D. Joyce, Fall 2014

Common probability distributionsi Math 217 Probability and Statistics Prof. D. Joyce, Fall 2014 Introduction. ommon probability distributionsi Math 7 Probability and Statistics Prof. D. Joyce, Fall 04 I summarize here some of the more common distributions used in probability and statistics. Some

More information

18.440: Lecture 28 Lectures Review

18.440: Lecture 28 Lectures Review 18.440: Lecture 28 Lectures 17-27 Review Scott Sheffield MIT 1 Outline Continuous random variables Problems motivated by coin tossing Random variable properties 2 Outline Continuous random variables Problems

More information

t x 1 e t dt, and simplify the answer when possible (for example, when r is a positive even number). In particular, confirm that EX 4 = 3.

t x 1 e t dt, and simplify the answer when possible (for example, when r is a positive even number). In particular, confirm that EX 4 = 3. Mathematical Statistics: Homewor problems General guideline. While woring outside the classroom, use any help you want, including people, computer algebra systems, Internet, and solution manuals, but mae

More information

STAT2201. Analysis of Engineering & Scientific Data. Unit 3

STAT2201. Analysis of Engineering & Scientific Data. Unit 3 STAT2201 Analysis of Engineering & Scientific Data Unit 3 Slava Vaisman The University of Queensland School of Mathematics and Physics What we learned in Unit 2 (1) We defined a sample space of a random

More information

The Multivariate Normal Distribution. In this case according to our theorem

The Multivariate Normal Distribution. In this case according to our theorem The Multivariate Normal Distribution Defn: Z R 1 N(0, 1) iff f Z (z) = 1 2π e z2 /2. Defn: Z R p MV N p (0, I) if and only if Z = (Z 1,..., Z p ) T with the Z i independent and each Z i N(0, 1). In this

More information

We introduce methods that are useful in:

We introduce methods that are useful in: Instructor: Shengyu Zhang Content Derived Distributions Covariance and Correlation Conditional Expectation and Variance Revisited Transforms Sum of a Random Number of Independent Random Variables more

More information

Part 3: Parametric Models

Part 3: Parametric Models Part 3: Parametric Models Matthew Sperrin and Juhyun Park August 19, 2008 1 Introduction There are three main objectives to this section: 1. To introduce the concepts of probability and random variables.

More information

Random variables. DS GA 1002 Probability and Statistics for Data Science.

Random variables. DS GA 1002 Probability and Statistics for Data Science. Random variables DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Motivation Random variables model numerical quantities

More information

Statistics 3657 : Moment Generating Functions

Statistics 3657 : Moment Generating Functions Statistics 3657 : Moment Generating Functions A useful tool for studying sums of independent random variables is generating functions. course we consider moment generating functions. In this Definition

More information

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 8. For any two events E and F, P (E) = P (E F ) + P (E F c ). Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 Sample space. A sample space consists of a underlying

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

This exam is closed book and closed notes. (You will have access to a copy of the Table of Common Distributions given in the back of the text.

This exam is closed book and closed notes. (You will have access to a copy of the Table of Common Distributions given in the back of the text. TEST #3 STA 536 December, 00 Name: Please read the following directions. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directions This exam is closed book and closed notes. You will have access to a copy

More information

The Exciting Guide To Probability Distributions Part 2. Jamie Frost v1.1

The Exciting Guide To Probability Distributions Part 2. Jamie Frost v1.1 The Exciting Guide To Probability Distributions Part 2 Jamie Frost v. Contents Part 2 A revisit of the multinomial distribution The Dirichlet Distribution The Beta Distribution Conjugate Priors The Gamma

More information

Continuous Distributions

Continuous Distributions Chapter 3 Continuous Distributions 3.1 Continuous-Type Data In Chapter 2, we discuss random variables whose space S contains a countable number of outcomes (i.e. of discrete type). In Chapter 3, we study

More information

APPM/MATH 4/5520 Solutions to Problem Set Two. = 2 y = y 2. e 1 2 x2 1 = 1. (g 1

APPM/MATH 4/5520 Solutions to Problem Set Two. = 2 y = y 2. e 1 2 x2 1 = 1. (g 1 APPM/MATH 4/552 Solutions to Problem Set Two. Let Y X /X 2 and let Y 2 X 2. (We can select Y 2 to be anything but when dealing with a fraction for Y, it is usually convenient to set Y 2 to be the denominator.)

More information

1 Proof techniques. CS 224W Linear Algebra, Probability, and Proof Techniques

1 Proof techniques. CS 224W Linear Algebra, Probability, and Proof Techniques 1 Proof techniques Here we will learn to prove universal mathematical statements, like the square of any odd number is odd. It s easy enough to show that this is true in specific cases for example, 3 2

More information

Convergence in Distribution

Convergence in Distribution Convergence in Distribution Undergraduate version of central limit theorem: if X 1,..., X n are iid from a population with mean µ and standard deviation σ then n 1/2 ( X µ)/σ has approximately a normal

More information

Lecture 10: Powers of Matrices, Difference Equations

Lecture 10: Powers of Matrices, Difference Equations Lecture 10: Powers of Matrices, Difference Equations Difference Equations A difference equation, also sometimes called a recurrence equation is an equation that defines a sequence recursively, i.e. each

More information

IEOR 4701: Stochastic Models in Financial Engineering. Summer 2007, Professor Whitt. SOLUTIONS to Homework Assignment 9: Brownian motion

IEOR 4701: Stochastic Models in Financial Engineering. Summer 2007, Professor Whitt. SOLUTIONS to Homework Assignment 9: Brownian motion IEOR 471: Stochastic Models in Financial Engineering Summer 27, Professor Whitt SOLUTIONS to Homework Assignment 9: Brownian motion In Ross, read Sections 1.1-1.3 and 1.6. (The total required reading there

More information

Chapter 1 Review of Equations and Inequalities

Chapter 1 Review of Equations and Inequalities Chapter 1 Review of Equations and Inequalities Part I Review of Basic Equations Recall that an equation is an expression with an equal sign in the middle. Also recall that, if a question asks you to solve

More information

Contents 1. Contents

Contents 1. Contents Contents 1 Contents 6 Distributions of Functions of Random Variables 2 6.1 Transformation of Discrete r.v.s............. 3 6.2 Method of Distribution Functions............. 6 6.3 Method of Transformations................

More information

STAT/MATH 395 A - PROBABILITY II UW Winter Quarter Moment functions. x r p X (x) (1) E[X r ] = x r f X (x) dx (2) (x E[X]) r p X (x) (3)

STAT/MATH 395 A - PROBABILITY II UW Winter Quarter Moment functions. x r p X (x) (1) E[X r ] = x r f X (x) dx (2) (x E[X]) r p X (x) (3) STAT/MATH 395 A - PROBABILITY II UW Winter Quarter 07 Néhémy Lim Moment functions Moments of a random variable Definition.. Let X be a rrv on probability space (Ω, A, P). For a given r N, E[X r ], if it

More information

Nonparametric hypothesis tests and permutation tests

Nonparametric hypothesis tests and permutation tests Nonparametric hypothesis tests and permutation tests 1.7 & 2.3. Probability Generating Functions 3.8.3. Wilcoxon Signed Rank Test 3.8.2. Mann-Whitney Test Prof. Tesler Math 283 Fall 2018 Prof. Tesler Wilcoxon

More information

ACM 116: Lectures 3 4

ACM 116: Lectures 3 4 1 ACM 116: Lectures 3 4 Joint distributions The multivariate normal distribution Conditional distributions Independent random variables Conditional distributions and Monte Carlo: Rejection sampling Variance

More information

STA205 Probability: Week 8 R. Wolpert

STA205 Probability: Week 8 R. Wolpert INFINITE COIN-TOSS AND THE LAWS OF LARGE NUMBERS The traditional interpretation of the probability of an event E is its asymptotic frequency: the limit as n of the fraction of n repeated, similar, and

More information

Lecture 2: Review of Basic Probability Theory

Lecture 2: Review of Basic Probability Theory ECE 830 Fall 2010 Statistical Signal Processing instructor: R. Nowak, scribe: R. Nowak Lecture 2: Review of Basic Probability Theory Probabilistic models will be used throughout the course to represent

More information

Systems of Linear ODEs

Systems of Linear ODEs P a g e 1 Systems of Linear ODEs Systems of ordinary differential equations can be solved in much the same way as discrete dynamical systems if the differential equations are linear. We will focus here

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER /2018

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER /2018 ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER 1 2017/2018 DR. ANTHONY BROWN 1. Arithmetic and Algebra 1.1. Arithmetic of Numbers. While we have calculators and computers

More information

8 Laws of large numbers

8 Laws of large numbers 8 Laws of large numbers 8.1 Introduction We first start with the idea of standardizing a random variable. Let X be a random variable with mean µ and variance σ 2. Then Z = (X µ)/σ will be a random variable

More information

Chapter 7: Special Distributions

Chapter 7: Special Distributions This chater first resents some imortant distributions, and then develos the largesamle distribution theory which is crucial in estimation and statistical inference Discrete distributions The Bernoulli

More information

Chapter 2. Discrete Distributions

Chapter 2. Discrete Distributions Chapter. Discrete Distributions Objectives ˆ Basic Concepts & Epectations ˆ Binomial, Poisson, Geometric, Negative Binomial, and Hypergeometric Distributions ˆ Introduction to the Maimum Likelihood Estimation

More information

Notes on Random Vectors and Multivariate Normal

Notes on Random Vectors and Multivariate Normal MATH 590 Spring 06 Notes on Random Vectors and Multivariate Normal Properties of Random Vectors If X,, X n are random variables, then X = X,, X n ) is a random vector, with the cumulative distribution

More information

Probability (Devore Chapter Two)

Probability (Devore Chapter Two) Probability (Devore Chapter Two) 1016-345-01: Probability and Statistics for Engineers Fall 2012 Contents 0 Administrata 2 0.1 Outline....................................... 3 1 Axiomatic Probability 3

More information

4 Moment generating functions

4 Moment generating functions 4 Moment generating functions Moment generating functions (mgf) are a very powerful computational tool. They make certain computations much shorter. However, they are only a computational tool. The mgf

More information