Review of Discrete Probability (contd.)

Tyler Nicholson
6 years ago

Stat 504, Lecture 2

Overview of probability and inference

[Diagram: probability leads from the data generating process to the observed data; inference leads from the observed data back to the data generating process.]

The basic problem we study in probability: Given a data generating process, what are the properties of the outcomes?

The basic problem of statistical inference: Given the outcomes, what can we say about the process that generated the data? (ref: Wasserman (2004))

Bernoulli distribution

The most basic of all discrete random variables is the Bernoulli. $X$ is said to have a Bernoulli distribution if $X = 1$ occurs with probability $p$ and $X = 0$ occurs with probability $1 - p$,

$$f(x) = \begin{cases} p & x = 1 \\ 1 - p & x = 0 \\ 0 & \text{otherwise.} \end{cases}$$

Another common way to write it is

$$f(x) = p^x (1 - p)^{1 - x} \quad \text{for } x = 0, 1.$$

Suppose an experiment has only two possible outcomes, "success" and "failure," and let $p$ be the probability of a success. If we let $X$ denote the number of successes (either zero or one), then $X$ will be Bernoulli. The mean of a Bernoulli is

$$E(X) = 1(p) + 0(1 - p) = p,$$

and the variance of a Bernoulli is

$$V(X) = E(X^2) - (E(X))^2 = 1^2 p + 0^2 (1 - p) - p^2 = p(1 - p).$$
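As a quick numerical check of the Bernoulli mean and variance, the following minimal sketch computes both directly from the probability function. The function names and the value p = 0.3 are illustrative assumptions, not part of the lecture.

```python
def bernoulli_pmf(x, p):
    """f(x) = p^x (1 - p)^(1 - x) for x in {0, 1}, and 0 otherwise."""
    if x not in (0, 1):
        return 0.0
    return p**x * (1 - p)**(1 - x)


def bernoulli_mean_var(p):
    """Compute E(X) and V(X) = E(X^2) - (E(X))^2 by summing over x = 0, 1."""
    mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
    second_moment = sum(x**2 * bernoulli_pmf(x, p) for x in (0, 1))
    return mean, second_moment - mean**2


# p = 0.3 is an arbitrary illustrative value
mean, var = bernoulli_mean_var(0.3)
```

Here `mean` should agree with p and `var` with p(1 - p), up to floating-point rounding.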

Binomial distribution

Suppose that $X_1, X_2, \ldots, X_n$ are independent and identically distributed (iid) Bernoulli random variables, each having the distribution

$$f(x_i) = p^{x_i} (1 - p)^{1 - x_i} \quad \text{for } x_i = 0, 1.$$

Let $X = \sum_{i=1}^{n} X_i$. Then $X$ is said to have a binomial distribution with parameters $n$ and $p$,

$$X \sim \text{Bin}(n, p).$$

Suppose that an experiment consists of $n$ repeated Bernoulli-type trials, each trial resulting in a "success" with probability $p$ and a "failure" with probability $1 - p$. If all the trials are independent (that is, if the probability of success on any trial is unaffected by the outcome of any other trial), then the total number of successes in the experiment will have a binomial distribution. The binomial distribution can be written as

$$f(x) = \frac{n!}{x!\,(n - x)!}\, p^x (1 - p)^{n - x} \quad \text{for } x = 0, 1, 2, \ldots, n.$$
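The binomial formula can be evaluated directly with the standard library. This is a minimal sketch; the function name `binomial_pmf` and the values n = 5, p = 0.6 are assumptions chosen for illustration.

```python
from math import comb


def binomial_pmf(x, n, p):
    """f(x) = n! / (x! (n - x)!) * p^x * (1 - p)^(n - x) for x = 0, ..., n."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p)**(n - x)


# Probabilities of 0, 1, ..., 5 successes in n = 5 trials with p = 0.6
probs = [binomial_pmf(x, 5, 0.6) for x in range(6)]
total = sum(probs)  # a valid distribution sums to 1
```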

The Bernoulli distribution is a special case of the binomial with $n = 1$. That is, $X \sim \text{Bin}(1, p)$ means that $X$ has a Bernoulli distribution with success probability $p$.

One can show algebraically that if $X \sim \text{Bin}(n, p)$ then $E(X) = np$ and $V(X) = np(1 - p)$. An easier way to arrive at these results is to note that $X = X_1 + X_2 + \cdots + X_n$, where $X_1, X_2, \ldots, X_n$ are iid Bernoulli random variables. Then, by the additive properties of mean and variance (the latter relying on the independence of the $X_i$),

$$E(X) = E(X_1) + E(X_2) + \cdots + E(X_n) = np$$

and

$$V(X) = V(X_1) + V(X_2) + \cdots + V(X_n) = np(1 - p).$$
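The results E(X) = np and V(X) = np(1 - p) can also be confirmed numerically by summing over the binomial probabilities. A small sketch, with n = 10 and p = 0.3 chosen arbitrarily for illustration:

```python
from math import comb


def binomial_pmf(x, n, p):
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)


n, p = 10, 0.3  # illustrative values, not from the lecture

# Moments computed directly from the definition of expectation
mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
second_moment = sum(x**2 * binomial_pmf(x, n, p) for x in range(n + 1))
variance = second_moment - mean**2
```

Up to rounding, `mean` equals `n * p` and `variance` equals `n * p * (1 - p)`.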

Note that $X$ will not have a binomial distribution if the probability of success $p$ is not constant from trial to trial, or if the trials are not entirely independent (i.e. a success or failure on one trial alters the probability of success on another trial).

If $X_1 \sim \text{Bin}(n_1, p)$ and $X_2 \sim \text{Bin}(n_2, p)$ are independent, then $X_1 + X_2 \sim \text{Bin}(n_1 + n_2, p)$.

As $n$ increases, for fixed $p$, the binomial distribution approaches the normal distribution $N(np, np(1 - p))$.
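The additivity property of binomials with a common p can be checked by convolving the two probability functions. This sketch assumes the two variables are independent; the helper names and the values n1 = 3, n2 = 4, p = 0.25 are illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Binomial probability, returning 0 outside the support 0..n."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p)**(n - x)


def pmf_of_sum(k, n1, n2, p):
    """P(X1 + X2 = k) for independent X1 ~ Bin(n1, p), X2 ~ Bin(n2, p)."""
    return sum(
        binomial_pmf(j, n1, p) * binomial_pmf(k - j, n2, p)
        for j in range(n1 + 1)
    )


n1, n2, p = 3, 4, 0.25  # illustrative values

# Largest discrepancy between the convolution and Bin(n1 + n2, p)
max_gap = max(
    abs(pmf_of_sum(k, n1, n2, p) - binomial_pmf(k, n1 + n2, p))
    for k in range(n1 + n2 + 1)
)
```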

Poisson distribution

The Poisson is a limiting case of the binomial. Suppose that $X \sim \text{Bin}(n, p)$ and let $n \to \infty$ and $p \to 0$ in such a way that $np \to \lambda$, where $\lambda$ is a constant. Then, in the limit, $X$ will have a Poisson distribution with parameter $\lambda$. The notation $X \sim P(\lambda)$ will mean "$X$ has a Poisson distribution with parameter $\lambda$." The Poisson probability distribution is

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \ldots$$

The mean and the variance of the Poisson are both $\lambda$; that is, $E(X) = V(X) = \lambda$. Note that the parameter $\lambda$ must always be positive; negative values are not allowed.

Because the Poisson is the limit of the $\text{Bin}(n, p)$, it is useful as an approximation to the binomial when $n$ is large and $p$ is small. That is, if $n$ is large and $p$ is small, then

$$\frac{n!}{x!\,(n - x)!}\, p^x (1 - p)^{n - x} \approx \frac{\lambda^x e^{-\lambda}}{x!} \qquad (1)$$

where $\lambda = np$. The right-hand side of (1) is typically less tedious and easier to calculate than the left-hand side.
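To see how close the two sides of approximation (1) can be, the sketch below compares them for a large n and a small p. The values n = 1000 and p = 0.002 (so that λ = 2) are assumptions chosen for illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Left-hand side of (1)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)


def poisson_pmf(x, lam):
    """Right-hand side of (1)."""
    return lam**x * exp(-lam) / factorial(x)


n, p = 1000, 0.002  # large n, small p (illustrative)
lam = n * p

# Absolute difference between the two sides for x = 0, ..., 5
gaps = [abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(6)]
```

For these values every entry of `gaps` is below 0.001; the agreement would be expected to worsen as p grows with n held fixed.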

Aside from its use as an approximation to the binomial, the Poisson distribution is also an important probability model in its own right. It is often used to model discrete events occurring in time or in space.

For example, suppose that $X$ is the number of telephone calls arriving at a switchboard in one hour. Suppose that in the long run, the average number of telephone calls per hour is $\lambda$. Then it may be reasonable to assume $X \sim P(\lambda)$. For the Poisson model to hold, however, the average arrival rate $\lambda$ must be fairly constant over time; that is, there should be no systematic or predictable changes in the arrival rate. Moreover, the arrivals should be independent of one another; that is, the arrival of one call should not make the arrival of another call more or less likely.

Likelihood function

One of the most fundamental concepts of modern statistics is that of likelihood. In each of the discrete random variables we have considered thus far, the distribution depends on one or more parameters that are, in most statistical applications, unknown. In the Poisson distribution, the parameter is $\lambda$. In the binomial, the parameter of interest is $p$ (since $n$ is typically fixed and known).

Likelihood is a tool for summarizing the data's evidence about parameters. Let us denote the unknown parameter(s) of a distribution generically by $\theta$. Since the probability distribution depends on $\theta$, we can make this dependence explicit by writing $f(x)$ as $f(x; \theta)$. For example, in the Bernoulli distribution the parameter is $\theta = p$, and the distribution is

$$f(x; p) = p^x (1 - p)^{1 - x}, \quad x = 0, 1. \qquad (2)$$

Once a value of $X$ has been observed, we can plug this observed value $x$ into $f(x; p)$ and obtain a function of $p$ only. For example, if we observe $X = 1$, then plugging $x = 1$ into (2) gives the function $p$. If we observe $X = 0$, the function becomes $1 - p$.

Whatever function of the parameter results when we plug the observed data $x$ into $f(x; \theta)$ is called the likelihood function. We will write the likelihood function as

$$L(\theta; x) = \prod_{i=1}^{n} f(x_i; \theta)$$

or sometimes just $L(\theta)$.

Algebraically, the likelihood $L(\theta; x)$ is just the same as the distribution $f(x; \theta)$, but its meaning is quite different because it is regarded as a function of $\theta$ rather than a function of $x$. Consequently, a graph of the likelihood usually looks very different from a graph of the probability distribution.

For example, suppose that $X$ has a Bernoulli distribution with unknown parameter $p$. We can graph the probability distribution for any fixed value of $p$. For example, if $p = .5$ we get this:

[Figure: plot of $f(x)$ against $x$, with spikes of height .50 at $x = 0$ and $x = 1$.]

Now suppose that we observe a value of $X$, say $X = 1$. Plugging $x = 1$ into the distribution $p^x (1 - p)^{1 - x}$ gives the likelihood function $L(p; x) = p$, which looks like this:

[Figure: plot of $L(p; x)$ against $p$, a straight line rising from 0 at $p = 0$ to 1.0 at $p = 1$.]

For discrete random variables, a graph of the probability distribution $f(x; \theta)$ has spikes at specific values of $x$, whereas a graph of the likelihood $L(\theta; x)$ is a continuous curve (e.g. a line) over the parameter space, the domain of possible values for $\theta$.

$L(\theta; x)$ summarizes the evidence about $\theta$ contained in the event $X = x$. $L(\theta; x)$ is high for values of $\theta$ that make $X = x$ likely, and small for values of $\theta$ that make $X = x$ unlikely. In the Bernoulli example, observing $X = 1$ gives some (albeit weak) evidence that $p$ is nearer to 1 than to 0, so the likelihood for $x = 1$ rises as $p$ moves from 0 to 1.
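The shift from "function of x" to "function of p" can be made concrete by holding the observed x fixed and evaluating the Bernoulli distribution over a grid of p values. This is a minimal sketch; the function name and the grid spacing are illustrative assumptions.

```python
def bernoulli_likelihood(p, x):
    """L(p; x) = p^x (1 - p)^(1 - x), viewed as a function of p for fixed x."""
    return p**x * (1 - p)**(1 - x)


# Parameter space for p, sampled at 0.0, 0.1, ..., 1.0
grid = [i / 10 for i in range(11)]

lik_x1 = [bernoulli_likelihood(p, 1) for p in grid]  # observed X = 1: L(p) = p
lik_x0 = [bernoulli_likelihood(p, 0) for p in grid]  # observed X = 0: L(p) = 1 - p
```

With x = 1 the values rise along the grid, and with x = 0 they fall, matching the two likelihood functions described above.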

Maximum-likelihood (ML) estimation

Suppose that an experiment consists of $n = 5$ independent Bernoulli trials, each having probability of success $p$. Let $X$ be the total number of successes in the trials, so that $X \sim \text{Bin}(5, p)$. If the outcome is $X = 3$, the likelihood is

$$L(p; x) = \frac{n!}{x!\,(n - x)!}\, p^x (1 - p)^{n - x} = \frac{5!}{3!\,(5 - 3)!}\, p^3 (1 - p)^{5 - 3} \propto p^3 (1 - p)^2,$$

where the constant at the beginning is ignored. A graph of $L(p; x) = p^3 (1 - p)^2$ over the unit interval $p \in (0, 1)$ looks like this:

[Figure: graph of $L(p; x) = p^3 (1 - p)^2$ over $p \in (0, 1)$.]

It's interesting that this function reaches its maximum value at $p = .6$. An intelligent person would have said that if we observe 3 successes in 5 trials, a reasonable estimate of the long-run proportion of successes $p$ would be $3/5 = .6$.

This example suggests that it may be reasonable to estimate an unknown parameter $\theta$ by the value for which the likelihood function $L(\theta; x)$ is largest. This approach is called maximum-likelihood (ML) estimation. We will denote the value of $\theta$ that maximizes the likelihood function by $\hat{\theta}$, read "theta hat." $\hat{\theta}$ is called the maximum-likelihood estimate (MLE) of $\theta$.
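The location of the maximum can be confirmed with a simple grid search over the unit interval. This is only a sketch of the idea, not the method the course uses; the grid of 999 points is an arbitrary choice.

```python
def likelihood(p, x=3, n=5):
    """L(p; x) = p^x (1 - p)^(n - x), with the constant factor dropped."""
    return p**x * (1 - p)**(n - x)


# Candidate values 0.001, 0.002, ..., 0.999 in the open interval (0, 1)
grid = [i / 1000 for i in range(1, 1000)]

# The grid point with the largest likelihood
p_hat = max(grid, key=likelihood)
```

For x = 3 successes in n = 5 trials, `p_hat` lands on the grid point 0.6.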

Finding MLE's usually involves techniques of differential calculus. To maximize $L(\theta; x)$ with respect to $\theta$:

first calculate the derivative of $L(\theta; x)$ with respect to $\theta$,
set the derivative equal to zero, and
solve the resulting equation for $\theta$.

These computations can often be simplified by maximizing the loglikelihood function,

$$l(\theta; x) = \log L(\theta; x),$$

where "log" means natural log (logarithm to the base $e$). Because the natural log is an increasing function, maximizing the loglikelihood is the same as maximizing the likelihood. The loglikelihood often has a much simpler form than the likelihood and is usually easier to differentiate.
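For the earlier example of 3 successes in 5 trials, the loglikelihood (up to a constant) is l(p) = x log p + (n - x) log(1 - p), and its derivative is x/p - (n - x)/(1 - p). The sketch below solves "derivative equals zero" numerically by bisection. The function names, the bracketing interval, and the tolerance are illustrative assumptions.

```python
from math import log


def loglik(p, x=3, n=5):
    """l(p; x) = x log p + (n - x) log(1 - p), constant term omitted."""
    return x * log(p) + (n - x) * log(1 - p)


def score(p, x=3, n=5):
    """Derivative of the loglikelihood with respect to p."""
    return x / p - (n - x) / (1 - p)


def solve_score(lo=1e-6, hi=1 - 1e-6, tol=1e-12):
    """Bisection: the derivative is positive left of the root, negative right."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


p_hat = solve_score()
```

The root found this way agrees with the maximizer of the likelihood itself, since the log is an increasing function.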

In Stat 504 you will not be asked to derive MLE's by yourself. In most of the probability models that we will use later in the course (logistic regression, loglinear models, etc.) no explicit formulas for MLE's are available, and we will have to rely on computer packages to calculate the MLE's for us. For the simple probability models we have seen thus far, however, explicit formulas for MLE's are available and are given next.

ML for Bernoulli trials.

If our experiment is a single Bernoulli trial and we observe $X = 1$ (success) then the likelihood function is $L(p; x) = p$. This function reaches its maximum at $\hat{p} = 1$. If we observe $X = 0$ (failure) then the likelihood is $L(p; x) = 1 - p$, which reaches its maximum at $\hat{p} = 0$. Of course, it is somewhat silly for us to try to make formal inferences about $\theta$ on the basis of a single Bernoulli trial; usually multiple trials are available.

Suppose that $X = (X_1, X_2, \ldots, X_n)$ represents the outcomes of $n$ independent Bernoulli trials, each with success probability $p$. The likelihood for $p$ based on $X$ is defined as the joint probability distribution of $X_1, X_2, \ldots, X_n$. Since $X_1, X_2, \ldots, X_n$ are iid random variables, the joint distribution is

$$L(p; x) = f(x; p) = \prod_{i=1}^{n} f(x_i; p) = \prod_{i=1}^{n} p^{x_i} (1 - p)^{1 - x_i} = p^{\sum_{i=1}^{n} x_i} (1 - p)^{n - \sum_{i=1}^{n} x_i}.$$

Differentiating the log of $L(p; x)$ with respect to $p$ and setting the derivative to zero shows that this function achieves a maximum at $\hat{p} = \sum_{i=1}^{n} x_i / n$. Since $\sum_{i=1}^{n} x_i$ is the total number of successes observed in the $n$ trials, $\hat{p}$ is the observed proportion of successes in the $n$ trials. We often call $\hat{p}$ the sample proportion to distinguish it from $p$, the "true" or "population" proportion.

For repeated Bernoulli trials, the MLE $\hat{p}$ is the sample proportion of successes.

ML for Binomial.

Suppose that $X$ is an observation from a binomial distribution, $X \sim \text{Bin}(n, p)$, where $n$ is known and $p$ is to be estimated. The likelihood function is

$$L(p; x) = \frac{n!}{x!\,(n - x)!}\, p^x (1 - p)^{n - x},$$

which, except for the factor $n!/(x!\,(n - x)!)$, is identical to the likelihood from $n$ independent Bernoulli trials with $x = \sum_{i=1}^{n} x_i$. But since the likelihood function is regarded as a function only of the parameter $p$, the factor $n!/(x!\,(n - x)!)$ is a fixed constant and does not affect the MLE. Thus the MLE is again $\hat{p} = x/n$, the sample proportion of successes.
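The claim that the two likelihoods differ only by a constant factor can be checked by taking their ratio at several values of p. In this sketch the trial outcomes in `xs` are hypothetical data made up for illustration, and the function names are assumptions.

```python
from math import comb, prod


def bernoulli_joint_likelihood(p, xs):
    """Product of p^x_i (1 - p)^(1 - x_i) over the individual trials."""
    return prod(p**x * (1 - p)**(1 - x) for x in xs)


def binomial_likelihood(p, x, n):
    """Binomial likelihood, including the factor n! / (x! (n - x)!)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)


xs = [1, 0, 1, 1, 0]  # hypothetical outcomes of 5 Bernoulli trials
n, x = len(xs), sum(xs)

# The ratio should not depend on p; it should equal n! / (x! (n - x)!)
ratios = [
    binomial_likelihood(p, x, n) / bernoulli_joint_likelihood(p, xs)
    for p in (0.2, 0.5, 0.8)
]
```

Because the ratio is the same at every p, both likelihoods peak at the same place, which is why the two MLE's coincide.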

The fact that the MLE based on $n$ independent Bernoulli random variables and the MLE based on a single binomial random variable are the same is not surprising, since the binomial is the result of $n$ independent Bernoulli trials anyway. In general, whenever we have repeated, independent Bernoulli trials with the same probability of success $p$ for each trial, the MLE will always be the sample proportion of successes. This is true regardless of whether we know the outcomes of the individual trials $X_1, X_2, \ldots, X_n$, or just the total number of successes for all trials $X = \sum_{i=1}^{n} X_i$.

Suppose now that we have a sample of iid binomial random variables. For example, suppose that $X_1, X_2, \ldots, X_{10}$ are an iid sample from a binomial distribution with $n = 5$ and $p$ unknown. Since each $X_i$ is actually the total number of successes in 5 independent Bernoulli trials, and since the $X_i$'s are independent of one another, their sum $X = \sum_{i=1}^{10} X_i$ is actually the total number of successes in 50 independent Bernoulli trials. Thus $X \sim \text{Bin}(50, p)$ and the MLE is $\hat{p} = x/50$, the observed proportion of successes across all 50 trials.

Whenever we have independent binomial random variables with a common $p$, we can always add them together to get a single binomial random variable. Adding the binomial random variables together produces no loss of information about $p$ if the model is true. But collapsing the data in this way may limit our ability to diagnose model failure, i.e. to check whether the binomial model is really appropriate.
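As an illustration of pooling, the sketch below maximizes the joint loglikelihood of ten Bin(5, p) observations over a grid and compares the result with the pooled proportion. The values in `counts` are hypothetical data invented for this example, and the grid search stands in for the calculus argument.

```python
from math import comb, log

counts = [3, 2, 4, 1, 3, 2, 5, 3, 2, 4]  # hypothetical draws from Bin(5, p)
n_per_obs = 5


def joint_loglik(p):
    """Sum of the binomial loglikelihoods of the individual observations."""
    return sum(
        log(comb(n_per_obs, x)) + x * log(p) + (n_per_obs - x) * log(1 - p)
        for x in counts
    )


# Pooled estimate: total successes over total trials (here 50)
p_hat = sum(counts) / (n_per_obs * len(counts))

# Grid maximizer of the joint loglikelihood over (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=joint_loglik)
```

The grid maximizer matches the pooled proportion, consistent with the statement that summing the observations loses no information about p when the model holds.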

ML for Poisson.

Suppose that $X = (X_1, X_2, \ldots, X_n)$ are iid observations from a Poisson distribution with unknown parameter $\lambda$. The likelihood function is

$$L(\lambda; x) = \prod_{i=1}^{n} f(x_i; \lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} = \frac{\lambda^{\sum_{i=1}^{n} x_i}\, e^{-n\lambda}}{x_1!\, x_2! \cdots x_n!}.$$

By differentiating the log of this function with respect to $\lambda$, one can show that the maximum is achieved at $\hat{\lambda} = \sum_{i=1}^{n} x_i / n$. Thus, for a Poisson sample, the MLE for $\lambda$ is just the sample mean.

Next: What happens to the loglikelihood as n gets large
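The Poisson result can be checked in the same numerical spirit as the binomial examples: evaluate the loglikelihood on a grid of λ values and compare the maximizer with the sample mean. The `counts` below are hypothetical data made up for illustration, and the grid range (0.01 to 10.00) is an assumption.

```python
from math import factorial, log

counts = [2, 0, 3, 1, 4, 2, 1, 3]  # hypothetical Poisson counts


def poisson_loglik(lam):
    """log L(lambda; x) = sum of [x_i log(lambda) - lambda - log(x_i!)]."""
    return sum(x * log(lam) - lam - log(factorial(x)) for x in counts)


# Closed-form MLE: the sample mean
lam_hat = sum(counts) / len(counts)

# Grid maximizer over lambda = 0.01, 0.02, ..., 10.00
grid = [i / 100 for i in range(1, 1001)]
lam_grid = max(grid, key=poisson_loglik)
```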

Loglikelihood and Confidence Intervals

The loglikelihood function is defined to be the natural logarithm of the likelihood function, $l(\theta; x) = \log L(\theta; x)$. For a variety of reasons,