A Primer on Statistical Inference using Maximum Likelihood
November 3, 2017

1 Inference via Maximum Likelihood

Statistical inference is the process of using observed data to estimate features of the population. In terms of distributions, statistical inference is using observed data to estimate parameters of the corresponding distribution. For example, if we assume that observed data $Y_1, \dots, Y_n$ follow a $N(\mu, \sigma^2)$ distribution, then we would need to estimate the mean $\mu$ and variance $\sigma^2$ using the data $Y_1, \dots, Y_n$. There are various ways of performing inference, including method of moments estimation, generalized estimating equations, and Bayesian inference. Here, however, we are going to focus on maximum likelihood estimation, a very common form of estimation (inference) based on maximizing probabilities.

1.1 Intuition for Maximum Likelihood

Before we get to the specifics of maximum likelihood estimation, let's start off with a simple example to build intuition for the technique. Suppose we have two six-sided dice in a black box that are identical in all ways except their probabilities. Specifically, the probabilities associated with each die are as follows:

Outcome   1     2     3     4     5     6
Die #1   1/6   1/6   1/6   1/6   1/6   1/6
Die #2   1/3   1/6   1/6   1/3    0     0

Now, pretend that we pull out one of the dice, and our goal is to figure out (just by rolling it, or "observing random outcomes") which die we have. Let $Y_i$ be the observation associated with the $i$th roll of the die. Notice that we can think of $Y_i$ as a random variable with a distribution given by one of the rows in the table. The trouble is we don't know which row corresponds to the distribution of $Y_i$, so we need to roll the die $n$ times and observe $Y_1, \dots, Y_n$ to try and figure it out. In other words, after rolling the die $n$ times, we are going to use $Y_1, \dots, Y_n$ to infer which die we have. So, let's start rolling the die and try to make inferences.
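The computation we are about to do by hand can be sketched in a few lines of code: for any sequence of rolls, multiply the per-roll probabilities under each die and compare. This is a minimal sketch; the probability tables come from the table above, and the example rolls are the ones used in the walkthrough.

```python
# Compare the probability of observed rolls under each die; the die that
# gives the larger probability is the one we infer we are holding.
from math import prod

dice = {
    "Die #1": {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6},
    "Die #2": {1: 1/3, 2: 1/6, 3: 1/6, 4: 1/3, 5: 0.0, 6: 0.0},
}

def likelihood(rolls, probs):
    # Independence of rolls lets us multiply the per-roll probabilities.
    return prod(probs[y] for y in rolls)

rolls = [3, 1, 6]  # example observed data Y1, Y2, Y3
for name, probs in dice.items():
    print(name, likelihood(rolls, probs))
# Die #1 gives 1/216 while Die #2 gives 0, so we would infer Die #1.
```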
On our first roll we get $Y_1 = 3$: which die do we think it is? Well, let's calculate the probability of $Y_1 = 3$ under both die scenarios:

$$\Pr(Y_1 = 3 \mid \text{Die \#1}) = 1/6$$
$$\Pr(Y_1 = 3 \mid \text{Die \#2}) = 1/6$$

so, under either die, the probability is the same. After $n = 1$ roll we really can't infer which die we have, so let's roll again. On our second roll we get $Y_2 = 1$: which die do we think it is? Again, let's calculate the probabilities under each die:

$$\Pr(Y_1 = 3, Y_2 = 1 \mid \text{Die \#1}) = 1/6 \times 1/6 = 1/36$$
$$\Pr(Y_1 = 3, Y_2 = 1 \mid \text{Die \#2}) = 1/6 \times 1/3 = 1/18$$

where the multiplication comes from assuming independence of the rolls (which is quite reasonable in this case). Notice that observing $Y_1 = 3$ and $Y_2 = 1$ is more likely under Die #2, so at this point we infer that we have Die #2. In other words, we choose the die that maximizes the probability of our data. Just to be confident in our choice of Die #2, we decide to roll again and get $Y_3 = 6$. Now which die do we think we have? The answer is obvious when calculating the probabilities:

$$\Pr(Y_1 = 3, Y_2 = 1, Y_3 = 6 \mid \text{Die \#1}) = 1/6 \times 1/6 \times 1/6 = 1/216$$
$$\Pr(Y_1 = 3, Y_2 = 1, Y_3 = 6 \mid \text{Die \#2}) = 1/6 \times 1/3 \times 0 = 0.$$

At this point we know for sure which die we have: we have Die #1, because we could never roll a 6 with Die #2. In this little example, we used our observed data $Y_1, Y_2, Y_3$ to make an inference about which die we have. Specifically, we made the inference that maximized the probability of our observed data. This is what is referred to as maximum likelihood estimation.

1.2 Univariate Maximum Likelihood Estimation

Suppose we observe $n$ data points $Y_1, \dots, Y_n$. Statistical inference for the population associated with $Y_1, \dots, Y_n$ proceeds as follows: (i) explore the data to determine what distribution shape is appropriate for $Y_1, \dots, Y_n$; (ii) after determining a shape (e.g.
normal), determine which parameters of the distribution are unknown; (iii) calculate the joint probability of $Y_1, \dots, Y_n$ under this distribution; then (iv) infer the unknown parameters by maximizing the joint probability of the observed data. To get into the details a bit, let's walk through a real example together. Suppose I am trying to find out the success rate of a flu vaccine in preventing the flu. In other words, I am trying to figure out the probability of contracting the flu given that the person received the vaccine. Notationally, let $p$ represent the probability of contracting the flu if a person receives the vaccine. Since our goal is to infer $p$, I give the flu vaccine to 100 people and find that 7 of them still got the flu. Let $Y_i$ be the outcome for person $i$, where $Y_i = 1$ if person $i$ got the flu and $Y_i = 0$ if person $i$ didn't get the
flu. From the 100 people, I observed $Y_1 = \cdots = Y_7 = 1$ and $Y_8 = \cdots = Y_{100} = 0$. Let's perform statistical inference for $p$.

Step #1 is to figure out an appropriate distribution for $Y_i$. In this case, the Bernoulli distribution corresponds to binary (0/1) outcomes, so let's use that. Under the Bernoulli distribution,

$$\Pr(Y_i \mid p) = p^{Y_i}(1 - p)^{1 - Y_i}$$

so that $\Pr(Y_i = 1) = p$ and $\Pr(Y_i = 0) = 1 - p$, leaving us with $p$ as the probability of contracting the flu when a person receives the vaccine.

Step #2 is to figure out the unknown parameters associated with my chosen distribution. In this case, the only unknown parameter is $p$ itself.

Step #3 is to calculate the joint distribution of $Y_1, \dots, Y_n$. To do this, let's assume independence of events so that

$$\Pr(Y_1, \dots, Y_n \mid p) = \prod_{i=1}^{n} p^{Y_i}(1 - p)^{1 - Y_i}$$
$$= \left(p^{Y_1} p^{Y_2} \cdots p^{Y_n}\right) \times \left((1-p)^{1-Y_1}(1-p)^{1-Y_2} \cdots (1-p)^{1-Y_n}\right)$$
$$= p^{\sum_{i=1}^{n} Y_i}\, (1-p)^{\sum_{i=1}^{n}(1 - Y_i)} = p^{\sum_{i=1}^{n} Y_i}\, (1-p)^{\,n - \sum_{i=1}^{n} Y_i}.$$

Taking a look at this last form, notice that $\sum_{i=1}^{n} Y_i$ is just the number of people who got the flu (take a minute to convince yourself of this if you don't see it). That means that $n - \sum_{i=1}^{n} Y_i$ is the number of people who didn't get the flu. Notationally, let $n_{\text{flu}}$ be the number of people who got the flu, so that $n - n_{\text{flu}}$ is the number who didn't. We can rewrite the last equation above as

$$\Pr(Y_1, \dots, Y_n \mid p) = p^{n_{\text{flu}}}(1 - p)^{\,n - n_{\text{flu}}} \qquad (1)$$

Step #4 is to choose the $p$ that maximizes the joint probability we just calculated (this is equivalent to choosing the die that maximized the joint probability of the die outcomes above). The second we go to do this, notice that we are no longer thinking of Equation (1) as a function of the $Y_i$ (which we did when we calculated it). Rather, we are considering Equation (1) as a function of the parameter $p$. For this reason we call (1) the likelihood of $p$ (which is the same thing as the probability of $Y_1, \dots, Y_n$), because we are thinking of it as a function of $p$ (not the $Y_i$). Hence, this is where we get the name maximum likelihood estimation.
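Before maximizing with calculus, it can help to see the likelihood as a concrete function of $p$. The sketch below evaluates Equation (1) for the flu data ($n = 100$, $n_{\text{flu}} = 7$) over a grid of $p$ values and reports the grid point with the largest likelihood; the grid resolution is an arbitrary choice for illustration.

```python
# Evaluate the Bernoulli likelihood, Equation (1), on a grid of p values
# for the flu data and find the grid point where it is largest.
n, n_flu = 100, 7

def likelihood(p):
    # Equation (1): p^n_flu * (1 - p)^(n - n_flu)
    return p**n_flu * (1 - p)**(n - n_flu)

grid = [i / 1000 for i in range(1, 1000)]  # p = 0.001, 0.002, ..., 0.999
p_best = max(grid, key=likelihood)
print(p_best)  # 0.07, i.e. n_flu / n
```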
To maximize the likelihood, we need to take the derivative of (1) with respect to $p$. This derivative would be quite ugly, so let's do something simpler using a calculus trick we remember from high school (or, if you don't remember, let me remind you). The trick is to maximize the logarithm of (1) rather than the original function. Because the natural logarithm is an increasing function, the $p$ that maximizes the log of (1) is exactly the $p$ that maximizes (1) itself. The natural log of (1) is

$$L(p) = n_{\text{flu}} \ln(p) + (n - n_{\text{flu}}) \ln(1 - p) \qquad (2)$$

where I am using $L$ to denote the log-likelihood of $p$. This log-likelihood is substantially easier to differentiate, so let's do it:

$$\frac{dL(p)}{dp} = \frac{n_{\text{flu}}}{p} + \frac{(-1)(n - n_{\text{flu}})}{1 - p}$$
where the $-1$ comes from the chain rule for derivatives. To maximize, we set the derivative equal to zero and solve for $p$:

$$0 = \frac{n_{\text{flu}}}{p} - \frac{n - n_{\text{flu}}}{1 - p}$$
$$\Rightarrow \frac{n - n_{\text{flu}}}{1 - p} = \frac{n_{\text{flu}}}{p}$$
$$\Rightarrow (1 - p)\,n_{\text{flu}} = p\,(n - n_{\text{flu}})$$
$$\Rightarrow n_{\text{flu}} - p\,n_{\text{flu}} = p\,n - p\,n_{\text{flu}}$$
$$\Rightarrow n_{\text{flu}} = p\,n - p\,n_{\text{flu}} + p\,n_{\text{flu}}$$
$$\Rightarrow n_{\text{flu}} = p\,n$$
$$\Rightarrow p = \frac{n_{\text{flu}}}{n}$$

so that our inferred estimate of $p$ is $\hat{p} = n_{\text{flu}}/n$, where I write $\hat{p}$ to denote that it is our estimate of $p$, NOT $p$ itself ($p$ is the parameter while $\hat{p}$ is the statistic). Coincidentally, this is exactly what you were taught in 121 to do when calculating $\hat{p}$: number of successes divided by total number. So, if you ever wondered why we teach you that, here you go: it's the maximum likelihood estimate of the probability (or proportion).

Now it's your turn to try to use maximum likelihood estimation on a real dataset. The file WomensHeights.txt contains measurements of 77 women's heights in inches. Your goal is to make inferences about the population of women's heights using this data. Do the following:

1. Complete Steps #1 and #2 by drawing a histogram to confirm that the normal shape is appropriate for this data. Under the normal distribution, the unknown parameters would be the population mean $\mu$ and variance $\sigma^2$.

2. Write out the joint probability of $Y_1, \dots, Y_{77}$. Call this the likelihood of $\mu$ and $\sigma^2$. In case you have forgotten, if $Y_i$ is normal then

$$\Pr(Y_i \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(Y_i - \mu)^2\right\}.$$

Technically the above equation is not a probability for $Y_i$ but a density of $Y_i$. There is no harm, however, in thinking of it as a probability.

3. In preparation for Step #4 (maximizing), calculate the log-likelihood by taking the natural logarithm of your answer in #2.

4. Find the maximum likelihood estimate for $\mu$ by maximizing your log-likelihood in #3: take a derivative with respect to $\mu$, set it equal to 0, and solve. What is the maximum likelihood estimate for $\mu$? Do you recognize this from 121 (hint: you should)?

5.
Now, use your data from the 77 women to compute the maximum likelihood estimate of $\mu$ for this dataset.
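If you work through the exercise, the MLE for $\mu$ turns out to be the sample mean. A minimal sketch of the final computation is below; the file name WomensHeights.txt comes from the text, but the one-number-per-line file format is an assumption, so the running example uses a small made-up sample instead.

```python
# The MLE of the normal mean mu is the sample average: setting the
# derivative of the normal log-likelihood with respect to mu to zero
# yields mu_hat = (1/n) * sum(Y_i).

def mle_mean(ys):
    return sum(ys) / len(ys)

heights = [64.2, 66.1, 63.5, 65.0, 67.3]  # made-up heights in inches
print(mle_mean(heights))  # about 65.22

# With the real data you would instead do something like (format assumed):
# with open("WomensHeights.txt") as f:
#     heights = [float(line) for line in f]
# print(mle_mean(heights))
```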
1.3 Multivariate Maximum Likelihood Estimation

Turn now to the situation where we have a multivariate observation $Y_i = (Y_{i1}, \dots, Y_{ip})'$ rather than a univariate one. The technique of maximizing the likelihood with a multivariate response is no different than in the univariate case; we take exactly the same steps as before. Calculating the joint probability, though, can catch people who aren't careful (which isn't you, of course). That is, our data are $Y_1, \dots, Y_n$ where each $Y_i$ is a vector of $p$ units of information. So, assuming independence, the joint probability of $Y_1, \dots, Y_n$ is

$$\Pr(Y_1, \dots, Y_n \mid \text{parameters}) = \prod_{i=1}^{n} \Pr(Y_i \mid \text{parameters})$$

where $\Pr(Y_i \mid \text{parameters})$ is itself a probability for the multivariate vector $Y_i$.

As an example of multivariate maximum likelihood estimation, consider the following. According to the theory of left-brain or right-brain dominance, each side of the brain controls different types of thinking, and people are said to prefer one type of thinking over the other. For example, a person who is left-brained is often said to lean toward mathematical and quantitative thinking, while a person who is right-brained is said to be creative and excel in verbal skills. Do people tend to be only left- or right-brained? To test this out, the ACT.txt dataset contains $n = 117$ measurements of student ACT scores on the math and verbal sections. Let $Y_{i1}$ denote student $i$'s score on the math section of the ACT and $Y_{i2}$ denote the same student's score on the verbal section, where $i = 1, \dots, n$. We can test the sided-brain theory by looking at the relationship between $Y_{i1}$ and $Y_{i2}$ as described by their joint distribution. So, if we know the joint distribution, then we can see if math people tend NOT to be verbal and vice versa. Perform inference for the joint distribution of $Y_i = (Y_{i1}, Y_{i2})'$ from the ACT.txt dataset by doing the following:

1.
Complete Steps #1 and #2 by checking whether the shape of the multivariate normal distribution is reasonable for $Y_i$ by

(a) drawing histograms (or smoothed histograms called "kernel density estimates") of $Y_{i1}$ and $Y_{i2}$ individually, and

(b) drawing a 2-D kernel density estimate of the $Y_i$.

If the joint distribution of $Y_i$ is MVN, then each distribution individually should be normal and the joint distribution should look similar to the pictures you drew in the primer on random variables and their distributions. The parameters of the multivariate normal distribution are the mean vector $\mu$ and the covariance matrix $\Sigma$.

2. Write out the joint probability of $Y_1, \dots, Y_n$. To complete this problem, you should know that if $Y_i$ is multivariate normal then

$$\Pr(Y_i \mid \mu, \Sigma) = \left(\frac{1}{2\pi}\right)^{p/2} |\Sigma|^{-1/2} \exp\left\{-\frac{1}{2}(Y_i - \mu)'\,\Sigma^{-1}(Y_i - \mu)\right\}$$

3. In preparation for Step #4 (maximizing), calculate the log-likelihood by taking the natural logarithm of your answer in #2.
4. Find the maximum likelihood estimate for $\mu$ by maximizing your log-likelihood in #3: take a derivative with respect to $\mu$, set it equal to 0, and solve. What is the maximum likelihood estimate for $\mu$?

5. Use your data to get the actual maximum likelihood estimate of $\mu$ for this problem.

2 Properties of Maximum Likelihood Estimators

Maximum likelihood is particularly popular for inference because of a few really cool properties associated with the maximum likelihood estimates (which I'll abbreviate as MLEs for short). First, notice that MLEs are really just functions of the data and, if you got new data, you would get a different answer. Because your data are random variables, so is the MLE (the randomness associated with the data gets passed on to the MLE). So we can ask: if the MLE is a random variable, what is its distribution? The answer, as it turns out, is normal, as long as we have a large sample size. This is referred to as the central limit theorem for MLEs. It means that we can use the normal distribution to calculate probabilities associated with the MLE, which is particularly helpful when constructing confidence intervals for the population parameters.

Second, the MLE is consistent. Basically, this means that as your sample size increases, the MLE gets closer and closer to the true parameter. You would think that should be a must-have property of any estimate, but there are estimates out there for which this isn't true, so we'll just be grateful that the MLE is consistent.

Finally, the third property of the MLE is called invariance. This just means that the MLE of any function of a parameter is that same function of the MLE. For example, suppose we are interested in $\log(p)$ from the flu example above. By invariance of the MLE, the MLE of $\log(p)$ is $\log(\hat{p})$. Again, this seems like it should be obvious, but there are estimation techniques for which this isn't true.
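For exercises #4 and #5 in the multivariate section, the MLE of $\mu$ turns out to be the vector of sample means (and the MLE of $\Sigma$ is the sample covariance computed with a divisor of $n$ rather than $n - 1$). A minimal numerical sketch follows; the scores are made up, since ACT.txt is not reproduced here.

```python
# MLE for the multivariate normal: mu_hat is the vector of column means,
# and Sigma_hat is the covariance matrix with divisor n (not n - 1).
import numpy as np

# columns: math score, verbal score (made-up values for illustration)
Y = np.array([[25.0, 22.0],
              [30.0, 27.0],
              [21.0, 24.0],
              [28.0, 26.0]])

n = Y.shape[0]
mu_hat = Y.mean(axis=0)                           # MLE of the mean vector
Sigma_hat = (Y - mu_hat).T @ (Y - mu_hat) / n     # MLE of the covariance matrix

print(mu_hat)
print(Sigma_hat)
```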
I only mention these properties here because we are likely (but not certain) to need them as we look at complicated data sets. We'll return to them and go into more detail as needed.
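The invariance property is easy to check numerically with the flu example: maximizing the likelihood over $\theta = \log(p)$ directly lands on $\log(\hat{p})$. A rough sketch using a grid search (the grid resolution is an arbitrary choice for illustration):

```python
# Numerical check of MLE invariance: the theta = log(p) that maximizes the
# log-likelihood equals log of the p that maximizes it.
import math

n, n_flu = 100, 7

def loglik(p):
    # Equation (2) from the flu example
    return n_flu * math.log(p) + (n - n_flu) * math.log(1 - p)

grid = [i / 10000 for i in range(1, 10000)]       # p values in (0, 1)
p_hat = max(grid, key=loglik)                     # maximize over p

theta_grid = [math.log(p) for p in grid]          # same grid, on the log scale
theta_hat = max(theta_grid, key=lambda t: loglik(math.exp(t)))

print(p_hat, theta_hat)  # theta_hat agrees with log(p_hat)
```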