A Primer on Statistical Inference using Maximum Likelihood


November 3, 2017

1 Inference via Maximum Likelihood

Statistical inference is the process of using observed data to estimate features of the population. In terms of distributions, statistical inference is using observed data to estimate parameters of the corresponding distribution. For example, if we assume that observed data Y_1, ..., Y_n follow a N(µ, σ²) distribution, then we would need to estimate the mean µ and variance σ² using the data Y_1, ..., Y_n. There are various ways of performing inference, including method of moments estimation, generalized estimating equations, and Bayesian inference. Here, however, we are going to focus on maximum likelihood estimation, which is a very common form of estimation (inference) based on maximizing probabilities.

1.1 Intuition for Maximum Likelihood

Before we get to the specifics of maximum likelihood estimation, let's start off with a simple example to get the intuition behind the inference technique. Suppose we have two six-sided dice in a black box that are identical in all ways except their probabilities. Specifically, the probabilities associated with each die are as follows:

Outcome    1     2     3     4     5     6
Die #1     1/6   1/6   1/6   1/6   1/6   1/6
Die #2     1/3   1/6   1/6   1/3   0     0

Now, pretend that we pull out one of the dice and our goal is to figure out (just by rolling it, i.e., observing random outcomes) which die we have. Let Y_i be the observation associated with the i-th roll of the die. Notice that we can think of Y_i as a random variable with a distribution given by one of the rows in the table. The trouble is we don't know which row corresponds to the distribution of Y_i, so we need to roll the die n times and observe Y_1, ..., Y_n to try and figure it out. In other words, after rolling the die n times, we are going to use Y_1, ..., Y_n to infer which die we have. So, let's start rolling the die and try to make inference.

On our first roll we get Y_1 = 3: which die do we think it is? Well, let's calculate the probability of Y_1 = 3 under both die scenarios:

Pr(Y_1 = 3 | Die #1) = 1/6
Pr(Y_1 = 3 | Die #2) = 1/6

so, under either die, the probability is the same. After n = 1 roll we really can't infer which die we have, so let's roll again. On our second roll we get Y_2 = 1: which die do we think it is? Again, let's calculate the probabilities under each die:

Pr(Y_1 = 3, Y_2 = 1 | Die #1) = 1/6 × 1/6 = 1/36
Pr(Y_1 = 3, Y_2 = 1 | Die #2) = 1/6 × 1/3 = 1/18

where the multiplication comes from assuming independence of rolls (which is quite reasonable in this case). Notice that observing Y_1 = 3 and Y_2 = 1 is more likely under Die #2, so at this point we are going to infer that we have Die #2. In other words, we are going to choose the die that maximizes the probability of our observed data. Just to be confident in our decision of Die #2, we decide to roll again and get Y_3 = 6. Now which die do we think we have? The answer is obvious when calculating probabilities:

Pr(Y_1 = 3, Y_2 = 1, Y_3 = 6 | Die #1) = 1/6 × 1/6 × 1/6 = 1/216
Pr(Y_1 = 3, Y_2 = 1, Y_3 = 6 | Die #2) = 1/6 × 1/3 × 0 = 0.

At this point we know for sure which die we have: we have Die #1, because we could never roll a 6 with Die #2. In this little example, we used our observed data Y_1, Y_2, Y_3 to make inference about which die we have. Specifically, we made the inference that maximized the probability of our observed data. This is what is referred to as maximum likelihood estimation.
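As a quick numerical check of the same logic, here is a minimal Python sketch (the numpy import and variable names are my own illustration, not part of the original handout) that computes the joint probability of the observed rolls under each candidate die:

```python
import numpy as np

# Probability of each face (1 through 6) under the two candidate dice
die1 = np.array([1/6, 1/6, 1/6, 1/6, 1/6, 1/6])
die2 = np.array([1/3, 1/6, 1/6, 1/3, 0.0, 0.0])

rolls = [3, 1, 6]  # the observed rolls Y_1, Y_2, Y_3

# Joint probability of the rolls under each die, assuming independent rolls
prob_die1 = np.prod([die1[r - 1] for r in rolls])
prob_die2 = np.prod([die2[r - 1] for r in rolls])

print(prob_die1)  # 1/216, about 0.0046
print(prob_die2)  # 0.0 -- a 6 is impossible under Die #2
```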

1.2 Univariate Maximum Likelihood Estimation

Suppose we observe n data points Y_1, ..., Y_n. Statistical inference for the population associated with Y_1, ..., Y_n proceeds as follows: (i) explore the data to determine what distribution shape is appropriate for Y_1, ..., Y_n; (ii) after determining a shape (e.g., normal), determine which parameters of the distribution are unknown; (iii) calculate the joint probability of Y_1, ..., Y_n under this distribution; then (iv) infer the unknown parameters by maximizing the joint probability of the observed data.

To get into the details a bit, let's walk through a real example together. Suppose I am trying to find out the success rate of a flu vaccine in preventing the flu. In other words, I am trying to figure out the probability of contracting the flu given that the person received the vaccine. Notationally, let p represent the probability of contracting the flu if a person receives the vaccine. Since our goal is to infer p, I give the flu vaccine to 100 people and find that 7 of them still got the flu. Let Y_i be the outcome for person i, where Y_i = 1 if person i got the flu and Y_i = 0 if person i didn't get the flu. From the 100 people, I observed Y_1 = ⋯ = Y_7 = 1 and Y_8 = ⋯ = Y_100 = 0. Let's perform statistical inference for p.

Step #1 is to figure out an appropriate distribution for Y_i. In this case, the Bernoulli distribution corresponds to binary (0/1) outcomes, so let's use that. Under the Bernoulli distribution,

Pr(Y_i | p) = p^{Y_i} (1 − p)^{1 − Y_i},

so that Pr(Y_i = 1) = p and Pr(Y_i = 0) = 1 − p, leaving us with p as the probability of contracting the flu when a person receives the vaccine.

Step #2 is to figure out the unknown parameters associated with my chosen distribution. In this case, the only unknown parameter is p itself.

Step #3 is to calculate the joint distribution of Y_1, ..., Y_n. To do this, let's assume independence of events, so that

Pr(Y_1, ..., Y_n | p) = ∏_{i=1}^n p^{Y_i} (1 − p)^{1 − Y_i}
                      = (p^{Y_1} p^{Y_2} ⋯ p^{Y_n}) × ((1 − p)^{1 − Y_1} (1 − p)^{1 − Y_2} ⋯ (1 − p)^{1 − Y_n})
                      = p^{∑_{i=1}^n Y_i} (1 − p)^{∑_{i=1}^n (1 − Y_i)}
                      = p^{∑_{i=1}^n Y_i} (1 − p)^{n − ∑_{i=1}^n Y_i}.

Taking a look at this last form, notice that ∑_{i=1}^n Y_i is just the number of people who got the flu (take a minute to convince yourself of this if you don't see it). That means n − ∑_{i=1}^n Y_i is the number of people who didn't get the flu. Notationally, let n_flu be the number of people who got the flu, so that n − n_flu is the number who didn't. We can rewrite the last equation above as

Pr(Y_1, ..., Y_n | p) = p^{n_flu} (1 − p)^{n − n_flu}.    (1)

Step #4 is to choose the p that maximizes the joint probability we just calculated (this is equivalent to choosing the die that maximized the joint probability of the die outcomes above). The second we go to do this, notice that we are no longer thinking of Equation (1) as a function of the Y_i (which we did when we calculated it). Rather, we are considering Equation (1) as a function of the parameter p. For this reason we call (1) the likelihood of p (which is the same thing as the probability of Y_1, ..., Y_n), because we are thinking of it as a function of p, not of the Y_i. Hence, this is where we get the name maximum likelihood estimation.
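Before doing the maximization with calculus, it can help to see Equation (1) as a curve in p. The following Python sketch (a grid-search illustration of my own, not part of the original handout) evaluates the likelihood for the flu data (n = 100, n_flu = 7) over a grid of p values and reports where it peaks:

```python
import numpy as np

n, n_flu = 100, 7

# Evaluate the likelihood L(p) = p^n_flu * (1 - p)^(n - n_flu) on a grid of p values
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = p_grid**n_flu * (1 - p_grid)**(n - n_flu)

# The grid value with the largest likelihood sits at (or very near) n_flu / n = 0.07
print(p_grid[np.argmax(likelihood)])
```

The analytical maximization below confirms what the grid search suggests.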

To maximize the likelihood, we need to take derivatives of (1) with respect to p. This derivative would be quite ugly, so let's do something simpler using a calculus trick we remember from high school (or, if you don't remember, let me remind you). The trick is to maximize the logarithm of (1) rather than the original function. Standard math results show that the value maximizing the natural logarithm of a function is the same as the value maximizing the function itself. The natural log of (1) is

L(p) = n_flu ln(p) + (n − n_flu) ln(1 − p),    (2)

where I am using L to denote the log-likelihood of p. This log-likelihood is substantially easier to take derivatives of, so let's do it:

dL(p)/dp = n_flu/p + (−1)(n − n_flu)/(1 − p),

where the −1 comes from the chain rule for derivatives. To maximize, we set the derivative equal to zero and solve for p. Let's do that:

0 = n_flu/p − (n − n_flu)/(1 − p)
⇒ n_flu/p = (n − n_flu)/(1 − p)
⇒ (1 − p) n_flu = p (n − n_flu)
⇒ n_flu − p n_flu = p n − p n_flu
⇒ n_flu = p n
⇒ p = n_flu / n,

so that our inferred estimate of p is p̂ = n_flu/n, where I use p̂ to denote that it is our estimate of p, NOT p itself (p is the parameter while p̂ is the statistic). Coincidentally, this is exactly what you were taught in 121 to do when calculating p̂: number of successes / (total number). So, if you ever wondered why we teach you that, here you go: it is the maximum likelihood estimate of the probability (or proportion).

Now it's your turn to use maximum likelihood estimation on a real dataset. The file WomensHeights.txt contains measurements of 77 women's heights in inches. Your goal is to make inference about the population of women's heights using this data. Do the following:

1. Complete Steps #1 and #2 by drawing a histogram to confirm that the normal shape is appropriate for this data. Under the normal distribution, the unknown parameters are the population mean µ and variance σ².

2. Write out the joint probability of Y_1, ..., Y_77. Call this the likelihood of µ and σ². In case you have forgotten, if Y_i is normal then

   Pr(Y_i | µ, σ²) = (1/√(2πσ²)) exp{ −(Y_i − µ)² / (2σ²) }.

   Technically, the above equation is not a probability for Y_i but a density of Y_i. There is no harm, however, in thinking about it as a probability.

3. In preparation for Step #4 (maximizing), calculate the log-likelihood by taking the natural logarithm of your answer in #2.

4. Find the maximum likelihood estimate for µ by maximizing your log-likelihood in #3: take a derivative with respect to µ, set it equal to 0, and then solve. What is the maximum likelihood estimate for µ? Do you recognize it from 121 (hint: you should)?

5. Now, use your data from the 77 women to compute the maximum likelihood estimate of µ for this dataset (a short code sketch for reading the file follows this list).
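For step 5, a minimal sketch of the computation might look like the following. I am assuming WomensHeights.txt holds one height per line with no header; check the actual file and adjust the loading step if needed.

```python
import numpy as np

# Assumed format: one height (in inches) per line -- adjust if the file differs
heights = np.loadtxt("WomensHeights.txt")

mu_hat = heights.mean()                       # MLE of the population mean
sigma2_hat = ((heights - mu_hat)**2).mean()   # MLE of the variance (divides by n, not n - 1)

print(mu_hat, sigma2_hat)
```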

1.3 Multivariate Maximum Likelihood Estimation

Turn now to the situation where we have a multivariate observation Y_i = (Y_i1, ..., Y_iP)' rather than a univariate one. The technique of maximizing the likelihood with a multivariate response is no different than in the univariate case. That is, we take exactly the same steps as we did before. Calculating the joint probability, though, can catch people who aren't careful (which isn't you, of course). That is, our data are Y_1, ..., Y_n where each Y_i is a vector of P pieces of information. So, assuming independence, the joint probability of Y_1, ..., Y_n is

Pr(Y_1, ..., Y_n | parameters) = ∏_{i=1}^n Pr(Y_i | parameters),

where Pr(Y_i | parameters) is itself a probability for the multivariate vector Y_i.

As an example of multivariate maximum likelihood estimation, consider the following. According to the theory of left-brain or right-brain dominance, each side of the brain controls different types of thinking, and people are said to prefer one type of thinking over the other. For example, a person who is left-brained is often said to lean toward mathematical and quantitative thinking, while a person who is right-brained is said to be creative and to excel in verbal skills. Do people tend to be only left- or right-brained? To test this out, the ACT.txt dataset contains n = 117 measurements of student ACT scores on the math and verbal sections. Let Y_i1 denote student i's score on the math section of the ACT and Y_i2 denote the same student's score on the verbal section, where i = 1, ..., n. We can test the sided-brain theory by looking at the relationship between Y_i1 and Y_i2 as described by their joint distribution. So, if we know the joint distribution, then we can see whether math people tend NOT to be verbal and vice versa.

Perform inference for the joint distribution of Y_i = (Y_i1, Y_i2)' from the ACT.txt dataset by doing the following:

1. Complete Steps #1 and #2 by checking whether the shape of the multivariate normal distribution is reasonable for Y_i by
   (a) drawing histograms (or smoothed histograms called "kernel density estimates") of Y_i1 and Y_i2 individually, and
   (b) drawing a 2-D kernel density estimate of the Y_i.
   If the joint distribution of Y_i is MVN, then each distribution individually should be normal and the joint distribution should look similar to the pictures you drew in the primer on random variables and their distributions. The parameters of the multivariate normal distribution are the mean vector µ and the covariance matrix Σ.

2. Write out the joint probability of Y_1, ..., Y_n. To complete this problem, you should know that if Y_i is multivariate normal then

   Pr(Y_i | µ, Σ) = (1/(2π))^{P/2} |Σ|^{−1/2} exp{ −(1/2) (Y_i − µ)' Σ^{−1} (Y_i − µ) }.

3. In preparation for Step #4 (maximizing), calculate the log-likelihood by taking the natural logarithm of your answer in #2.

4. Find the maximum likelihood estimate for µ by maximizing your log-likelihood in #3: take a derivative with respect to µ, set it equal to 0, and then solve. What is the maximum likelihood estimate for µ?

5. Use your data to get the actual maximum likelihood estimate of µ for this problem (a short code sketch follows this list).
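For item 5, a minimal sketch might look like the following. I am assuming ACT.txt has two whitespace-separated columns per row (math score, then verbal score) with no header; verify the format before running.

```python
import numpy as np

# Assumed format: two columns per row -- math score, then verbal score
scores = np.loadtxt("ACT.txt")           # expected shape: (117, 2)

mu_hat = scores.mean(axis=0)             # MLE of the mean vector (math, verbal)
Sigma_hat = np.cov(scores, rowvar=False, bias=True)  # MLE of the covariance (divides by n)

print(mu_hat)
print(Sigma_hat)  # the sign of the off-diagonal entry indicates whether math and
                  # verbal scores tend to move together or in opposite directions
```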

2 Properties of Maximum Likelihood Estimators

Maximum likelihood is particularly popular for inference because of a few really cool properties of the maximum likelihood estimates (which I'll abbreviate as MLEs). First, notice that MLEs are really just functions of the data, and if you got new data you would get a different answer. Because your data are random variables, so is the MLE (the randomness associated with the data gets passed on to the MLE). So we can ask: if the MLE is a random variable, what is its distribution? The answer, as it turns out, is normal, as long as we have a large sample size. This is referred to as the central limit theorem for MLEs. It basically means that we can use the normal distribution to calculate probabilities associated with the MLE, which is particularly helpful when constructing confidence intervals for the population parameters.

Second, the MLE is consistent. Basically, this means that as your sample size increases, the MLE gets closer and closer to the true parameter. You would think that should be a must-have property of any estimate, but there are some estimates out there for which this isn't true, so we'll just be grateful that the MLE is consistent.

Finally, the third property of the MLE is called invariance. This just means that the MLE of any function of a parameter is that same function of the MLE. For example, suppose we are interested in log(p) from the flu example above. By invariance of the MLE, the MLE of log(p) is log(p̂). Again, this seems like it should be obvious, but there are estimation techniques for which it isn't true.

I only mention these properties here because we are likely (but not certain) to need them as we look at more complicated data sets. We'll return to them and go into more detail as needed.
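To see the first two properties in action, here is a small simulation sketch (my own illustration, not part of the original handout) that repeatedly draws flu-vaccine-style datasets with a true p = 0.07 and looks at the distribution of the MLE p̂ = n_flu/n across replications:

```python
import numpy as np

rng = np.random.default_rng(42)
p_true, n, reps = 0.07, 100, 5000

# Each replication: vaccinate n people, count the flu cases,
# and record the MLE p_hat = n_flu / n
n_flu = rng.binomial(n, p_true, size=reps)
p_hat = n_flu / n

print(p_hat.mean())  # close to 0.07; it gets closer as n grows (consistency)
print(p_hat.std())   # roughly sqrt(p(1-p)/n), about 0.026 here
# A histogram of p_hat (e.g., with matplotlib) looks approximately normal,
# illustrating the central limit theorem for MLEs.
```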