MS&E 226: Small Data Lecture 12: Frequentist properties of estimators (v4) Ramesh Johari ramesh.johari@stanford.edu 1 / 39

Frequentist inference 2 / 39

Thinking like a frequentist Suppose that for some population distribution with parameters θ, you have a process that takes observations Y and constructs an estimator ˆθ. How can we quantify our uncertainty in ˆθ? By this we mean: how sure are we of our guess? We take a frequentist approach: how well would our procedure do if it were repeated many times? 3 / 39

Thinking like a frequentist The rules of thinking like a frequentist: The parameters θ are fixed (non-random). Given the fixed parameters, there are many possible realizations ("parallel universes") of the data. We get one of these realizations, and use only the universe we live in to reason about the parameters. 4 / 39

The sampling distribution The distribution of ˆθ if we carry out this parallel universe simulation is called the sampling distribution. The sampling distribution is the heart of frequentist inference! Nearly all frequentist quantities we derive come from the sampling distribution. The idea is that the sampling distribution quantifies our uncertainty, since it captures how much we expect the estimator to vary if we were to repeat the procedure over and over ("parallel universes"). 5 / 39

Example: Flipping a biased coin Suppose we are given the outcome of n = 10 flips of a biased coin, and take the sample average of these ten flips as our estimate of the true bias q (i.e., compute the maximum likelihood estimator of the bias). What is the sampling distribution of this estimator? 6 / 39

Example: Flipping a biased coin Suppose true bias is q = 0.20. 100,000 parallel universes, with n = 10 flips in each: [Histogram of the sample average across the universes.] 7 / 39

Example: Flipping a biased coin Suppose true bias is q = 0.5. 100,000 parallel universes, with n = 10 flips in each: [Histogram of the sample average across the universes.] 8 / 39

Example: Flipping a biased coin Suppose true bias is q = 0.65. 100,000 parallel universes, with n = 10 flips in each: [Histogram of the sample average across the universes.] 9 / 39

Example: Flipping a biased coin Suppose true bias is q = 0.20. 100,000 parallel universes, with n = 1000 flips in each: [Histogram of the sample average across the universes.] 10 / 39

Example: Flipping a biased coin Suppose true bias is q = 0.5. 100,000 parallel universes, with n = 1000 flips in each: [Histogram of the sample average across the universes.] 11 / 39

Example: Flipping a biased coin Suppose true bias is q = 0.65. 100,000 parallel universes, with n = 1000 flips in each: [Histogram of the sample average across the universes.] 12 / 39
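
The histograms on the preceding slides can be reproduced with a short simulation. A minimal R sketch (here q = 0.20 and n = 10, one of the cases shown; the other panels just change these two values):

# Simulate the sampling distribution of the sample average (the MLE of q)
set.seed(226)                                # arbitrary seed, for reproducibility
q <- 0.20                                    # true bias (fixed; unknown in practice)
n <- 10                                      # flips per "universe"
B <- 100000                                  # number of parallel universes
q_hat <- rbinom(B, size = n, prob = q) / n   # one MLE per universe
hist(q_hat, breaks = seq(0, 1, by = 0.1), xlab = "Sample average")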

Example: Flipping a biased coin The sampling distribution is what we get in the limit as the number of parallel universes grows large. In this case: the sampling distribution is (1/n) × Binomial(n, q). Note that: In this case, the mean of the sampling distribution is the true parameter. This need not always be true. The standard deviation of the sampling distribution is called the standard error (SE) of the estimator.¹ The sampling distribution looks asymptotically normal as n → ∞. ¹ We also use the term standard error to refer to an estimate of the true SE, computed from a specific sample. In this case we write ŜE to clarify the dependence on the observed data. 13 / 39

Using the sampling distribution How might you use the sampling distribution? Suppose you receive 10 flips of a coin, of which 2 are heads. You compute the MLE, ˆq MLE = 0.2. To quantify your uncertainty in this guess, you can use the sampling distribution in two ways: First, you might simulate the sampling distribution with n = 10, assuming the true bias q was actually equal to your estimate of 0.2. You can use, e.g., the standard deviation of this simulated distribution to measure your uncertainty; this is an example of an estimated standard error (ŜE). Second, you might simulate the sampling distribution with n = 10, assuming a fixed value for the true bias, e.g., q = 0.5. With this simulation you can answer the following question: if the true bias were q, what is the chance that you would have observed the MLE that you did? We discuss both approaches to quantifying uncertainty in the subsequent slides. 14 / 39
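
Both uses of the sampling distribution amount to a few lines of simulation. A minimal R sketch of the two approaches (the fixed bias q0 = 0.5 in the second approach is the illustrative value from the slide):

set.seed(226)
n <- 10
q_hat <- 0.2                                   # MLE from 2 heads in 10 flips
B <- 100000
# Approach 1: plug in the estimate and simulate to get an estimated standard error
sims1 <- rbinom(B, size = n, prob = q_hat) / n
sd(sims1)                                      # estimated standard error (ŜE)
# Approach 2: fix a hypothetical true bias and ask how unusual our estimate would be
q0 <- 0.5
sims2 <- rbinom(B, size = n, prob = q0) / n
mean(sims2 <= q_hat)                           # chance of an MLE this small if the true bias were 0.5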

Analyzing the MLE 15 / 39

Bias and consistency of the MLE The MLE may not be unbiased with finite n. However, the MLE converges to the true parameter as the amount of data grows. This is called consistency. Theorem The MLE is consistent: if the true parameter is θ, then as n → ∞, ˆθ_MLE → θ in probability. 16 / 39

Asymptotic normality of the MLE Theorem For large n, the sampling distribution of the MLE ˆθ_MLE is approximately a normal distribution (with mean equal to the true parameter θ). The variance, and thus the standard error, can be explicitly characterized in terms of the Fisher information of the population model; see Theorem 9.18 in [AoS]. 17 / 39

Example 1: Biased coin flipping Let ˆq_MLE be the MLE of q, the bias of the coin; recall it is just the sample average: ˆq_MLE = (1/n) Σ_{i=1}^n Y_i = Ȳ. ˆq_MLE is unbiased, regardless of n. The standard error of ˆq_MLE can be computed directly: SE = √Var(ˆq_MLE) = √((1/n²) Σ_{i=1}^n q(1 − q)) = √(q(1 − q)/n). We can estimate the standard error of ˆq_MLE by ŜE = √(ˆq_MLE(1 − ˆq_MLE)/n). For large n, ˆq_MLE is approximately normal, with mean q and variance ŜE². 18 / 39

Example 2: Linear normal model Recall that in this case, ˆβ_MLE is the OLS solution. It is unbiased (see Lecture 5), regardless of n. The covariance matrix of ˆβ is given by σ²(XᵀX)⁻¹. (Using similar analysis to Lecture 5.) In particular, the standard error SE_j of ˆβ_j is the square root of the j-th diagonal entry of this matrix (equivalently, σ times the square root of the j-th diagonal entry of (XᵀX)⁻¹). To estimate this covariance matrix (and hence SE_j), we use an estimator ˆσ² for σ². For large n, ˆβ is approximately normal. 19 / 39
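
A minimal R sketch of this calculation, using the airquality regression shown two slides below (model.matrix(), sigma(), and vcov() are standard R functions; the last line is R's own version of the same matrix):

fit <- lm(Ozone ~ 1 + Solar.R + Wind + Temp, data = airquality)
X <- model.matrix(fit)                       # design matrix (rows with missing data dropped)
sigma_hat <- sigma(fit)                      # estimate ˆσ of the error standard deviation
cov_hat <- sigma_hat^2 * solve(t(X) %*% X)   # estimated covariance matrix ˆσ²(XᵀX)⁻¹
sqrt(diag(cov_hat))                          # standard errors SE_j of each coefficient
vcov(fit)                                    # built-in equivalent of cov_hat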

Example 2: Linear normal model What estimator ˆσ² to use? Recall that ˆσ²_MLE is the following sample variance: ˆσ²_MLE = (1/n) Σ_{i=1}^n r_i². But this is not unbiased! In fact it can be shown that: E_Y[ˆσ²_MLE | β, σ², X] = ((n − p)/n) σ². 20 / 39

Example 2: Linear normal model In other words, ˆσ²_MLE underestimates the true error variance. This is because the MLE solution ˆβ was chosen to minimize squared error on the training data. We need to account for this favorable selection of the variance estimate by reinflating it. So an unbiased estimate of σ² is: ˆσ² = (1/(n − p)) Σ_{i=1}^n (Y_i − X_i ˆβ)². The quantity n − p is called the degrees of freedom (DoF). 21 / 39
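
A quick numerical check of the two variance estimators, continuing the airquality example from the next slide (a sketch):

fit <- lm(Ozone ~ 1 + Solar.R + Wind + Temp, data = airquality)
r <- resid(fit)                        # residuals Y_i − X_i ˆβ
n <- length(r); p <- length(coef(fit))
sum(r^2) / n                           # ˆσ²_MLE: biased downward
sum(r^2) / (n - p)                     # unbiased ˆσ²; equals sigma(fit)^2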

Example 2: Linear normal model R output after running a linear regression:

lm(formula = Ozone ~ 1 + Solar.R + Wind + Temp, data = airquality)
            coef.est coef.se
(Intercept)   -64.34   23.05
Solar.R         0.06    0.02
Wind           -3.33    0.65
Temp            1.65    0.25
---
n = 111, k = 4
residual sd = 21.18, R-Squared = 0.61

22 / 39
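
This output layout appears to come from the display() function in the arm package (an assumption based on the format; summary() in base R reports the same estimates and standard errors in a different layout). A sketch to reproduce it:

library(arm)   # assumed to be installed; provides display()
fit <- lm(Ozone ~ 1 + Solar.R + Wind + Temp, data = airquality)
display(fit)   # coefficient estimates (coef.est) and standard errors (coef.se)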

Example 2: Linear normal model In the case of simple linear regression (one covariate with intercept), we can explicitly calculate the standard errors of ˆβ_0 and ˆβ_1: ŜE_0 = ˆσ √(Σ_{i=1}^n X_i²) / (n ˆσ_X); ŜE_1 = ˆσ / (ˆσ_X √n), where ˆσ is the estimate of the standard deviation of the error and ˆσ_X is the sample standard deviation of the covariate. Note that both standard errors decrease proportionally to 1/√n. This is always true for standard errors in linear regression. 23 / 39
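
A sketch verifying these formulas against R's built-in standard errors, using a one-covariate regression on the same airquality data (the choice of Temp as the single covariate is just for illustration):

d <- na.omit(airquality[, c("Ozone", "Temp")])
fit1 <- lm(Ozone ~ Temp, data = d)
n <- nrow(d)
sigma_hat <- sigma(fit1)                           # ˆσ: residual standard deviation
sigma_X <- sqrt(mean((d$Temp - mean(d$Temp))^2))   # ˆσ_X with 1/n normalization
c(SE0 = sigma_hat * sqrt(sum(d$Temp^2)) / (n * sigma_X),
  SE1 = sigma_hat / (sigma_X * sqrt(n)))
summary(fit1)$coefficients[, "Std. Error"]         # matches the two values above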

Standard error vs. estimated standard error Note that the standard error depends on the unknown parameter(s)! Biased coin flipping: SE depends on q. Linear normal model: SE depends on σ². This makes sense: the sampling distribution is determined by the unknown parameter(s), and therefore the standard deviation of this distribution must also depend on the unknown parameter(s). In each case we estimate the standard error by plugging in an estimate for the unknown parameter(s). 24 / 39

Additional properties of the MLE Two additional useful properties of the MLE: It is equivariant: If you want the MLE of a function of θ, say g(θ), you can get it by just computing g(ˆθ MLE ). It is asymptotically optimal: As the amount of data increases, the MLE has asymptotically lower variance than any other asymptotically normal consistent estimator of θ you can construct (in a sense that can be made precise). 25 / 39
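
A small illustration of equivariance in the coin-flipping example, for the log-odds g(q) = log(q/(1 − q)) (the data and the choice of g are purely illustrative):

y <- c(1, 0, 0, 1, 0, 0, 0, 0, 0, 1)    # hypothetical flips (1 = heads)
q_hat <- mean(y)                         # MLE of the bias q
g <- function(q) log(q / (1 - q))        # log-odds
g(q_hat)                                 # by equivariance, the MLE of the log-odds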

Additional properties of the MLE Asymptotic optimality is also referred to as asymptotic efficiency of the MLE. The idea is that the MLE is the most informative estimate of the true parameter(s), given the data. However, it's important to also understand when MLE estimates might not be efficient. A clue to this is provided by the fact that the guarantees for MLE performance are all asymptotic as n grows large (consistency, normality, efficiency). In general, when n is not large (and in particular, when the number of covariates is large relative to n), the MLE may not be efficient. 26 / 39

Confidence intervals 27 / 39

Quantifying uncertainty The combination of the standard error, consistency, and asymptotic normality allows us to quantify uncertainty directly through confidence intervals. In particular, for large n: The sampling distribution of the MLE ˆθ_MLE is approximately normal with mean θ, and standard deviation ŜE. A normal distribution has 95% of its mass within 1.96 standard deviations of the mean. Therefore, in 95% of our universes, ˆθ_MLE will be within 1.96 ŜE of the true value of θ. In other words, in 95% of our universes: ˆθ_MLE − 1.96 ŜE ≤ θ ≤ ˆθ_MLE + 1.96 ŜE. 28 / 39

Confidence intervals We refer to [ˆθ_MLE − 1.96 ŜE, ˆθ_MLE + 1.96 ŜE] as a 95% confidence interval for θ. More generally, let z_α be the unique value such that P(Z ≤ z_α) = 1 − α for a N(0, 1) random variable Z. Then: [ˆθ_MLE − z_{α/2} ŜE, ˆθ_MLE + z_{α/2} ŜE] is a 1 − α confidence interval for θ. In R, you can get z_α using the qnorm function. 29 / 39
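
A minimal R sketch of this construction, applied to the Temp coefficient from the airquality regression (confint() is the built-in alternative; it uses the t distribution rather than the normal approximation):

alpha <- 0.05
z <- qnorm(1 - alpha / 2)                          # ≈ 1.96
beta_hat <- 1.65; se_hat <- 0.25                   # Temp coefficient and its ŜE, from the slides
c(beta_hat - z * se_hat, beta_hat + z * se_hat)    # ≈ [1.16, 2.14]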

Confidence intervals Comments: Note that the interval is random, and θ is fixed! When α = 0.05, then z_{α/2} ≈ 1.96. Confidence intervals can always be enlarged; so the goal is to construct the smallest interval possible that has the desired property. Other approaches to building 1 − α confidence intervals are possible, and these may yield asymmetric intervals. 30 / 39

Example: Linear regression In the regression Ozone ~ 1 + Solar.R + Wind + Temp, the coefficient on Temp is 1.65, with ŜE = 0.25. Therefore a 95% confidence interval for this coefficient is: [1.16, 2.14]. 31 / 39

Example: Linear regression If zero is not in the 95% confidence interval for a particular regression coefficient ˆβ_j, then we say that the coefficient is statistically significant at the 95% level. Why? Suppose the true value of β_j is zero. Then the sampling distribution of ˆβ_j is approximately N(0, ŜE_j²). So the chance of seeing a ˆβ_j that is more than 1.96 ŜE_j away from zero is 5%. In other words: this event is highly unlikely if the true coefficient were actually zero. (This is our first example of a hypothesis test; more on that later.) 32 / 39
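
Under the normal approximation, this tail probability can be computed directly. A sketch for the Temp coefficient from the earlier regression (the two-sided probability below is what later material will call a p-value):

beta_hat <- 1.65; se_hat <- 0.25
z_stat <- beta_hat / se_hat          # ≈ 6.6 estimated standard errors away from zero
2 * pnorm(-abs(z_stat))              # chance of a value this extreme if the true coefficient were 0
2 * pnorm(-1.96)                     # ≈ 0.05: the threshold implied by the 95% interval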

More on statistical significance: A picture 33 / 39

More on statistical significance Lots of warnings: Statistical significance of a coefficient suggests it is worth including in your regression model; but don't forget all the other assumptions that have been made along the way! Conversely, just because a coefficient is not statistically significant does not mean that it is not important to the model! Statistical significance is very different from practical significance! Even if zero is not in a confidence interval, the relationship between the corresponding covariate and the outcome may still be quite weak. 34 / 39

Concluding thoughts on frequentist inference 35 / 39

A thought experiment You run business intelligence for an e-commerce startup. Every day t your marketing department gives you the number of clicks from each visitor i to your site that day (V_i^(t)), and your sales department hands you the amount spent by each of those visitors (R_i^(t)). Every day, your CEO asks you for an estimate of how much each additional click by a site visitor is worth.² So you: Run an OLS linear regression of R^(t) on V^(t). Compute intercept ˆβ_0^(t) and slope ˆβ_1^(t). Report ˆβ_1^(t). But your boss asks: How sure are you of your guess? ² Ignore seasonality: let's suppose the true value of this multiplier is the same every day. 36 / 39

A thought experiment Having taken MS&E 226, you also construct a 95% confidence interval for your guess ˆβ_1^(t) each day: C^(t) = [ˆβ_1^(t) − 1.96 ŜE_1^(t), ˆβ_1^(t) + 1.96 ŜE_1^(t)]. You tell your boss: "I don't know what the real β_1 is, but I am 95% sure it lives in the confidence interval I give you each day." After one year, your boss goes to an industry conference and discovers the true value of β_1, and now he looks back at the guesses you gave him every day. 37 / 39

A thought experiment How does he evaluate you? A picture: 38 / 39
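
The picture can be emulated with a coverage simulation. A minimal R sketch under assumed values for the daily data-generating process (every numeric value below is hypothetical, chosen only to illustrate the roughly 95% coverage):

set.seed(226)
beta0 <- 2; beta1 <- 5; sigma <- 10     # hypothetical true daily model: R = beta0 + beta1*V + noise
n <- 200                                # visitors per day (hypothetical)
days <- 365
covered <- logical(days)
for (t in 1:days) {
  V <- rpois(n, lambda = 3)                        # clicks per visitor (hypothetical)
  R <- beta0 + beta1 * V + rnorm(n, sd = sigma)    # spend per visitor
  fit <- lm(R ~ V)
  se1 <- summary(fit)$coefficients["V", "Std. Error"]
  ci <- coef(fit)["V"] + c(-1, 1) * 1.96 * se1
  covered[t] <- ci[1] <= beta1 && beta1 <= ci[2]
}
mean(covered)    # fraction of days the interval trapped the true beta1; close to 0.95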

The benefit of frequentist inference This example lets us see why frequentist evaluation can be helpful. More generally, the meaning of reporting 95% confidence intervals is that you trap the true parameter in 95% of the claims that you make, even across different estimation problems. (See Section 6.3.2 of [AoS].) This is the defining characteristic of estimation procedures with good frequentist properties: they hold up to scrutiny when repeatedly used. 39 / 39