STATS 200: Introduction to Statistical Inference. Lecture 29: Course review



Course review

We started in Lecture 1 with a fundamental assumption: data is a realization of a random process. The goal throughout this course has been to use the observed data to draw inferences about the underlying process or probability distribution. We discussed:

- Hypothesis testing: deciding whether a particular null hypothesis about the underlying distribution is true or false
- Estimation: fitting a parametric model to this distribution and/or estimating a quantity related to the parameters of this model
- Standard errors and confidence intervals: quantifying the uncertainty of these estimates


Hypothesis testing

Goal: accept or reject a null hypothesis H_0 based on the value of an observed test statistic T.

Question #1: How do we choose a test statistic T?
Question #2: How do we decide whether H_0 is true or false based on T?

In a simple-vs-simple testing problem, there is a best answer to Question #1: the likelihood ratio statistic

L(X_1, ..., X_n) = f_0(X_1, ..., X_n) / f_1(X_1, ..., X_n).

We can equivalently use any monotonic one-to-one transformation of this statistic, which is simpler to understand in many examples (e.g. the total count for Bernoulli coin flips).
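As a minimal sketch of the Bernoulli coin-flip example (the hypothesized values p = 0.5 under H_0 and p = 0.7 under H_1 are chosen purely for illustration), the likelihood ratio depends on the data only through the total count, and is monotone in it:

```python
def likelihood_ratio(xs, p0=0.5, p1=0.7):
    """Likelihood ratio f0(xs) / f1(xs) for i.i.d. Bernoulli data."""
    n, t = len(xs), sum(xs)            # t = total count of successes
    f0 = p0 ** t * (1 - p0) ** (n - t)
    f1 = p1 ** t * (1 - p1) ** (n - t)
    return f0 / f1

n = 10
# L depends on the data only through t, and is strictly decreasing in t
# here: more successes favor H1 (p = 0.7 > 0.5).  So rejecting for small
# L is equivalent to rejecting for a large total count.
lrs = [likelihood_ratio([1] * t + [0] * (n - t)) for t in range(n + 1)]
assert all(a > b for a, b in zip(lrs, lrs[1:]))
```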


Hypothesis testing

In problems with composite alternatives, there is often not a single most powerful test against all of the alternative distributions. We instead discussed popular choices of T for several examples:

- Testing goodness of fit: histogram methods (Pearson's chi-squared statistic), QQ-plot methods (one-sample Kolmogorov-Smirnov statistic)
- Testing whether a distribution is centered at 0: one-sample t-statistic, signed-rank statistic, sign statistic
- Testing whether two samples have the same distribution or mean: two-sample t-statistic, rank-sum statistic
- Testing whether the parameters of a model satisfy additional constraints (i.e. belong to a sub-model): generalized likelihood ratio statistic
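Two of the center-at-0 statistics above are simple enough to compute by hand; here is a sketch on a small hypothetical sample (the data values are illustrative only):

```python
import math
import statistics

def t_statistic(xs, mu0=0.0):
    """One-sample t-statistic: (xbar - mu0) / (s / sqrt(n))."""
    n = len(xs)
    s = statistics.stdev(xs)               # sample SD, n - 1 denominator
    return (statistics.mean(xs) - mu0) / (s / math.sqrt(n))

def sign_statistic(xs):
    """Sign statistic: the number of observations above 0.  Under H0
    (distribution centered at 0, no ties) it is Binomial(n, 1/2)."""
    return sum(1 for x in xs if x > 0)

sample = [0.3, -0.1, 0.8, 1.2, -0.4, 0.5, 0.9, 0.2]  # hypothetical data
t = t_statistic(sample)
s = sign_statistic(sample)
```

The t-statistic uses the actual magnitudes of the data, while the sign statistic discards them; that trade-off (power versus robustness) is what distinguishes the parametric and rank-based tests in the list above.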


Hypothesis testing

Question #2: How do we decide whether H_0 is true or false based on T?

We adopted the frequentist significance testing framework: control the probability of type I error (falsely rejecting H_0) at a target level α, by considering the distribution of T if H_0 were true. To do this for each test, we either derived the null distribution of T exactly (e.g. the t-test), appealed to asymptotic approximations (e.g. the χ² distribution for the GLRT), or used computer simulation (e.g. permutation two-sample tests).
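The simulation route can be sketched in a few lines. Below is a minimal two-sample permutation test for a difference in means (the data and the number of shuffles are illustrative assumptions, not values from the lecture):

```python
import random
import statistics

def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the difference in sample means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(x) - statistics.mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                # relabel the samples under H0
        d = abs(statistics.mean(pooled[:len(x)]) -
                statistics.mean(pooled[len(x):]))
        if d >= observed:
            hits += 1
    return hits / n_perm

x = [5.1, 4.9, 6.2, 5.8, 5.5]   # hypothetical sample 1
y = [4.2, 3.9, 4.5, 4.1, 4.4]   # hypothetical sample 2
p = permutation_pvalue(x, y)
```

Because the null distribution is built by resampling rather than by formula, no normality assumption is needed; the cost is simulation error, which shrinks as n_perm grows.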


Estimation

Goal: fit a parametric probability model to the observed data. For IID data X_1, ..., X_n ~ f(x|θ), we discussed three methods:

- Method of moments: equate the sample means of X_i, X_i², etc. to the theoretical means of X, X², etc., and solve for θ
- Maximum likelihood: pick θ to maximize the likelihood ∏_{i=1}^n f(X_i|θ), or equivalently the log-likelihood ∑_{i=1}^n log f(X_i|θ)
- Bayesian inference: postulate a prior distribution f_Θ(θ) for θ, compute the posterior

  f_{Θ|X}(θ|X_1, ..., X_n) ∝ f_Θ(θ) ∏_{i=1}^n f(X_i|θ),

  and estimate θ by e.g. the posterior mean

We illustrated how the method of maximum likelihood generalizes to regression models with covariates, where the data Y_1, ..., Y_n are independent but not identically distributed.
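For the Bernoulli model all three methods have closed forms, which makes a side-by-side sketch easy (the data and the Beta(1, 1) prior are illustrative choices, not values from the lecture):

```python
# Hypothetical coin-flip data; the model is X_i ~ Bernoulli(theta).
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
n, t = len(data), sum(data)

# Method of moments: equate the sample mean to E[X] = theta.
theta_mom = t / n

# Maximum likelihood: maximize theta^t * (1 - theta)^(n - t); for the
# Bernoulli model the maximizer is again the sample mean.
theta_mle = t / n

# Bayesian inference with a Beta(a, b) prior: the posterior is
# Beta(t + a, n - t + b), whose mean shrinks the MLE toward a / (a + b).
a, b = 1, 1                  # uniform prior, chosen here for illustration
theta_bayes = (t + a) / (n + a + b)
```

Here the method of moments and the MLE coincide; in models where they differ (e.g. Uniform(0, θ)), the MLE is typically the more efficient of the two.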


Estimation

We discussed the accuracy of estimators in terms of several finite-n properties:

- The bias E_θ[ˆθ] − θ
- The variance Var_θ[ˆθ]
- The MSE E_θ[(ˆθ − θ)²] = bias² + variance

We also discussed asymptotic properties, as n → ∞ and when the true parameter is θ:

- Consistency: ˆθ → θ in probability
- Asymptotic normality: √n (ˆθ − θ) → N(0, v(θ)) in distribution for some variance v(θ)
- Asymptotic efficiency: ˆθ is asymptotically normal, and the limiting variance is v(θ) = I(θ)⁻¹, where I(θ) is the Fisher information (of a single observation)
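The bias-variance decomposition of the MSE can be checked numerically. This sketch uses a deliberately biased shrinkage estimator of a Bernoulli p (the estimator, true p, and sample sizes are all illustrative assumptions):

```python
import random
import statistics

rng = random.Random(1)
p_true, n, reps = 0.3, 20, 5000

# Sampling distribution of the shrinkage estimator (t + 1) / (n + 2).
estimates = []
for _ in range(reps):
    t = sum(1 for _ in range(n) if rng.random() < p_true)
    estimates.append((t + 1) / (n + 2))

bias = statistics.mean(estimates) - p_true
variance = statistics.pvariance(estimates)
mse = statistics.mean((e - p_true) ** 2 for e in estimates)
# MSE = bias^2 + variance holds exactly for the empirical distribution
# of the replicated estimates (up to floating-point rounding).
assert abs(mse - (bias ** 2 + variance)) < 1e-9
```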


Estimation

The definition of asymptotic efficiency was motivated by the Cramer-Rao lower bound: under mild smoothness conditions, the variance of any unbiased estimator ˆθ of θ is at least (1/n) I(θ)⁻¹.

A major result was that the MLE is asymptotically efficient:

√n (ˆθ − θ) → N(0, I(θ)⁻¹) in distribution

We showed informally that Bayes estimators approach the MLE asymptotically for large n; an implication is that Bayes estimators are usually also asymptotically efficient. On the other hand, method-of-moments estimators are often not asymptotically efficient and have a larger mean-squared error than the other two procedures for large n.
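A minimal simulation sketch (with illustrative values p = 0.4 and n = 100) checking that the variance of the Bernoulli MLE tracks the Cramer-Rao bound:

```python
import random
import statistics

rng = random.Random(2)
p, n, reps = 0.4, 100, 20000

# Sampling distribution of the Bernoulli MLE p_hat = t / n.
mles = []
for _ in range(reps):
    t = sum(1 for _ in range(n) if rng.random() < p)
    mles.append(t / n)

# The Fisher information of one Bernoulli(p) observation is
# I(p) = 1 / (p (1 - p)), so the Cramer-Rao bound (and the asymptotic
# variance of the MLE) is p (1 - p) / n.
bound = p * (1 - p) / n
empirical = statistics.pvariance(mles)
```

Here the MLE is unbiased and attains the bound exactly; in general the bound is attained only in the large-n limit.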


Standard errors and confidence intervals

We discussed standard error estimates and confidence intervals both when the model is correctly specified and when it is not. In a correctly specified model, we can derive the variance v(θ) of the estimate ˆθ in terms of the true parameter θ, and estimate the standard error by the plugin estimate √v(ˆθ). For the MLE, asymptotic efficiency implies that v(θ) ≈ (1/n) I(θ)⁻¹ for large n. In the setting of non-IID data Y_1, ..., Y_n in regression models, we used the Fisher information of all n samples, I_Y(θ)⁻¹, in place of (1/n) I(θ)⁻¹. For other estimators, we can sometimes derive v(θ) directly from the form of ˆθ, perhaps using asymptotic approximations like the CLT and the delta method.

Standard errors and confidence intervals

In an incorrectly specified model, we studied the behavior of the MLE ˆθ, and interpreted the parameter θ that it tries to estimate as the one whose model distribution is closest in KL-divergence to the true distribution of the data. We derived a more general formula for the variance of ˆθ, and showed how this variance may be estimated by a sandwich estimator. Alternatively, we discussed the nonparametric bootstrap as a simulation-based approach for estimating the standard error that is also robust to model misspecification.
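The nonparametric bootstrap needs no model at all, only resampling. A minimal sketch (the data values, number of bootstrap replicates, and seed are illustrative assumptions):

```python
import random
import statistics

def bootstrap_se(data, estimator, n_boot=2000, seed=3):
    """Nonparametric bootstrap standard error: resample the data with
    replacement, recompute the estimator, take the SD of the replicates."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]
        replicates.append(estimator(resample))
    return statistics.stdev(replicates)

data = [2.1, 3.4, 1.8, 4.2, 2.9, 3.7, 2.3, 3.1, 2.6, 3.9]  # hypothetical
se_hat = bootstrap_se(data, statistics.mean)
```

For the sample mean this should land near the classical s/√n; the same code applies unchanged to estimators with no convenient variance formula (e.g. the median), which is the point of the method.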


Standard errors and confidence intervals

To estimate a function g(θ), we may use the plugin estimate g(ˆθ). If ˆθ is asymptotically normal, then g(ˆθ) is usually also asymptotically normal, and its asymptotic variance may be derived using the delta method. If ˆθ is asymptotically efficient (e.g. the MLE), then g(ˆθ) is also asymptotically efficient for g(θ).

For any quantity θ, an approximate 100(1 − α)% confidence interval for θ may be obtained from any asymptotically normal estimator ˆθ and an estimate ŝe of its standard error:

ˆθ ± z(α/2) ŝe

Most confidence intervals that we constructed in this class were of this form. (The accuracy of such intervals should be checked by simulation if n is small.)
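A sketch of both ideas for the Bernoulli model, with hypothetical counts (n = 200 trials, t = 120 successes): a Wald interval for p itself, then a delta-method interval for the log-odds g(p) = log(p/(1 − p)):

```python
import math

z = 1.96                    # approximately z(alpha/2) for alpha = 0.05

# Hypothetical data: t successes in n Bernoulli(p) trials; MLE p_hat = t/n.
n, t = 200, 120
p_hat = t / n
se_p = math.sqrt(p_hat * (1 - p_hat) / n)      # plugin standard error
ci_p = (p_hat - z * se_p, p_hat + z * se_p)

# Delta method for g(p) = log(p / (1 - p)): g'(p) = 1 / (p (1 - p)), so
# se(g(p_hat)) ~ |g'(p_hat)| * se_p = 1 / sqrt(n * p_hat * (1 - p_hat)).
logit = math.log(p_hat / (1 - p_hat))
se_logit = 1 / math.sqrt(n * p_hat * (1 - p_hat))
ci_logit = (logit - z * se_logit, logit + z * se_logit)
```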

Final exam

You are allowed one 8.5 × 11 inch cheat sheet, front and back. No other outside material is permitted. The exam will provide relevant formulas (like the PDFs/PMFs of common distributions), as was done on the midterm. Some material is from before the midterm, but the focus is on Units 2 and 3.

The exam will test conceptual understanding and problem solving; you won't be asked to reproduce detailed mathematical proofs. We were informal in our discussion of the regularity conditions required for asymptotic efficiency of the MLE, the asymptotic χ² distribution of the GLRT statistic 2 log Λ, etc. For parametric models having a differentiable likelihood function and common support, you don't need to check regularity conditions when applying these results.

One last example: Beyond the MLE We live in a time when there is a convergence of ideas and interchange of tools across quantitative disciplines. There is a rich interplay between statistical inference and optimization, algorithms, machine learning, and information theory. Here is one last example which illustrates how the idea of the MLE, in a seemingly simple problem, connects to interesting questions in a variety of other fields of study.

One last example: Beyond the MLE n people, belonging to two distinct communities of equal size n/2, are connected in a social network. Every pair of people is connected (independently) with probability p if they are in the same community, and with probability q if they are in different communities, where q < p. We can see the network of connections, but we cannot see which person belongs to which community. Question: Suppose (for simplicity) that we know the values p and q. How can we infer who belongs to which community?


One last example: Beyond the MLE

Let S ⊂ {1, ..., n} denote community 1, and S^c denote community 2. Let Same be the set of pairs {i, j} such that i, j ∈ S or i, j ∈ S^c, and let Different be the set of pairs {i, j} such that one of i, j belongs to S and the other to S^c.

Our observed data are the connections in this network:

A_ij = 1 if {i, j} are connected in the network, and A_ij = 0 otherwise.

Under our model, the A_ij are independent Bernoulli random variables, with A_ij ~ Bernoulli(p) if {i, j} ∈ Same and A_ij ~ Bernoulli(q) if {i, j} ∈ Different.
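The model above is easy to simulate. In this sketch the community sizes, edge probabilities p = 0.7 and q = 0.1, and seed are all illustrative assumptions; the code just draws the A_ij as described and confirms that within-community connections dominate:

```python
import random

def sample_network(n, p, q, seed=4):
    """Symmetric adjacency matrix: individuals 0..n/2-1 form community S,
    the rest form S^c; each edge is drawn independently, Bernoulli(p)
    within a community and Bernoulli(q) across communities."""
    rng = random.Random(seed)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same = (i < n // 2) == (j < n // 2)
            if rng.random() < (p if same else q):
                A[i][j] = A[j][i] = 1
    return A

n = 40
A = sample_network(n, p=0.7, q=0.1)
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
within = sum(A[i][j] for i, j in pairs if (i < n // 2) == (j < n // 2))
across = sum(A[i][j] for i, j in pairs if (i < n // 2) != (j < n // 2))
```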


One last example: Beyond the MLE

The likelihood function is

lik(S) = [∏_{{i,j} ∈ Same} p^{A_ij} (1 − p)^{1 − A_ij}] × [∏_{{i,j} ∈ Different} q^{A_ij} (1 − q)^{1 − A_ij}]

= [∏_{{i,j} ∈ Same} (1 − p)] [∏_{{i,j} ∈ Same} (p/(1 − p))^{A_ij}] × [∏_{{i,j} ∈ Different} (1 − q)] [∏_{{i,j} ∈ Different} (q/(1 − q))^{A_ij}]

Each observed connection (where A_ij = 1) contributes a factor of p/(1 − p) to the likelihood if {i, j} ∈ Same, and a factor of q/(1 − q) if {i, j} ∈ Different. Since p > q, the MLE Ŝ is given by the partition of the people into two communities that minimizes the number of observed connections between communities (or, equivalently, maximizes the number of connections within communities).


One last example: Beyond the MLE

More formally, the MLE Ŝ solves the optimization problem

minimize ∑_{i ∈ S} ∑_{j ∈ S^c} A_ij subject to |S| = |S^c| = n/2

This is a well-known problem in computer science, called the minimum graph bisection problem. Unfortunately, this problem is known to be NP-complete: it is widely believed that no computationally efficient algorithm can compute this MLE (for all possible realizations of the network).
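For a network small enough to enumerate, the MLE can still be computed by brute force, which also makes the combinatorial explosion vivid: the sketch below (with illustrative n = 8, p = 0.9, q = 0.05) searches all C(n, n/2) bisections, which is already hopeless for n in the hundreds:

```python
import itertools
import random

def sample_network(n, p, q, seed=5):
    """Same two-community model as above: individuals 0..n/2-1 form S."""
    rng = random.Random(seed)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same = (i < n // 2) == (j < n // 2)
            if rng.random() < (p if same else q):
                A[i][j] = A[j][i] = 1
    return A

def cut_size(A, S):
    """Number of observed connections between S and its complement."""
    m = len(A)
    return sum(A[i][j] for i in range(m) for j in range(i + 1, m)
               if (i in S) != (j in S))

def mle_bisection(A):
    """Exact MLE by exhaustive search over all C(n, n/2) bisections."""
    n = len(A)
    return min((set(S) for S in itertools.combinations(range(n), n // 2)),
               key=lambda S: cut_size(A, S))

n = 8
A = sample_network(n, p=0.9, q=0.05)
S_hat = mle_bisection(A)
```

Note that S and S^c always give the same cut, so the labeling is identifiable only up to swapping the two communities.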

One last example: Beyond the MLE

This leads to a number of interesting questions:

- Can we approximately solve this optimization problem, and prove that our answer is not too far off?
- Are there other algorithms (not directly based on this optimization) that can yield a good estimate of S?
- What is a lower bound on the error (expected fraction of people assigned to the incorrect community) achievable by any estimator?
- How robust are our algorithms to the modeling assumptions, and how well do they generalize to settings with more than two communities?

These questions have attracted the attention of people working in statistics, mathematics, computer science, statistical physics, and optimization, and they remain an active area of research today.