Lecture 9: September 19

36-700: Probability and Mathematical Statistics I                                  Fall 2016
Lecture 9: September 19
Lecturer: Siva Balakrishnan

9.1 Review and Outline

Last class we discussed:

- Statistical estimation broadly
- Point estimation
- Bias-variance tradeoff
- Confidence sets

In this lecture we will go into more detail about point estimation. In particular, we will discuss methods for constructing estimators. This is in Chapter 7 of Casella and Berger. In the next lecture we will discuss methods for evaluating estimators.

9.2 Finding estimators

Broadly, point estimation is aligned with the general statistical pursuit of trying to understand an underlying population from a sample. In particular, under the hypothesis that the sample was generated from some parametric statistical model, a natural way to understand the underlying population is to estimate the parameters of the statistical model.

One can of course wonder: what happens if the model is wrong? And where do models really come from? In some rare cases, we actually know enough about our data to hypothesize a reasonable model. Most often, however, when we specify a model, we do so hoping that it can provide a useful approximation to the data-generating mechanism. The George Box quote is worth remembering in this context: "all models are wrong, but some are useful."

9.2.1 Method of moments

We saw this idea before; we will see a couple of other examples today. Broadly, the idea is that we can estimate the sample moments in a straightforward way:

$m_1 = \frac{1}{n} \sum_{i=1}^n X_i, \quad m_2 = \frac{1}{n} \sum_{i=1}^n X_i^2, \quad \ldots, \quad m_k = \frac{1}{n} \sum_{i=1}^n X_i^k.$

On the other hand, the population moments are some functions of the unknown parameters. We can denote the population moments as:

$\mu_1(\theta_1, \ldots, \theta_k), \ldots, \mu_k(\theta_1, \ldots, \theta_k).$

The method of moments prescribes estimating the parameters $\theta_1, \ldots, \theta_k$ by solving the system of equations:

$m_1 = \mu_1(\theta_1, \ldots, \theta_k),$
$\vdots$
$m_k = \mu_k(\theta_1, \ldots, \theta_k).$

We already saw an application of this idea to estimating the mean and variance of a Gaussian.

Example 1: If $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$, we would solve:

$\frac{1}{n} \sum_{i=1}^n X_i = \theta, \quad \frac{1}{n} \sum_{i=1}^n X_i^2 = \theta^2 + \sigma^2,$

to obtain the estimators:

$\hat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i, \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n X_i^2 - \left( \frac{1}{n} \sum_{j=1}^n X_j \right)^2,$

which of course are natural estimators of these quantities. You can see that method-of-moments estimators can be biased, since in this example the estimator for the variance has non-zero bias.
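As a quick numerical illustration, here is a minimal sketch of this calculation in Python; NumPy is assumed to be available, and the sample is simulated with arbitrary parameter values purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
theta_true, sigma_true = 2.0, 3.0
X = rng.normal(theta_true, sigma_true, size=n)   # simulated N(theta, sigma^2) sample

# Sample moments m_1 and m_2.
m1 = np.mean(X)
m2 = np.mean(X ** 2)

# Solve m_1 = theta, m_2 = theta^2 + sigma^2 for the parameters.
theta_hat = m1
sigma2_hat = m2 - m1 ** 2    # equals the (biased) 1/n sample variance

print(theta_hat, sigma2_hat)
```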

Example 2: Suppose $X_1, \ldots, X_n \sim \mathrm{Bin}(k, p)$, i.e. each $X_i$ is the sum of $k$ Ber($p$) RVs, and that both $k$ and $p$ are unknown. Casella and Berger point out that this model has been used to model crimes, where usually neither the total number of crimes $k$ nor the reporting rate $p$ is known. Let us first compute the first two moments of the binomial:

$\mu_1 = kp, \quad \mu_2 = kp(1-p) + k^2 p^2.$

Denoting $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$, we solve the system of equations:

$\bar{X} = kp, \quad \frac{1}{n} \sum_{i=1}^n X_i^2 = kp(1-p) + k^2 p^2.$

This yields the estimators:

$\hat{p} = \frac{\bar{X} - \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2}{\bar{X}}, \quad \hat{k} = \frac{\bar{X}^2}{\bar{X} - \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2}.$

It is worth pointing out that in this case there are no a priori obvious estimators for the parameters of interest. Further, the method-of-moments estimators in this case can be negative (while the true parameters cannot be). However, when the method-of-moments estimator is negative it is because the sample mean is dominated by the sample variance, which is a sign of excessive variability.

Another important use of the method of moments in statistics is in approximating complex distributions by simpler ones. This is often called moment matching.
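A similar sketch for the binomial example, again with an arbitrary simulated sample: note that $\hat{k}$ need not be an integer, and the denominator can be non-positive when the sample variance dominates the sample mean.

```python
import numpy as np

rng = np.random.default_rng(1)
k_true, p_true, n = 20, 0.3, 500
X = rng.binomial(k_true, p_true, size=n)   # simulated Bin(k, p) sample

Xbar = X.mean()
S2 = np.mean((X - Xbar) ** 2)              # (1/n) * sum (X_i - Xbar)^2

denom = Xbar - S2                          # can be <= 0 under excessive variability
k_hat = Xbar ** 2 / denom
p_hat = Xbar / k_hat                       # equivalently (Xbar - S2) / Xbar

print(k_hat, p_hat)
```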

Example 3: The simplest example of moment matching comes from the central limit theorem. Suppose that the RVs $X_1, \ldots, X_n$ are independent with known distributions (it suffices if just their means and variances are known). Typically the exact distribution of

$S_n = \sum_{i=1}^n X_i$

can be quite unwieldy. One possibility would be to try to approximate this distribution by a Gaussian distribution. We do this via the method of moments:

$\hat{\theta} = E(S_n) = \sum_{i=1}^n \mu_i, \quad \hat{\sigma}^2 = E(S_n - E(S_n))^2 = \sum_{i=1}^n \sigma_i^2.$

Thus we could approximate the distribution of the sum by the Gaussian $N(\hat{\theta}, \hat{\sigma}^2)$. More generally, we can always try to approximate a complex distribution by a much simpler one in this way.
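One way to sanity-check the moment-matched Gaussian is to compare it against Monte Carlo draws of the sum. The sketch below uses arbitrarily chosen independent Exponential components purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
# Independent components with known means and variances
# (here Exponential with mean mu_i, chosen arbitrarily; Var = mu_i^2).
mus = np.linspace(0.5, 2.0, n)
sigma2s = mus ** 2

theta_hat = mus.sum()            # E(S_n)
sigma2_hat = sigma2s.sum()       # Var(S_n)

# Monte Carlo draws of S_n versus the moment-matched Gaussian approximation.
S = rng.exponential(scale=mus, size=(100_000, n)).sum(axis=1)
print("Monte Carlo mean/var:       ", S.mean(), S.var())
print("Moment-matched N(theta, s2):", theta_hat, sigma2_hat)
```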

9.2.2 Maximum Likelihood Estimation

The most popular technique for deriving estimators is the principle of maximum likelihood. Suppose that $X_1, \ldots, X_n \sim f_\theta$, where $f_\theta$ denotes either the pmf or pdf of the population. The likelihood function is defined by:

$L(\theta \mid X_1, \ldots, X_n) = \prod_{i=1}^n f_\theta(X_i).$

A maximum likelihood estimator (MLE) of $\theta$ based on the sample $\{X_1, \ldots, X_n\}$ is any value of the parameter at which the function $L(\theta \mid X_1, \ldots, X_n)$ is maximized (as a function of $\theta$, with $X_1, \ldots, X_n$ held fixed).

Unlike the method-of-moments estimator, the MLE will always obey natural parameter restrictions, i.e. the range of the MLE is the same as the range of the parameter. Intuitively, the MLE is just the value of the parameter which is most likely to have generated the sample, and is thus a natural choice as a point estimate. It also has many optimality properties in certain settings: indeed, much of the early theoretical development in statistics revolved around showing the optimality of the MLE in various senses.

There are some common difficulties with using the MLE. The first is computational: how do we find the global maximum of the likelihood function, and how do we verify/certify that it is in fact the global maximum? The next two are related to numerical sensitivity: (1) we can often only approximately maximize the likelihood (using a numerical procedure like gradient descent), so we need the likelihood to be a well-behaved function of the parameters; (2) similarly, it is also natural to hope that the estimator is a well-behaved function of the data, i.e. we would hope that a slightly different sample would not result in a vastly different MLE.

The typical way to compute the MLE (supposing that we have $k$ unknown parameters) is to either analytically or numerically solve the system of equations:

$\frac{\partial}{\partial \theta_i} L(\theta \mid X_1, \ldots, X_n) = 0, \quad i = 1, \ldots, k.$

Example 1: Suppose $X_1, \ldots, X_n \sim N(\theta, 1)$; then the likelihood function is given as:

$L(\theta \mid X_1, \ldots, X_n) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}} \exp(-(X_i - \theta)^2 / 2).$

A frequently useful simplification is to observe that if $\theta$ maximizes $L(\theta \mid X_1, \ldots, X_n)$, then $\theta$ also maximizes $\log L(\theta \mid X_1, \ldots, X_n)$. We can also of course ignore constants. So we have that:

$\hat{\theta} = \arg\max_\theta \log L(\theta \mid X_1, \ldots, X_n) = \arg\min_\theta \sum_{i=1}^n (X_i - \theta)^2,$

so that we obtain:

$\hat{\theta} = \frac{1}{n} \sum_{i=1}^n X_i.$

We need to further verify the second-order condition: the second derivative of the negative log-likelihood (equivalently, of $\sum_{i=1}^n (X_i - \theta)^2$) is $2n > 0$, so $\hat{\theta}$ is indeed a minimizer of the negative log-likelihood, i.e. a maximizer of the likelihood.

Example 2: Suppose that $X_1, \ldots, X_n \sim \mathrm{Ber}(p)$; then the log-likelihood is given by

$\log L(p \mid X_1, \ldots, X_n) = \sum_{i=1}^n X_i \log p + \left(n - \sum_{i=1}^n X_i\right) \log(1 - p) = n\bar{X} \log p + n(1 - \bar{X}) \log(1 - p),$

which is maximized at

$\hat{p} = \bar{X}.$
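The closed-form MLEs above can also be checked against a generic numerical maximization of the log-likelihood. The sketch below does this for the Bernoulli example, assuming NumPy and SciPy are available and using a simulated sample.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
X = rng.binomial(1, 0.35, size=200)   # Ber(p) sample with p = 0.35
Xbar = X.mean()

def neg_log_lik(p):
    # Negative Bernoulli log-likelihood (minimized instead of maximized).
    return -(X.sum() * np.log(p) + (len(X) - X.sum()) * np.log(1 - p))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("closed-form p_hat:", Xbar)
print("numerical   p_hat:", res.x)
```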

9.2.3 Invariance of the MLE

Roughly, the invariance property of the MLE states: suppose that the MLE for a parameter $\theta$ is given by $\hat{\theta}$; then the MLE for a parameter $\tau(\theta)$ is $\tau(\hat{\theta})$. Let us denote the likelihood as a function of the transformed parameter by $L^*(\eta \mid X_1, \ldots, X_n)$. For invariance we need to consider two cases:

1. $\tau$ is an invertible transformation: In this case there is a mapping from the likelihood at a point $\eta$ to the likelihood at the point $\theta = \tau^{-1}(\eta)$, i.e.

$L^*(\eta \mid X_1, \ldots, X_n) = L(\tau^{-1}(\eta) \mid X_1, \ldots, X_n),$

and we have that the MLE for $\eta$ is:

$\hat{\eta} = \arg\sup_\eta L^*(\eta \mid X_1, \ldots, X_n) = \arg\sup_\eta L(\tau^{-1}(\eta) \mid X_1, \ldots, X_n) = \tau(\arg\sup_\theta L(\theta \mid X_1, \ldots, X_n)) = \tau(\hat{\theta}).$

2. $\tau$ is not invertible: In this case, it is possible that several parameters $\theta_1, \theta_2, \ldots$ map to the same value $\eta$, so we need some care in defining the induced likelihood. In particular, if we define:

$L^*(\eta \mid X_1, \ldots, X_n) = \sup_{\theta : \tau(\theta) = \eta} L(\theta \mid X_1, \ldots, X_n),$

then one can verify that the MLE is invariant.

We will discuss other properties of the MLE in future lectures.

9.3 Bayes Estimators

The third general method for deriving estimators is a bit different philosophically from the first two. At a high level we need to understand the Bayesian approach (we will of course stay clear of any philosophical questions). In our approaches so far (so-called frequentist approaches) we assumed that there was a fixed but unknown true parameter, and that we observed samples drawn i.i.d. from the population (whose density/mass function/distribution was indexed by the unknown parameter).

In the Bayesian approach we consider the parameter $\theta$ to be random. We have a prior belief about the distribution of the parameter, which we update after seeing the samples $X_1, \ldots, X_n$. We update our belief using Bayes' rule, and the updated distribution is known as the posterior distribution. We denote the prior distribution by $\pi(\theta)$ and the posterior distribution by $\pi(\theta \mid X_1, \ldots, X_n)$. Using Bayes' rule we have that:

$\pi(\theta \mid X_1, \ldots, X_n) = \frac{\pi(\theta) L(\theta \mid X_1, \ldots, X_n)}{\int_\theta \pi(\theta) L(\theta \mid X_1, \ldots, X_n) \, d\theta}.$

The posterior distribution is a distribution over the possible parameter values. In this lecture we are focusing on point estimation, so one common candidate is the posterior mean (i.e. the expected value of the posterior distribution).
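Before specializing to conjugate priors in the examples below, here is a minimal sketch of computing a posterior and its mean by brute force on a grid, for a hypothetical Ber(p) sample with a flat prior; this is purely illustrative and not part of any particular example in Casella and Berger.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.binomial(1, 0.7, size=50)            # Ber(p) data

p_grid = np.linspace(0.001, 0.999, 999)      # grid over the parameter space
prior = np.ones_like(p_grid)                 # flat prior pi(p) = 1
log_lik = X.sum() * np.log(p_grid) + (len(X) - X.sum()) * np.log(1 - p_grid)

unnorm = prior * np.exp(log_lik - log_lik.max())   # subtract max for stability
posterior = unnorm / unnorm.sum()                  # discrete Bayes' rule on the grid

posterior_mean = (p_grid * posterior).sum()
print(posterior_mean)
```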

Ignoring any philosophical questions, one can view this methodology as a way to generate candidate point estimators by specifying reasonable prior distributions. In practice, however, this calculation is hard to do analytically, so we often end up specifying priors out of convenience rather than out of any real prior belief. In particular, a convenient choice of prior distribution is one for which the posterior distribution belongs to the same family as the prior distribution: such priors are called conjugate priors.

Example 1: Binomial Bayes estimator: Suppose $X_1, \ldots, X_n \sim \mathrm{Ber}(p)$. We will first need to define the Beta distribution: an RV has a Beta distribution with parameters $\alpha$ and $\beta$ if its density on $[0, 1]$ is

$f(p) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha - 1} (1 - p)^{\beta - 1}.$

For us it will be sufficient to ignore the normalizing part and just remember that the Beta density satisfies:

$f(p) \propto p^{\alpha - 1} (1 - p)^{\beta - 1}.$

The mean of the Beta distribution is $\alpha / (\alpha + \beta)$. Let us denote $\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$. There are two candidate priors we will consider.

1. The flat/uninformative prior: $\pi(p) = 1$ for $0 \le p \le 1$. In this case, the posterior density is:

$f(p \mid X_1, \ldots, X_n) \propto p^{n\bar{X}} (1 - p)^{n(1 - \bar{X})} = p^{(n\bar{X} + 1) - 1} (1 - p)^{(n(1 - \bar{X}) + 1) - 1}.$

This is just a $\mathrm{Beta}(n\bar{X} + 1, n(1 - \bar{X}) + 1)$ distribution. So our estimate (the posterior mean) would be:

$\hat{p} = \frac{n\bar{X} + 1}{n + 2} = \frac{n}{n + 2} \bar{X} + \frac{2}{n + 2} \cdot \frac{1}{2} = w \bar{X} + (1 - w) \cdot \frac{1}{2},$

which can be viewed as a convex combination of the MLE and the prior mean $1/2$.

2. The other common prior is the one that is conjugate to the Bernoulli likelihood, i.e. the Beta prior. A similar calculation to the one above will show that if we use $\pi(p) \sim \mathrm{Beta}(\alpha, \beta)$, then the posterior distribution will also be a Beta distribution:

$f(p \mid X_1, \ldots, X_n) \sim \mathrm{Beta}(\alpha + n\bar{X}, \beta + n(1 - \bar{X})),$

and our Bayes estimator (the posterior mean) would be:

$\hat{p} = \frac{\alpha + n\bar{X}}{\alpha + \beta + n}.$

Example 2: Gaussian Bayes estimator: Here we suppose that our prior belief is that the parameter has distribution $N(\mu, \tau^2)$ and we observe $X_1, \ldots, X_n$ drawn from $N(\theta, \sigma^2)$. We assume that $\sigma^2$, $\mu$, $\tau^2$ are all known. A fairly involved calculation in this case (see for instance the exercises in Wasserman's chapter on Bayesian inference) will show that the posterior distribution is also a Gaussian with parameters:

$\hat{\mu} = \frac{\tau^2}{\sigma^2 + n\tau^2} \sum_{i=1}^n X_i + \frac{\sigma^2}{\sigma^2 + n\tau^2} \mu, \quad \hat{\sigma}^2 = \frac{\sigma^2 \tau^2}{\sigma^2 + n\tau^2},$

which suggests the point estimate $\hat{\mu}$. This can once again be seen as a convex combination of our prior belief and the MLE, i.e.:

$\hat{\mu} = w \left( \frac{1}{n} \sum_{i=1}^n X_i \right) + (1 - w) \mu,$

for some $0 \le w \le 1$ (here $w = n\tau^2 / (\sigma^2 + n\tau^2)$).
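A short sketch illustrating both conjugate examples numerically, with arbitrary hyperparameters and simulated data; it simply evaluates the posterior means above and shows how they shrink the MLE toward the prior mean.

```python
import numpy as np

rng = np.random.default_rng(5)

# Beta-Bernoulli: prior Beta(alpha, beta), data Ber(p).
alpha, beta, n = 2.0, 2.0, 100
X = rng.binomial(1, 0.6, size=n)
Xbar = X.mean()
p_hat_bayes = (alpha + n * Xbar) / (alpha + beta + n)   # posterior mean

# Gaussian: prior N(mu, tau^2), data N(theta, sigma^2) with sigma^2 known.
mu, tau2, sigma2 = 0.0, 1.0, 4.0
Y = rng.normal(1.5, np.sqrt(sigma2), size=n)
w = n * tau2 / (sigma2 + n * tau2)
mu_hat_bayes = w * Y.mean() + (1 - w) * mu              # shrinks the MLE toward mu

print(p_hat_bayes, mu_hat_bayes)
```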