Stat410 Probability and Statistics II (F16)

Some Basic Concepts of Statistical Inference (Sec 5.1)

Suppose we have a rv $X$ that has a pdf/pmf denoted by $f(x; \theta)$ or $p(x; \theta)$, where $\theta$ is called the parameter. In previous lectures, we focused on probability problems where the value of $\theta$ is given. From now on, we focus on statistical problems where $\theta$ is unknown and we try to get some information about $\theta$ from a random sample $(X_1, \ldots, X_n)$ from this distribution.

Parameter: $\theta$
Random sample: $(X_1, \ldots, X_n)$ iid $f(\cdot\,; \theta)$
Observed sample: $(x_1, \ldots, x_n)$, one realization of $(X_1, \ldots, X_n)$

Statistic $T = T(X_1, \ldots, X_n)$: a function of the sample, which is also random.

Estimator $\hat\theta = \hat\theta(X_1, \ldots, X_n)$ of $\theta$: a function of the sample, i.e., a statistic. Given an observed sample $(X_1 = x_1, \ldots, X_n = x_n)$, the value $\hat\theta(x_1, \ldots, x_n)$ is called an estimate of $\theta$. So an estimator is a random variable, while an estimate is a real number (i.e., one realization of the estimator).

Overview of Estimation

How to derive an estimator?
- Method of Moments: suppose $E(X) = h(\theta)$. Set the sample mean $\bar X = h(\tilde\theta)$, then solve for $\tilde\theta$.
- Maximum Likelihood Estimator (see below).

Notation: I'll use $\hat\theta$ as a generic notation for an estimator of parameter $\theta$. In problems where we need to derive estimators based on different approaches, I use $\tilde\theta$ for one estimator of $\theta$ and $\hat\theta$ for another estimator of $\theta$; for example, $\tilde\theta$ for the method of moments estimator and $\hat\theta$ for the MLE.

How to evaluate the performance of an estimator? Note that $\hat\theta$ is a random variable, usually a continuous random variable, so the chance that $\hat\theta = \theta$ is zero. But we can ask whether on average $\hat\theta$ equals $\theta$, which leads to the definition of the bias, $\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta$. Another metric is the average squared distance from $\hat\theta$ to the target $\theta$, which leads to the definition of the Mean Squared Error, $\mathrm{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2]$.
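Both quantities are easy to approximate by simulation when the sampling distribution of $\hat\theta$ is hard to derive. Here is a minimal Python sketch (an illustration, not part of the course material; the helper name `mc_bias_mse`, the Exp(λ) example, and the sample sizes are my own choices):

```python
import numpy as np

def mc_bias_mse(estimator, sampler, theta, n, reps=100_000, seed=0):
    """Monte Carlo approximation of Bias(theta_hat) and MSE(theta_hat).

    estimator: function mapping a sample (1-D array) to a real number
    sampler:   function (rng, n) -> sample of size n from f(.; theta)
    """
    rng = np.random.default_rng(seed)
    est = np.array([estimator(sampler(rng, n)) for _ in range(reps)])
    bias = est.mean() - theta
    mse = np.mean((est - theta) ** 2)
    return bias, mse

# Example: X ~ Exp(rate lambda = 2); estimate theta = E(X) = 1/2 by X-bar.
lam = 2.0
bias, mse = mc_bias_mse(
    estimator=np.mean,
    sampler=lambda rng, n: rng.exponential(scale=1 / lam, size=n),
    theta=1 / lam,
    n=20,
)
print(bias, mse)  # bias ~ 0 (X-bar is unbiased); MSE ~ Var(X)/n = 1/80
```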

If we have derived multiple estimators for $\theta$, we can compare their MSEs via their relative efficiency.

Maximum Likelihood Estimator (MLE, Sec 6.1)

MLE: the estimator (or estimators) that maximizes the likelihood function
$$L(\theta; x) = f(x_1, \ldots, x_n; \theta) = \prod_{i=1}^n f(x_i; \theta).$$

How to derive the MLE?
Step 1: Compute $\log f(x; \theta)$.
Step 2: Plug $x_i$ into the expression derived at Step 1 and sum over $i$, which gives the log-likelihood function:
$$\ell(\theta) = \log\Big[\prod_{i=1}^n f(x_i; \theta)\Big] = \sum_{i=1}^n \log f(x_i; \theta).$$
Step 3: Find the maximum of $\ell(\theta)$: $\hat\theta = \arg\max_\theta \ell(\theta)$.

Tips/Tools for Step 3:
- Take the derivative of $\ell(\theta)$ with respect to $\theta$ and solve $\ell'(\theta) = 0$ for $\theta$.
- Be careful when the parameter $\theta$ lies in a bounded region, say $\theta \ge 0$ or $\theta \in [0, 1]$. Make sure the solution $\hat\theta$ is in the range of $\theta$. If it is outside the range, you usually need to check whether one of the boundary points is the maximum.
- If $X$ is bounded and the bounds depend on $\theta$, remember to include the indicator function in $f(x; \theta)$. For example, if $X \sim \mathrm{Unif}(0, \theta]$, then $f(x; \theta) = \frac{1}{\theta}\, 1(0 < x \le \theta)$.
- Some special optimization results (proofs are in the Appendix):
  $\min_a \sum_{i=1}^n |x_i - a|$ is achieved at $a = \mathrm{median}(x_i)$.
  $\max_{p_1, \ldots, p_m} \sum_{j=1}^m n_j \log p_j$, where $0 \le p_j \le 1$, $\sum_j p_j = 1$, and $n_j \ge 0$, is achieved by setting $p_j = n_j / n$, where $n = n_1 + \cdots + n_m$.
- The objective function could have multiple maximizers or no maximizer (i.e., the MLE doesn't exist).

Thm 6.1.2 (Invariance property of the MLE): Let $X_1, \ldots, X_n$ be a random sample with pdf $f(x; \theta)$. Let $\eta = g(\theta)$ be a parameter of interest. Suppose $\hat\theta$ is the MLE of $\theta$. Then $g(\hat\theta)$ is the MLE of $g(\theta)$.

The MLE may not be unique.
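When $\ell'(\theta) = 0$ has no closed form, Step 3 can also be carried out numerically. A minimal sketch (my illustration; the Ber(θ) example, the data, and the grid resolution are arbitrary choices) that confirms the maximizer is $\bar x$:

```python
import numpy as np

# Steps 1-2: log-likelihood of a Bernoulli(theta) sample.
def loglik(theta, x):
    # l(theta) = sum_i [ x_i log(theta) + (1 - x_i) log(1 - theta) ]
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

x = np.array([1, 0, 0, 1, 1, 0, 1, 1])     # an observed sample
grid = np.linspace(0.001, 0.999, 999)      # theta is restricted to (0, 1)
mle = grid[np.argmax([loglik(t, x) for t in grid])]
print(mle, x.mean())                       # both 0.625 = x-bar
```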

Below we list the MLE and the method of moments (MM) estimators for various distribution families. The derivations aren't difficult. In class, I'll go through some of them, but I expect you to work through the remaining ones by yourself.

Dist | pmf/pdf | MLE | MM
Ber(θ) | $\theta^x (1-\theta)^{1-x}$ | $\bar X$ | $\bar X$
Bin(n, θ) | $\binom{n}{y} \theta^y (1-\theta)^{n-y}$ | $Y/n$ | $Y/n$
Geo(p) | $p(1-p)^{x-1}$ | $1/\bar X$ | $1/\bar X$
Poi(λ) | $\frac{\lambda^x}{x!} e^{-\lambda}$ | $\bar X$ | $\bar X$
Exp(λ) | $\lambda e^{-\lambda x}$ | $1/\bar X$ | $1/\bar X$
Exp(1/θ) | $\frac{1}{\theta} e^{-x/\theta}$ | $\bar X$ | $\bar X$
N(µ, σ²) | $\frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2/(2\sigma^2)}$ | $\bar X$, $\frac{1}{n}\sum_i (X_i - \bar X)^2$ | $\bar X$, $\frac{1}{n}\sum_i (X_i - \bar X)^2$
Beta(α, 1) | $\frac{\Gamma(\alpha+1)}{\Gamma(\alpha)} x^{\alpha-1}$ | $-n / \sum_i \log X_i$ | $\bar X / (1 - \bar X)$
Beta(1/θ, 1) | $\frac{\Gamma(1/\theta+1)}{\Gamma(1/\theta)} x^{1/\theta-1}$ | $-\frac{1}{n} \sum_i \log X_i$ | $(1 - \bar X) / \bar X$
Ga(α, β), α known | $\frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}$ | $\alpha/\bar X$ | $\alpha/\bar X$
Ga(α, 1/θ), α known | $\frac{1}{\Gamma(\alpha)\theta^\alpha} x^{\alpha-1} e^{-x/\theta}$ | $\bar X/\alpha$ | $\bar X/\alpha$

Table 1: List of MLE and MM estimators for various parametric families.

For the binomial distribution, we consider the case where we have only one observation $Y \sim \mathrm{Bin}(n, \theta)$. If we have $Y_1, \ldots, Y_m$ iid $\mathrm{Bin}(n, \theta)$, then the MLE and MM estimator become $\bar Y / n = (Y_1 + \cdots + Y_m)/(mn)$.

The parameterization of Geo(p) is different from the one in Appendix D. Here $X \sim \mathrm{Geo}(p)$ denotes the number of Bernoulli trials you have conducted before seeing the first Head, including the last trial in which you observe a Head. That is, $X = 1, 2, \ldots$ and you observe one Head and $(X-1)$ Tails.

Beta(1/θ, 1) is discussed on p3 of Week7 Estimation1as.pdf.
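As a quick sanity check of the table, here is a small simulation sketch (my illustration; the distributions, parameter values, and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Exp(lambda): MLE = MM = 1 / X-bar.
lam = 3.0
x = rng.exponential(scale=1 / lam, size=n)
print(1 / x.mean())                 # ~ 3.0

# Beta(alpha, 1): MLE = -n / sum(log X), MM = X-bar / (1 - X-bar).
alpha = 2.5
x = rng.beta(alpha, 1.0, size=n)
print(-n / np.log(x).sum())         # ~ 2.5 (MLE)
print(x.mean() / (1 - x.mean()))    # ~ 2.5 (MM)
```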

More Examples

(6.1.3): Let $X_1, \ldots, X_n$ be a random sample from the Laplace distribution with pdf $f(x; \theta) = \frac{1}{2} e^{-|x - \theta|}$. Show that the MLE of $\theta$ is given by $\hat\theta = \mathrm{median}(X_1, \ldots, X_n)$. Since the mean of this distribution is $\theta$, the method of moments estimator is $\bar X$.

(6.4.5): Suppose we have a bag of marbles in three different colors: red, blue, and green. To estimate the proportions of the three colors, $(p_1, p_2, p_3)$, we conducted the following experiment: we randomly draw a marble from the bag, record its color,
$$X_i = \begin{cases} 1, & \text{if red} \\ 2, & \text{if blue} \\ 3, & \text{if green} \end{cases}$$
and put it back into the bag; repeat this process $n$ times. The $X_i$'s are iid samples from a multinomial distribution: $X_i = 1, 2, 3$ with probabilities $p_1, p_2, p_3$ respectively. The joint likelihood of $X_1, \ldots, X_n$ is equal to
$$f(X_1, \ldots, X_n \mid p_1, p_2, p_3) = p_1^{n_1} p_2^{n_2} p_3^{n_3},$$
which depends only on $n_j$ = the number of $j$'s among the $n$ marbles. To find the MLE of $p = (p_1, p_2, p_3)^t$, we maximize the log-likelihood function
$$\ell(p) = n_1 \log p_1 + n_2 \log p_2 + n_3 \log p_3 \qquad (1)$$
subject to the constraints $0 \le p_1, p_2, p_3 \le 1$ and $p_1 + p_2 + p_3 = 1$. The MLE is given by
$$\hat p_1 = \frac{n_1}{n}, \quad \hat p_2 = \frac{n_2}{n}, \quad \hat p_3 = \frac{n_3}{n},$$
i.e., we use the sample frequencies to estimate the proportions. The MLE and the MM estimator are the same. If we have observed 2 red marbles, 3 blue marbles, and 5 green marbles, then the MLE for $(p_1, p_2, p_3)$ is given by $\hat p_1 = .2$, $\hat p_2 = .3$, $\hat p_3 = .5$.
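A tiny numeric check of the marble example (my illustration; the random search over the simplex is an arbitrary way to generate competitors): the sample frequencies maximize the log-likelihood (1).

```python
import numpy as np

counts = np.array([2, 3, 5])               # 2 red, 3 blue, 5 green
p_hat = counts / counts.sum()              # sample frequencies (.2, .3, .5)

def loglik(p):
    return np.sum(counts * np.log(p))      # l(p) = sum_j n_j log p_j

rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(3), size=100_000)  # random simplex points
best = max(loglik(p) for p in candidates)
print(loglik(p_hat) >= best)               # True: the frequencies win
```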

Let $X_1, \ldots, X_n$ be a random sample from Exp(λ) with pdf $f(x; \lambda) = \lambda e^{-\lambda x}$, $x > 0$.
a) (6.1.2) Show that the MLE of $\lambda$ is given by $\hat\lambda = 1/\bar X$.
b) What's the MLE of the probability $P(X > 1)$?
c) If we parameterize the exponential family by $\theta = 1/\lambda$, what's the MLE of $\theta$?

Let $X_1, \ldots, X_n$ be a random sample from Ber(θ) with pmf $p(x; \theta) = \theta^x (1-\theta)^{1-x}$, $x = 0, 1$.
a) (6.1.1) Show that the MLE of $\theta$ is given by $\hat\theta = \bar X$.
b) Let $Y = X_1 + \cdots + X_n$, so $Y \sim \mathrm{Bin}(n, \theta)$. Derive the MLE of $\theta$ given $Y$.
c) (6.1.6) Show that the MLE of $\theta$ is given by $\hat\theta = \min(\bar X, 1/3)$, if $0 \le \theta \le 1/3$.

(Example from Week8 Estimation2as.pdf) Let $\lambda > 0$ and let $X_1, \ldots, X_n$ be a random sample from the distribution with pdf
$$f(x; \lambda) = 2\lambda^2 x^3 e^{-\lambda x^2}, \quad x > 0.$$
Define $Y = X^2$, which is a one-to-one transformation since $X > 0$. The pdf of $Y$ is given by
$$f_Y(y) = f_X(\sqrt y)\,\Big|\frac{dx}{dy}\Big|, \qquad \frac{dx}{dy} = \frac{d\sqrt y}{dy} = \frac{1}{2\sqrt y},$$
so
$$f_Y(y) = 2\lambda^2 y^{3/2} e^{-\lambda y} \cdot \frac{1}{2\sqrt y} = \lambda^2 y\, e^{-\lambda y},$$
i.e., $Y \sim \mathrm{Ga}(2, 1/\lambda)$. So the MLE and the MM estimator of $\lambda$ are the same, given by
$$\frac{2}{\bar Y} = \frac{2n}{\sum_i X_i^2}.$$
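A simulation check of this transformation example (my illustration; the values of λ and n are arbitrary): draw $Y \sim \mathrm{Ga}(2, 1/\lambda)$, set $X = \sqrt Y$, and confirm that $2n / \sum_i X_i^2$ recovers λ.

```python
import numpy as np

rng = np.random.default_rng(2)
lam, n = 1.5, 500_000

# Y = X^2 ~ Ga(shape 2, scale 1/lambda), so sample Y and take square roots.
y = rng.gamma(shape=2.0, scale=1 / lam, size=n)
x = np.sqrt(y)

lam_hat = 2 * n / np.sum(x ** 2)   # MLE = MM = 2 / Y-bar
print(lam_hat)                     # ~ 1.5
```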

(6.1.5): Let $X_1, \ldots, X_n$ be a random sample from Unif(0, θ] with pdf $f(x; \theta) = \frac{1}{\theta}\, 1(0 < x \le \theta)$.
a) Derive the MLE of $\theta$. The likelihood function is given by
$$f(x_1, \ldots, x_n) = \prod_{i=1}^n \frac{1}{\theta}\, 1(0 < x_i \le \theta) = \frac{1}{\theta^n}\, 1(0 < \text{all } x_i\text{'s} \le \theta).$$
Graph this function: it is a monotone decreasing function of $\theta$ when $\theta \ge \max_i x_i$, but zero when $\theta < \max_i x_i$. So $\hat\theta = \max_i X_i$.
b) Derive the MM estimator of $\theta$. $EX = \theta/2$, so $\tilde\theta = 2\bar X$.
c) What if we change the distribution to Unif(0, θ), i.e., $f(x; \theta) = \frac{1}{\theta}\, 1(0 < x < \theta)$? The likelihood function becomes $\frac{1}{\theta^n}\, 1(0 < \text{all } x_i\text{'s} < \theta)$, which is a decreasing function when $\theta > \max_i x_i$, but zero when $\theta \le \max_i x_i$. So the MLE does not exist in this case.

Unbiased Estimators

An estimator is called unbiased if $E(\hat\theta) = \theta$. The following results are useful when checking whether an estimator is biased/unbiased:
$$E\Big(\sum_i a_i X_i\Big) = \sum_i a_i E(X_i),$$
$$\mathrm{Var}\Big(\sum_i a_i X_i\Big) = \sum_{i,j} a_i a_j \mathrm{Cov}(X_i, X_j) = \sum_i a_i^2 \mathrm{Var}(X_i) + 2 \sum_{i<j} a_i a_j \mathrm{Cov}(X_i, X_j).$$

If $E(X) = \theta$ and your estimator of $g(\theta)$ is $g(\bar X)$, then you likely need Jensen's inequality to show that $g(\bar X)$ is biased. You need to check whether $g$ or $-g$ is convex, and then apply Jensen's inequality:
$g''(x) \ge 0$: $g$ is convex and $E g(\bar X) \ge g(\theta)$;
$g''(x) \le 0$: $-g$ is convex and $E g(\bar X) \le g(\theta)$.

Let $(X_1, \ldots, X_n)$ be a random sample from a distribution with mean $\mu$ and variance $\sigma^2$. Then
$$E \bar X = \mu, \qquad E(S^2) = E\Big[\frac{1}{n-1}\sum_i \big(X_i - \bar X\big)^2\Big] = \sigma^2.$$
That is, the sample mean and sample variance are unbiased. How to show $E(S^2) = \sigma^2$? First,
$$\sum_i (X_i - \bar X)^2 = \sum_i \big(X_i^2 - 2 X_i \bar X + \bar X^2\big) = \sum_i X_i^2 - 2\bar X \sum_i X_i + n \bar X^2 = \sum_i X_i^2 - n \bar X^2.$$

Then
$$E\Big[\sum_i (X_i - \bar X)^2\Big] = E\Big[\sum_i X_i^2 - n \bar X^2\Big] = \sum_i E(X_i^2) - n E(\bar X^2) = n(\mu^2 + \sigma^2) - n\Big(\mu^2 + \frac{\sigma^2}{n}\Big) = (n-1)\sigma^2,$$
so $E(S^2) = \sigma^2$.

Most estimators in Table 1 are unbiased, except:
- Exp(λ): the estimator $1/\bar X$ is biased (Jensen's inequality), while the estimator $\bar X$ for $\theta = 1/\lambda$ is unbiased.
- N(µ, σ²): the MLE of $\sigma^2$ is biased.
- Geo(p): the estimator $1/\bar X$ is biased (Jensen's inequality).
- Beta(α, 1): both the MLE and the MM estimator of $\alpha$ are biased (Jensen's inequality). The MLE of $\theta = 1/\alpha$ is unbiased, but the MM estimator of $\theta$ is biased (Jensen's inequality). As we'll see, this case is the same as the exponential distribution, since $-\log X$ follows an exponential distribution.
- Ga(α, β) with α known: the estimator $\alpha/\bar X$ for $\beta$ is biased (Jensen's inequality), but the estimator $\bar X/\alpha$ for $\theta = 1/\beta$ is unbiased.

More Examples.
- Let $(X_1, \ldots, X_n)$ be a random sample from Unif(0, θ]. The MLE is biased:
$$Y_n = \max_i X_i, \qquad E(Y_n) = \frac{n}{n+1}\theta.$$
The MM estimator $\tilde\theta = 2\bar X$ is unbiased (see the simulation check after this list).
- Let $(X_1, \ldots, X_n)$ be a random sample from a distribution with pdf $f(x)$ whose mean $\mu$ exists and which is symmetric about $\mu$. Show that $E(\text{sample median}) = \mu$. Without loss of generality, assume $\mu = 0$ (why?). That is, the pdf $f(x)$ is symmetric about zero. Due to the symmetry, the distribution of a random sample $(X_1, \ldots, X_n)$ is the same as the distribution of $(-X_1, \ldots, -X_n)$. So the distribution of the sample median of $(X_1, \ldots, X_n)$ is the same as the distribution of the sample median of $(-X_1, \ldots, -X_n)$:
$$E(\text{sample median of } X_{1:n}) = E(\text{sample median of } -X_1, \ldots, -X_n) = -E(\text{sample median of } X_{1:n}).$$
So $E(\text{sample median of } X_{1:n}) = 0$.
- $X_1, \ldots, X_n \sim$ Laplace distribution with pdf $f(x; \theta) = \frac{1}{2} e^{-|x-\theta|}$. The MLE $\hat\theta = \mathrm{Med}(X_1, \ldots, X_n)$ is unbiased.
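The promised simulation check of the first bullet (my illustration; θ, n, and the number of replications are arbitrary): for Unif(0, θ], the sample maximum has mean $n\theta/(n+1)$, while $2\bar X$ is centered at θ.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, n, reps = 4.0, 10, 200_000

x = rng.uniform(0.0, theta, size=(reps, n))
print(x.max(axis=1).mean())         # ~ n/(n+1)*theta = 40/11 ~ 3.64 (biased)
print((2 * x.mean(axis=1)).mean())  # ~ theta = 4.0 (unbiased)
```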

Mean Squared Error (MSE)

For an estimator $\hat\theta$ of $\theta$, define the Mean Squared Error of $\hat\theta$ by
$$\mathrm{MSE}(\hat\theta) = E(\hat\theta - \theta)^2 = [E(\hat\theta) - \theta]^2 + \mathrm{Var}(\hat\theta) = \mathrm{Bias}^2 + \mathrm{Var}.$$
In particular, if $\hat\theta$ is unbiased, then $\mathrm{MSE}(\hat\theta) = \mathrm{Var}(\hat\theta)$.

Let $\hat\theta_1$ and $\hat\theta_2$ be two unbiased estimators of $\theta$. $\hat\theta_1$ is said to be more efficient than $\hat\theta_2$ if $\mathrm{Var}(\hat\theta_1) < \mathrm{Var}(\hat\theta_2)$. The relative efficiency of $\hat\theta_1$ with respect to (wrt) $\hat\theta_2$ is $\mathrm{Var}(\hat\theta_2)/\mathrm{Var}(\hat\theta_1)$.

Examples. $X_1, \ldots, X_n \sim$ Unif(0, θ]. We have learned that the MLE $\hat\theta$ and the MM estimator $\tilde\theta$ are given by
$$\hat\theta = Y_n = \max_i X_i, \qquad \tilde\theta = 2\bar X.$$
Which estimator is better (i.e., has the smaller MSE), $\hat\theta$ or $\tilde\theta$?
$$\mathrm{MSE}(\hat\theta) = \frac{2\theta^2}{(n+1)(n+2)}, \qquad \mathrm{MSE}(\tilde\theta) = \frac{\theta^2}{3n}.$$

To derive these, recall $f_Y(y) = n y^{n-1}/\theta^n$, $0 < y \le \theta$, so
$$EY = \int_0^\theta y f_Y(y)\,dy = \frac{n}{n+1}\theta, \qquad EY^2 = \int_0^\theta y^2 f_Y(y)\,dy = \frac{n}{n+2}\theta^2,$$
$$\mathrm{Var}(Y) = EY^2 - (EY)^2 = \frac{n\theta^2}{(n+1)^2(n+2)},$$
$$\mathrm{MSE}(\hat\theta) = \mathrm{Bias}^2 + \mathrm{Var}(Y) = (EY - \theta)^2 + EY^2 - (EY)^2 = \frac{2\theta^2}{(n+1)(n+2)},$$
$$\mathrm{MSE}(\tilde\theta) = 4\,\mathrm{Var}(\bar X) = \frac{4}{n}\mathrm{Var}(X_1) = \frac{4}{n}\cdot\frac{\theta^2}{12} = \frac{\theta^2}{3n}.$$

Note that although $\tilde\theta = 2\bar X$ is unbiased for $\theta$ while $\hat\theta = \max_i X_i$ is biased, the MSE of $\hat\theta$ is much smaller than the MSE of $\tilde\theta$ for large $n$.

What must $c$ equal if $c\hat\theta$ is to be an unbiased estimator of $\theta$? That is, construct an unbiased estimator based on the MLE $\hat\theta$.

Which estimator is more efficient, $\tilde\theta$ or $\frac{n+1}{n}\hat\theta$? What's the relative efficiency of $\frac{n+1}{n}\hat\theta$ wrt $\tilde\theta$?
$$\mathrm{MSE}\Big(\frac{n+1}{n}\hat\theta\Big) = \mathrm{Var}\Big(\frac{n+1}{n}\hat\theta\Big) = \frac{(n+1)^2}{n^2}\mathrm{Var}(Y) = \frac{(n+1)^2}{n^2}\cdot\frac{n\theta^2}{(n+1)^2(n+2)} = \frac{\theta^2}{n(n+2)}.$$
The relative efficiency of $\frac{n+1}{n}\hat\theta$ wrt $\tilde\theta$ is equal to
$$\frac{\theta^2}{3n} \Big/ \frac{\theta^2}{n(n+2)} = \frac{n+2}{3}.$$
When the sample size $n$ gets larger, the estimator $\frac{n+1}{n}\hat\theta$ (which is also unbiased) is much more efficient than the MM estimator $\tilde\theta$.

(*) (Sec 7.1) In general, it's difficult to compare two estimators based on their MSE, since the MSE may depend on the unknown parameter $\theta$: it is possible that $\mathrm{MSE}(\hat\theta_1) < \mathrm{MSE}(\hat\theta_2)$ (i.e., $\hat\theta_1$ is better) when $\theta > 1$, while $\hat\theta_2$ is better for $\theta < 1$. In courses on statistical decision theory, you'll learn how to compare estimators based on their maximum MSE (i.e., their worst-case performance) or based on averaged MSE (i.e., Bayes risk).
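The three MSE formulas in this example are easy to verify numerically. A minimal sketch (my illustration; θ, n, and the replication count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.0, 20, 200_000

x = rng.uniform(0.0, theta, size=(reps, n))
mle = x.max(axis=1)                 # theta-hat   = max X_i
mm = 2 * x.mean(axis=1)             # theta-tilde = 2 X-bar
unb = (n + 1) / n * mle             # bias-corrected MLE

for est, exact in [(mle, 2 * theta**2 / ((n + 1) * (n + 2))),
                   (mm, theta**2 / (3 * n)),
                   (unb, theta**2 / (n * (n + 2)))]:
    print(np.mean((est - theta) ** 2), exact)   # simulated vs. exact MSE
```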

Summary

Estimators you'll encounter in Stat 410 take the following forms:

1. $\hat\theta$ = the sample mean $\bar X$ or a linear function of the sample mean, e.g., $2\bar X$. $E(a\bar X) = a\mu$ (usually unbiased). $\mathrm{Var}(a\bar X) = a^2 \sigma^2 / n$.
2. $\hat\theta$ = the average of a transformation of the $X_i$'s, e.g., $\frac{1}{n}\sum_i \log X_i$ or $\frac{1}{n}\sum_i X_i^2$. Define $Y_i = \log X_i$; then $\hat\theta = \frac{1}{n}\sum_i \log X_i = \bar Y$. Find the distribution of $Y_i$, and you are back to the previous case.
3. $\hat\theta = g(\bar X)$, a function of the sample mean, e.g., $1/\bar X$. Check the sign of the second derivative of $g$ in the range of possible values of $\bar X$. Use Jensen's inequality to show $\hat\theta$ is biased. Usually you won't be asked to compute the bias and variance.
4. $\hat\theta$ = an order statistic, e.g., $Y_1 = \min_i X_i$ or $Y_n = \max_i X_i$. Find the distribution of $Y_1$ or $Y_n$. Compute the mean (usually biased) and variance. For the variance, use the formula $EY^2 - (EY)^2$.

Appendix

(Feel free to ignore the materials in the Appendix.)

Let $x_1, \ldots, x_n$ be a sequence of numbers; find the value $a$ that minimizes
$$g(a) = \sum_{i=1}^n |x_i - a|.$$
It is equivalent to write the objective function as $g(a) = \sum_{i=1}^n |x_{(i)} - a|$, where $x_{(1)} < \cdots < x_{(n)}$ are the order statistics of the $x_i$'s. It's okay to assume that all the $x_i$'s are different (i.e., no ties), since they are usually random samples from a continuous distribution.

Recall this result: suppose $g(x) = |x|$; then $g'(x) = 1$ if $x > 0$, $g'(x) = -1$ if $x < 0$, and $g'(0)$ does not exist. So the derivative of $g$ is not well defined at the $x_i$'s.

Suppose $a \le x_{(1)}$. Then $g(a) = \sum_i (x_{(i)} - a) = \sum_i x_i - na$, which is a decreasing function of $a$, so its minimum over this region is achieved at $a = x_{(1)}$. Similarly, we can check the case when $a \ge x_{(n)}$. We can conclude that it suffices to find the optimal value of $a$ in the data range $[x_{(1)}, x_{(n)}]$.

Assume we have an odd number of samples, i.e., $n = 2m+1$. When $a \in [x_{(1)}, x_{(m+1)}]$, $g(a)$ is a decreasing function; when $a \in [x_{(m+1)}, x_{(n)}]$, $g(a)$ is an increasing function. Therefore the minimum of $g(a)$ is achieved at $a = x_{(m+1)}$, the median of $(x_1, \ldots, x_n)$.

Assume we have an even number of samples, i.e., $n = 2m$. When $a \in [x_{(1)}, x_{(m)}]$, $g(a)$ is a decreasing function; when $a \in [x_{(m+1)}, x_{(n)}]$, $g(a)$ is an increasing function; when $a \in [x_{(m)}, x_{(m+1)}]$, $g(a)$ is a constant. So the minimizer of $g(a)$ is any value in the interval $[x_{(m)}, x_{(m+1)}]$, the sample median of the data points, which is not unique.
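A quick numerical illustration of this result (my addition; the data and grid resolution are arbitrary): evaluate $g$ on a fine grid and compare the minimizer to the sample median.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=7)                       # odd n: unique minimizer

grid = np.linspace(x.min(), x.max(), 100_001)
g = np.abs(x[:, None] - grid[None, :]).sum(axis=0)  # g(a) = sum_i |x_i - a|

print(grid[np.argmin(g)], np.median(x))      # both ~ the sample median
```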

$\max_{p_1, \ldots, p_m} \sum_{j=1}^m n_j \log p_j$, where $0 \le p_j \le 1$, $\sum_j p_j = 1$, and $n_j \ge 0$, is achieved by setting $p_j = n_j/n$, where $n = n_1 + \cdots + n_m$.

This is a constrained optimization problem, which, of course, can be solved using tools like the Lagrange multiplier. Next I'll give a simple derivation based on the non-negativity of the Kullback-Leibler divergence. Scale the objective function by $(-1/n)$ and look for the minimum. We have
$$-\frac{1}{n}\sum_{j=1}^m n_j \log p_j = \sum_{j=1}^m \frac{n_j}{n} \log \frac{n_j/n}{p_j} - \sum_{j=1}^m \frac{n_j}{n} \log \frac{n_j}{n},$$
where the second term (on the right) has nothing to do with $(p_1, \ldots, p_m)$, and the first term is the Kullback-Leibler divergence between two multinomial distributions, whose minimum is achieved by setting
$$\hat p_1 = \frac{n_1}{n}, \;\ldots,\; \hat p_m = \frac{n_m}{n}.$$
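To close, a small check of the identity above (my addition; the counts and the competing $p$ are arbitrary): the right-hand side matches the scaled negative log-likelihood, and the KL term is nonnegative, vanishing exactly at $p = (n_1/n, \ldots, n_m/n)$.

```python
import numpy as np

counts = np.array([2.0, 3.0, 5.0])
n = counts.sum()
q = counts / n                              # empirical frequencies

def kl(q, p):
    return np.sum(q * np.log(q / p))        # KL(q || p) >= 0, = 0 iff p = q

rng = np.random.default_rng(6)
p = rng.dirichlet(np.ones(3))               # an arbitrary competing p
lhs = -np.sum(counts * np.log(p)) / n       # -(1/n) sum_j n_j log p_j
rhs = kl(q, p) - np.sum(q * np.log(q))      # identity from the derivation
print(np.isclose(lhs, rhs), kl(q, p) >= 0)  # True True
```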