Inference in non-linear time series

Erik Lindström, Centre for Mathematical Sciences, Lund University (LU/LTH & DTU)

Overview: Introduction. General properties of estimators. Least squares. Maximum likelihood. MLE asymptotics: two theorems, Fisher information, proofs. Other estimators.

General. We are interested in estimating a parameter vector $\theta_0$ from data $X$. Ad hoc or formal estimation? Definition: an estimator is a function of the data, $\hat\theta = T(X)$. Its interpretation rests on the properties listed on the next slide.

Properties.
Bias: $b = \theta_0 - E[T(X)]$.
Asymptotic bias: $\lim_{N\to\infty} \theta_0 - E[T(X)]$.
Consistency: $\hat\theta \xrightarrow{p} \theta_0$, i.e. $P(|\hat\theta - \theta_0| > \varepsilon) \to 0$ as $N \to \infty$.
Strong consistency: $\hat\theta \xrightarrow{a.s.} \theta_0$.
Efficiency: $\operatorname{Var}[T(X)] \geq I^{-1}(\theta_0)$, where $I(\theta_0)$ is the Fisher information matrix.
Asymptotic normality: $N^{\alpha}(\hat\theta - \theta_0) \xrightarrow{d} \mathrm{N}(\mu, \Sigma)$ for some rate $\alpha$ (typically $\alpha = 1/2$).

Popular estimators. How are the estimates computed? Optimization: least squares (LS), weighted least squares (WLS), prediction error methods (PEM), generalized method of moments (GMM), etc. Solving equations: estimation functions (EFs), method of moments (MM). Either: Bayesian methods (MCMC), maximum likelihood, quasi-maximum likelihood.

Least squares. Observations $y_1, \ldots, y_N$. Predictors $\sum_{j=1}^{J} B_j(x)\theta_j$. Vector form: $Y = Z\theta + \varepsilon$, $\varepsilon \sim (0, \Omega)$ (zero mean, covariance $\Omega$). (Weighted) parameter estimation:
$$\hat\theta_{LS} = \arg\min_{\theta\in\Theta} \sum_{n=1}^{N} \lambda_n \Big( y_n - \sum_j B_j(x_n)\theta_j \Big)^2 \quad (1)$$
$$= \arg\min_{\theta\in\Theta}\, (Y - Z\theta)^T W (Y - Z\theta) \quad (2)$$
Generalization (penalized least squares):
$$\hat\theta_{PLS} = \arg\min_{\theta\in\Theta}\, (Y - Z\theta)^T W (Y - Z\theta) + (\theta - \theta_0)^T D (\theta - \theta_0) \quad (3)$$

Estimate:
$$\hat\theta = (Z^T W Z)^{-1} Z^T W Y \quad (4)$$
Bias? No, not if the correct model is used:
$$E[\hat\theta] = E\big[(Z^T W Z)^{-1} Z^T W (Z\theta + \varepsilon)\big] \quad (5)$$
$$= \theta + (Z^T W Z)^{-1} Z^T W\, E[\varepsilon] = \theta, \quad (6)$$
since $E[\varepsilon] = 0$. Variance:
$$\operatorname{Var}[\hat\theta] = (Z^T W Z)^{-1} (Z^T W \Omega W Z)(Z^T W Z)^{-1} \quad (7)$$
This simplifies if $W = \Omega^{-1}$ and $\Omega = \sigma^2 I$. We then get
$$\operatorname{Var}[\hat\theta] = \sigma^2 (Z^T Z)^{-1}. \quad (8)$$
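As a concrete illustration of equations (4)-(8), here is a minimal numerical sketch with simulated data; the design matrix Z, the weight matrix W and all variable names are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, J = 200, 3
Z = rng.normal(size=(N, J))                     # design matrix
theta_true = np.array([1.0, -0.5, 2.0])
sigma2 = 0.25
Y = Z @ theta_true + rng.normal(scale=np.sqrt(sigma2), size=N)

W = np.eye(N) / sigma2                          # W = Omega^{-1} with Omega = sigma^2 I
# Weighted least-squares estimate, eq. (4)
theta_hat = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ Y)
# Sandwich covariance, eq. (7); should collapse to sigma^2 (Z^T Z)^{-1}, eq. (8)
Omega = sigma2 * np.eye(N)
bread = np.linalg.inv(Z.T @ W @ Z)
cov_theta = bread @ (Z.T @ W @ Omega @ W @ Z) @ bread

print(theta_hat)
print(np.allclose(cov_theta, sigma2 * np.linalg.inv(Z.T @ Z)))  # True
```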

F-tests. We estimate $\hat\sigma^2 = \frac{(Y - Z\hat\theta)^T (Y - Z\hat\theta)}{N - J}$. Define $Q(\hat\theta) = (Y - Z\hat\theta)^T (Y - Z\hat\theta) = Y^T (I-P)^T (I-P) Y$, where $P = Z(Z^T Z)^{-1} Z^T$ is the projection matrix. We approximate $Q$ by $E[Q] = \operatorname{tr}\big((I-P)^T (I-P) \operatorname{Cov}[Y, Y]\big)$. It holds for the standard model that $\operatorname{Cov}[Y, Y] = \sigma^2 I$ and $(I-P)^T (I-P) = I-P$. It then follows that
$$\operatorname{tr}\big((I-P)^T (I-P) \operatorname{Cov}[Y, Y]\big) = \sigma^2 \operatorname{tr}(I-P) \quad (9)$$
$$= \sigma^2 \operatorname{tr}(I) - \sigma^2 \operatorname{tr}(P) \quad (10)$$
$$= \sigma^2 N - \sigma^2 \operatorname{tr}\big(Z(Z^T Z)^{-1} Z^T\big) \quad (11)$$
$$= \sigma^2 N - \sigma^2 \operatorname{tr}\big((Z^T Z)^{-1} (Z^T Z)\big) = \sigma^2 (N - J), \quad (12)$$
so $\hat\sigma^2$ is unbiased.

What about the penalized asymptotics? What if the estimate is given by
$$\hat\theta_{PLS} = \arg\min_{\theta\in\Theta}\, (Y - Z\theta)^T W (Y - Z\theta) + (\theta - \theta_0)^T D (\theta - \theta_0)? \quad (13)$$
Derivation on the blackboard. Another popular form is the adaptive LASSO,
$$\hat\theta_{\text{Adaptive LASSO}} = \arg\min_{\theta\in\Theta}\, (Y - Z\theta)^T W (Y - Z\theta) + \|D(\theta - \theta_0)\|_1. \quad (14)$$
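The quadratic penalty in (13) still gives a closed-form estimator: setting the gradient to zero yields $\hat\theta_{PLS} = (Z^T W Z + D)^{-1}(Z^T W Y + D\theta_0)$. A minimal sketch under illustrative choices of $W$, $D$ and $\theta_0$ (none of which come from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
N, J = 200, 3
Z = rng.normal(size=(N, J))
theta_true = np.array([1.0, -0.5, 2.0])
Y = Z @ theta_true + rng.normal(scale=0.5, size=N)

W = np.eye(N)                 # weights (here: ordinary least squares)
D = 10.0 * np.eye(J)          # quadratic penalty matrix, illustrative value
theta0 = np.zeros(J)          # shrinkage target

# Minimizer of (Y - Z theta)^T W (Y - Z theta) + (theta - theta0)^T D (theta - theta0)
theta_pls = np.linalg.solve(Z.T @ W @ Z + D, Z.T @ W @ Y + D @ theta0)
print(theta_pls)              # shrunk towards theta0 relative to plain least squares
```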

Maximum likelihood estimators. Defined as
$$\hat\theta_{MLE} = \arg\max_{\theta\in\Theta} p_\theta(X_1, \ldots, X_N).$$
The MLE uses all information in the data and has nice properties: Ia. Consistency, $\hat\theta \xrightarrow{p} \theta_0$. Ib. Asymptotic normality: we have under general conditions that $\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} \mathrm{N}(0, I^{-1}(\theta_0))$. II. Efficiency, $\operatorname{Var}[T(X)] = I^{-1}(\theta_0)$ asymptotically. III. Invariance.
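As a concrete (and entirely illustrative) numerical example of the definition, here is the MLE for i.i.d. exponential data computed with scipy; the analytic answer $1/\bar x$ serves as a check:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
lam_true = 2.0
x = rng.exponential(scale=1.0 / lam_true, size=500)

def neg_loglik(lam):
    # Exponential log-likelihood: N log(lam) - lam * sum(x_i)
    return -(x.size * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
print(res.x, 1.0 / x.mean())  # numerical MLE vs. analytic MLE, nearly identical
```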

Two useful theorems. Assume $X_i$ i.i.d. with $E[X_i] = \mu$ and $\operatorname{Var}[X_i] = \sigma^2$. Law of large numbers:
$$\frac{1}{N}\sum_{i=1}^{N} X_i \xrightarrow{p/a.s.} \mu.$$
Central limit theorem:
$$\frac{1}{\sqrt{N}}\sum_{i=1}^{N} \frac{X_i - \mu}{\sigma} \xrightarrow{d} Z,$$
where $Z \sim \mathrm{N}(0, 1)$. These theorems can be generalized further.
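A quick simulation (illustrative, not part of the slides) of the CLT scaling: for exponential(1) data, the standardized mean should have roughly zero mean and unit variance:

```python
import numpy as np

rng = np.random.default_rng(3)
N, reps = 500, 10000
X = rng.exponential(size=(reps, N))             # Exp(1): mu = 1, sigma = 1
standardized = np.sqrt(N) * (X.mean(axis=1) - 1.0) / 1.0
print(standardized.mean(), standardized.var())  # approx 0 and 1, as the CLT predicts
```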

Fisher information. The Fisher information matrix is defined as
$$I(\theta) = \operatorname{Var}\!\left[\frac{\partial}{\partial\theta} \log p_\theta(X)\right],$$
which is equivalent to
$$E\!\left[\left(\frac{\partial}{\partial\theta} \log p_\theta(X)\right)^2\right] \quad\text{and}\quad -E\!\left[\frac{\partial^2}{\partial\theta^2} \log p_\theta(X)\right].$$
Note that $E\!\left[\frac{\partial}{\partial\theta} \log p_\theta(X)\right] = 0$.
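A Monte Carlo sanity check (illustrative, not from the slides) that these expressions agree, using a Bernoulli(p) model where the analytic Fisher information is $1/(p(1-p))$:

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.3
x = rng.binomial(1, p, size=200000).astype(float)

score = x / p - (1 - x) / (1 - p)               # d/dp log p_p(x)
second = -x / p**2 - (1 - x) / (1 - p)**2       # d^2/dp^2 log p_p(x)

# Variance of the score, E[score^2] and -E[second derivative]: all approx 1/(p(1-p))
print(score.var(), (score**2).mean(), -second.mean(), 1.0 / (p * (1 - p)))
```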

Proofs. Ia. Kullback-Leibler divergence and the law of large numbers. Ib. Second-order Taylor expansion of the likelihood. II. Cauchy-Schwarz inequality. III. Direct calculations.

Consistency. The log-likelihood function is defined as
$$l(\theta) = \sum_{n=1}^{N} \log p_\theta(x_n \mid x_{1:n-1}). \quad (15)$$
The estimate is given by
$$\hat\theta_{MLE} = \arg\max_{\theta\in\Theta} l(\theta) = \arg\max_{\theta\in\Theta} \frac{1}{N} l(\theta) \quad (16)$$
$$= \arg\max_{\theta\in\Theta} \frac{1}{N}\sum_{n=1}^{N} \log p_\theta(x_n \mid x_{1:n-1}). \quad (17)$$
Now we rewrite this as
$$\frac{1}{N} l(\theta) = \frac{1}{N} l(\theta) \pm E[l(\theta)] \quad (18)$$
$$= E[l(\theta)] + \left(\frac{1}{N} l(\theta) - E[l(\theta)]\right), \quad (19)$$
where $E[l(\theta)]$ denotes the expected (per-observation) log-likelihood.

Consistency, cont. This decomposition shows that
$$\frac{1}{N} l(\theta) = \underbrace{E[l(\theta)]}_{\text{expected log-likelihood}} + \underbrace{\left(\frac{1}{N} l(\theta) - E[l(\theta)]\right)}_{\text{random error}}. \quad (20)$$
It then follows that
$$\hat\theta_{MLE} = \arg\max_{\theta\in\Theta}\left\{E[l(\theta)] + \left(\frac{1}{N} l(\theta) - E[l(\theta)]\right)\right\} \approx \arg\max_{\theta\in\Theta} E[l(\theta)]. \quad (21)$$

First we look at the random error, which is written so that we can apply the law of large numbers. If
$$\lim_{N\to\infty}\left|\frac{1}{N} l(\theta) - E[l(\theta)]\right| = 0 \quad (22)$$
uniformly over $\Theta$, then it follows that
$$\lim_{N\to\infty} \hat\theta_{MLE} = \arg\max_{\theta\in\Theta} E[l(\theta)]. \quad (23)$$

Finally, we denote any feasible density by $q_\theta \in Q_\theta$ and assume that the true density $p_{\theta_0} \in Q_\theta$. The difference in expected log-likelihood between any other density and the true density is given by
$$E[\log q_\theta(X)] - E[\log p_{\theta_0}(X)] = \int \big(\log q_\theta(x) - \log p_{\theta_0}(x)\big)\, p_{\theta_0}(x)\, dx \quad (24)$$
$$= \int \log\!\left(\frac{q_\theta(x)}{p_{\theta_0}(x)}\right) p_{\theta_0}(x)\, dx. \quad (25)$$
Log is a concave function, so we know from Jensen's inequality that $E[\log(\xi)] \leq \log(E[\xi])$.

That means that
$$\int \log\!\left(\frac{q_\theta(x)}{p_{\theta_0}(x)}\right) p_{\theta_0}(x)\, dx \leq \log \int \frac{q_\theta(x)}{p_{\theta_0}(x)}\, p_{\theta_0}(x)\, dx = 0. \quad (26)$$
We only get equality when $q_\theta(x) = p_{\theta_0}(x)$ for all values of $x$. The conclusion is that
$$\theta_0 = \arg\max_{\theta\in\Theta} E[l(\theta)] \quad (27)$$
and hence that
$$\lim_{N\to\infty} \hat\theta_{MLE} = \theta_0 \quad (28)$$
as the random error vanishes when $N \to \infty$.
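A small simulation (illustrative, not from the slides) of inequality (26): the expected log-likelihood under the true density is never exceeded by that of a misspecified one. Here both densities are normal with unit variance but different means:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(loc=0.0, scale=1.0, size=100000)      # draws from the true density p

E_log_p = norm.logpdf(x, loc=0.0, scale=1.0).mean()  # E[log p_{theta_0}(X)]
E_log_q = norm.logpdf(x, loc=0.7, scale=1.0).mean()  # E[log q_theta(X)], wrong mean
print(E_log_q - E_log_p)                             # negative: equals -KL(p || q) < 0
```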

Proof Ib. Rewrite $L(X) = \exp\!\big(\log p_\theta(X_1) + \sum_{n=2}^{N} \log p_\theta(X_n \mid X_{1:n-1})\big)$. Take a second-order Taylor expansion around $\theta_0$ and maximize over $(\hat\theta - \theta_0)$... Consistency and asymptotic normality follow.

Proof I, ext. Write $l_N(\theta) = \log L(X \mid \theta)$. This is a sum consisting of $N$ terms (think LLN and CLT!). Approximate $l_N(\theta)$ using a second-order Taylor expansion around $\theta_0$. Thus
$$l_N(\theta) = l_N(\theta_0) + \partial_\theta l_N(\theta_0)(\theta - \theta_0) + \tfrac{1}{2}\partial^2_\theta l_N(\theta_0)(\theta - \theta_0)^2 + R.$$
Ignore the last term, multiply $l_N$ by $1/N$ and maximize w.r.t. $\theta$ to obtain $\hat\theta$. We obtain
$$\frac{1}{N}\partial_\theta l_N(\theta_0) + \frac{1}{N}\partial^2_\theta l_N(\theta_0)\,(\hat\theta - \theta_0) := 0.$$

Proof I, ext. We have from the LLN that
$$\frac{1}{N}\partial^2_\theta l_N(\theta_0) \to -I(\theta_0),$$
and from the CLT that
$$\frac{1}{\sqrt{N}}\partial_\theta l_N(\theta_0) \xrightarrow{d} Z, \quad\text{where } Z \sim \mathrm{N}(0, I(\theta_0)).$$
Rearranging gives
$$\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} \mathrm{N}(0, I^{-1}(\theta_0)).$$
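A Monte Carlo check (illustrative, not from the slides) of this limit for the exponential model, where $\hat\lambda = 1/\bar x$ and $I(\lambda) = 1/\lambda^2$, so $\sqrt{N}(\hat\lambda - \lambda_0)$ should have variance close to $\lambda_0^2$:

```python
import numpy as np

rng = np.random.default_rng(6)
lam0, N, reps = 2.0, 400, 10000
x = rng.exponential(scale=1.0 / lam0, size=(reps, N))
lam_hat = 1.0 / x.mean(axis=1)              # MLE in each replication

scaled = np.sqrt(N) * (lam_hat - lam0)
print(scaled.var(), lam0**2)                # close to I^{-1}(lambda0) = lambda0^2
```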

Proof II. Denote the score function $U = \partial_\theta \log p_\theta(X)$ and another estimator by $T$. Assume that $E[T] = \theta_0$ (to rule out anomalies). Calculate $\operatorname{Cov}(U, T)$:
$$\operatorname{Cov}(U, T) = E[UT] - E[U] E[T] = E[UT] - 0\cdot\theta_0 = E[UT],$$
$$E[UT] = \int T(x)\,\partial_\theta \log p_{\theta_0}(x)\, p_{\theta_0}(x)\, dx = \partial_\theta \int T(x)\, p_{\theta_0}(x)\, dx = \partial_\theta\,\theta = 1.$$
Cauchy-Schwarz states $\operatorname{Cov}(U, T) \leq \sqrt{\operatorname{Var}[U]\operatorname{Var}[T]}$. Thus
$$\operatorname{Var}[T] \geq \operatorname{Var}[\partial_\theta \log p_\theta(X)]^{-1} = I^{-1}(\theta_0),$$
i.e. $\operatorname{Var}[T] \geq \operatorname{Var}[\hat\theta_{MLE}]$ (asymptotically)!

Proof III. Original problem: $\hat\theta_{MLE} = \arg\max_{\theta\in\Theta} p_\theta(X_1, \ldots, X_N)$. Define $Y = g(X)$. Calculations... OK.

Other estimators. We can derive the asymptotic distribution using the same arguments for any M-estimator, i.e. any estimator that takes the estimate to be the value that maximizes/minimizes a function of the data. Similar arguments can also be used for Z-estimators, i.e. estimators that take the estimate to be the value that solves a system of equations depending on the data.

M-estimators. Take any loss function $J(\theta)$, e.g. PEM: $J(\theta) = \sum_{i=1}^{N} \varepsilon_i(\theta)^2$. The corresponding problem is
$$\hat\theta = \arg\max_{\theta\in\Theta} J(\theta) \quad\text{(or $\arg\min$ for a loss).} \quad (29)$$
A second-order Taylor expansion around $\theta_0$, as in the ML proof, gives
$$\frac{1}{N}\partial_\theta J(\theta_0) + \frac{1}{N}\partial^2_\theta J(\theta_0)\,(\hat\theta - \theta_0) := 0.$$
Assume that $E\big[\frac{1}{N}\partial_\theta J(\theta_0)\big] = 0$ (why?), $\operatorname{Var}\big[\frac{1}{\sqrt{N}}\partial_\theta J(\theta_0)\big] = W$ and $E\big[\frac{1}{N}\partial^2_\theta J(\theta_0)\big] = V$. Applying the limit theorems as in the maximum likelihood setup gives the asymptotics. Result:
$$\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} \mathrm{N}(0, V^{-1} W (V^{-1})^T).$$
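A hedged numerical sketch of the resulting sandwich covariance $V^{-1} W (V^{-1})^T$, using least squares with heteroskedastic noise as the M-estimator (the model, data and variable names are illustrative; constant factors from the squared loss cancel in the sandwich):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 2000
z = np.column_stack([np.ones(N), rng.normal(size=N)])
theta_true = np.array([0.5, 1.5])
eps = rng.normal(size=N) * (0.5 + np.abs(z[:, 1]))   # heteroskedastic errors
y = z @ theta_true + eps

theta_hat = np.linalg.solve(z.T @ z, z.T @ y)        # least-squares M-estimate
resid = y - z @ theta_hat

V_hat = (z.T @ z) / N                                # estimate of V (Hessian of the loss)
zr = z * resid[:, None]
W_hat = (zr.T @ zr) / N                              # estimate of W (variance of the score)
sandwich = np.linalg.inv(V_hat) @ W_hat @ np.linalg.inv(V_hat) / N
print(np.sqrt(np.diag(sandwich)))                    # robust standard errors of theta_hat
```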

Z-estimators. Take any estimating function $G(\theta)$ and define $\hat\theta$ as the solution of
$$G(\hat\theta) = 0, \quad (30)$$
e.g. estimation functions $G(\theta) = \sum_i g_i(\theta)(x_i - E[X_i])$. Compute a first-order Taylor expansion around $\theta_0$ and solve w.r.t. $\theta$. Thus
$$G(\theta_0) + \partial_\theta G(\theta_0)(\hat\theta - \theta_0) := 0.$$
Multiply by $1/N$ and apply the limit theorems. Let $E\big[\frac{1}{N} G(\theta_0)\big] = 0$ (why?), $\operatorname{Var}\big[\frac{1}{\sqrt{N}} G(\theta_0)\big] = W$ and $E\big[\frac{1}{N}\partial_\theta G(\theta_0)\big] = V$. Then
$$\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} \mathrm{N}(0, V^{-1} W (V^{-1})^T).$$

The likelihood framework can be used to test the fit of models: confidence intervals and likelihood ratio tests.

Confidence intervals. Assume $\sqrt{N}(\hat\theta - \theta_0) \xrightarrow{d} \mathrm{N}(0, \Sigma)$. Then an approximate (95%) confidence interval is
$$I_{\theta_0} \approx \hat\theta \pm 2\sqrt{\operatorname{diag}(\Sigma)/N}.$$
Better accuracy: use the profile likelihood!
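A minimal sketch of such a Wald-type interval (illustrative model, not from the slides): the exponential-rate MLE with the observed Fisher information used to estimate the variance:

```python
import numpy as np

rng = np.random.default_rng(8)
lam_true = 2.0
x = rng.exponential(scale=1.0 / lam_true, size=400)

lam_hat = 1.0 / x.mean()                    # MLE
obs_info = x.size / lam_hat**2              # observed Fisher information, -l''(lam_hat)
se = 1.0 / np.sqrt(obs_info)                # approximate standard error
print(lam_hat - 2 * se, lam_hat + 2 * se)   # approximate 95% confidence interval
```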

Likelihood-based tests. Likelihood ratio tests, where $\Theta_R \subset \Theta_F$. Then
$$\Lambda = \frac{\sup\{L(X \mid \theta),\ \theta \in \Theta_F\}}{\sup\{L(X \mid \theta),\ \theta \in \Theta_R\}}, \qquad 2\log(\Lambda) \sim \chi^2(k)$$
asymptotically, $k$ being the degrees of freedom. LR tests are optimal, cf. Neyman-Pearson, and are closely related to AIC, BIC etc.
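A short illustration (model and restriction chosen for the example, not from the slides) of an LR test: exponential data, testing the restriction $\lambda = 1$ against a freely estimated $\lambda$, compared with the $\chi^2(1)$ reference distribution:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
x = rng.exponential(scale=1.0 / 1.3, size=300)    # true rate 1.3

def loglik(lam):
    return x.size * np.log(lam) - lam * x.sum()

lam_free = 1.0 / x.mean()                         # MLE under the full model
lr_stat = 2 * (loglik(lam_free) - loglik(1.0))    # 2 log(Lambda) for H0: lambda = 1
p_value = chi2.sf(lr_stat, df=1)
print(lr_stat, p_value)
```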

Likelihood ratios. Compare $L(X \mid \hat\theta)$ to $L(X \mid \theta_0)$. We have that
$$2\log(\Lambda) = 2\big(l_N(\hat\theta) - l_N(\theta_0)\big).$$
Taylor expand:
$$l_N(\theta) = l_N(\theta_0) + \partial_\theta l_N(\theta_0)(\theta - \theta_0) + \tfrac{1}{2}\partial^2_\theta l_N(\theta_0)(\theta - \theta_0)^2 + R.$$
The estimate must solve $\partial_\theta l_N(\theta_0) + \partial^2_\theta l_N(\theta_0)(\hat\theta - \theta_0) = 0$. Thus
$$2\log(\Lambda) = 2\cdot\tfrac{1}{2}\,\partial_\theta l_N(\theta_0)\,(\hat\theta - \theta_0).$$

Likelihood ratios, cont. Plugging in $(\hat\theta - \theta_0) = \big(-\partial^2_\theta l_N(\theta_0)\big)^{-1}\partial_\theta l_N(\theta_0)$ gives
$$2\log(\Lambda) = \partial_\theta l_N(\theta_0)\big(-\partial^2_\theta l_N(\theta_0)\big)^{-1}\partial_\theta l_N(\theta_0),$$
or written more nicely,
$$2\log(\Lambda) = \frac{1}{\sqrt{N}}\partial_\theta l_N(\theta_0)\Big(-\frac{1}{N}\partial^2_\theta l_N(\theta_0)\Big)^{-1}\frac{1}{\sqrt{N}}\partial_\theta l_N(\theta_0).$$
This is a quadratic form, distributed as $\chi^2(k)$, $k$ being the degrees of freedom.

Feedback. Send feedback, questions and information about typos to erikl@maths.lth.se.