Estimation Theory

Overview
- Properties
- Bias, Variance, and Mean Square Error
- Cramér-Rao lower bound
- Maximum likelihood
- Consistency
- Confidence intervals
- Properties of the mean estimator

Introduction
- Up until now we have defined and discussed properties of random variables and processes
- In each case we started with some known property (e.g. autocorrelation) and derived other related properties (e.g. PSD)
- In practical problems we rarely know these properties a priori
- Instead, we must estimate what we wish to know from finite sets of measurements

Terminology
- Suppose we have N independent, identically distributed (i.i.d.) observations $\{x_i\}_{i=1}^{N}$
- Ideally we would like to know the pdf of the data $f_x(x;\theta)$, where $\theta \in \mathbb{R}^p$
- In probability theory, we think about the likeliness of $\{x_i\}_{i=1}^{N}$ given the pdf and $\theta$
- In inference, we are given $\{x_i\}_{i=1}^{N}$ and are interested in the likeliness of $\theta$
- Called the sampling distribution
- We will use $\theta$ to denote the parameter (or vector of parameters) we wish to estimate

Estimators as Random Variables
- Our estimator is a function of the measurements
$$\hat{\theta} = \hat{\theta}\left[\{x_i\}_{i=1}^{N}\right]$$
- It is therefore a random variable: it will be different for every different set of observations
- It is called an estimate or, if $\theta$ is a scalar, a point estimate
- Of course we want $\hat{\theta}$ to be as close to the true $\theta$ as possible
- This could be, for example, the process mean $\mu_x$
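Since the estimator is itself a random variable, its distribution can be explored by simulation. Below is a minimal sketch (Python with NumPy; the true parameter values are hypothetical, not from the slides) that draws many data sets and computes the sample mean of each, showing that the estimates scatter around the true mean with spread roughly $\sigma/\sqrt{N}$.

```python
# Minimal sketch: the sample mean as a random variable.
# Each synthetic data set of N observations yields a different estimate.
import numpy as np

rng = np.random.default_rng(0)
N, n_datasets = 50, 10000
mu, sigma = 2.0, 3.0  # hypothetical true parameters

# One row per data set; the estimator maps each row to a point estimate
x = rng.normal(mu, sigma, size=(n_datasets, N))
mu_hat = x.mean(axis=1)  # 10000 realizations of the estimator

print(f"mean of estimates: {mu_hat.mean():.4f} (true mu = {mu})")
print(f"std of estimates:  {mu_hat.std():.4f} (sigma/sqrt(N) = {sigma/np.sqrt(N):.4f})")
```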
Natural Estimators
$$\hat{\mu}_x = \hat{\theta}\left[\{x_i\}_{i=1}^{N}\right] = \frac{1}{N}\sum_{n=1}^{N} x_n$$
- This is the obvious or natural estimator of the process mean
- Sometimes called the average or sample mean
- It will also turn out to be the best estimator (I will define "best" shortly)
$$\hat{\sigma}_x^2 = \hat{\theta}\left[\{x_i\}_{i=1}^{N}\right] = \frac{1}{N}\sum_{n=1}^{N}\left(x_n - \hat{\mu}_x\right)^2$$
- This is the obvious or natural estimator of the process variance
- Not the best

Good Estimators
- Without loss of generality, let us consider a scalar parameter for the time being
- What is a good estimator?
- The distribution of $\hat{\theta}$ should be centered at the true value
- We want the distribution to be as narrow as possible
- Lower-order moments enable coarse measures of how good $\hat{\theta}$ is

Bias
- The bias of an estimator $\hat{\theta}$ of a parameter $\theta$ is defined as
$$B(\hat{\theta}) \triangleq E[\hat{\theta}] - \theta$$
- Unbiased: an estimator is said to be unbiased if $B(\hat{\theta}) = 0$
- This implies the pdf of the estimator is centered at the true value $\theta$
- The sample mean is unbiased
- The estimator of variance on the earlier slide is biased
- Unbiased estimators are generally good, but they are not always best (more later)

Variance
- The variance of an estimator $\hat{\theta}$ of a parameter $\theta$ is defined as
$$\operatorname{var}(\hat{\theta}) = \sigma_{\hat{\theta}}^2 \triangleq E\left[\left(\hat{\theta} - E[\hat{\theta}]\right)^2\right]$$
- A measure of the spread of $\hat{\theta}$ about its mean
- We would like the variance to be as small as possible
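The claim that the natural variance estimator is biased is easy to check numerically. The sketch below (Python/NumPy; all parameter values are hypothetical) compares the 1/N and 1/(N-1) versions over many simulated data sets; the natural one concentrates near $\frac{N-1}{N}\sigma^2$ rather than $\sigma^2$.

```python
# Minimal sketch: the natural (1/N) variance estimator is biased.
# Monte Carlo over many data sets shows E[sigma_hat^2] ~ (N-1)/N * sigma^2.
import numpy as np

rng = np.random.default_rng(1)
N, n_datasets = 10, 200000
sigma2 = 4.0  # hypothetical true variance

x = rng.normal(0.0, np.sqrt(sigma2), size=(n_datasets, N))
var_natural = x.var(axis=1, ddof=0)   # divides by N (biased)
var_unbiased = x.var(axis=1, ddof=1)  # divides by N-1 (unbiased)

print(f"E[natural estimator]  ~ {var_natural.mean():.3f}, "
      f"(N-1)/N * sigma^2 = {(N-1)/N*sigma2:.3f}")
print(f"E[unbiased estimator] ~ {var_unbiased.mean():.3f}, sigma^2 = {sigma2:.3f}")
```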
The Bias-Variance Tradeoff
[Figure: pdfs of several estimators $\hat{\theta}$ with differing bias and variance]
- In many cases minimizing variance conflicts with minimizing bias
- Note that the estimator $\hat{\theta} = 0$ has zero variance, but is generally biased
- In these cases we must trade variance for bias (or vice versa)
- Understanding of the bias-variance tradeoff is crucial to this course
- Unbiased models are not always best
- The methods we will use to estimate the model coefficients are biased
- But they may be more accurate, because they have less variance
- This idea applies to nonlinear models as well

Bias, Variance, and Modeling
$$y(x) = g(x) + \varepsilon \qquad \hat{y}(x) = \hat{g}(x)$$
- In the modeling context, we are usually interested in estimating a function
- For a given input $x$, this function is a scalar
- We can define $\theta = g(x)$
- Thus, all of the ideas that apply to estimating parameters also apply to estimating functional relationships

Notation and Prediction Error
$$y = g(x) + \varepsilon \qquad g = g(x) \qquad \hat{g} = \hat{g}(x) \qquad \hat{g}_e = E[\hat{g}(x)]$$
- The expectation is taken over the distribution of data sets used to construct $\hat{g}(x)$ and the distribution of the process noise $f(\varepsilon)$
- Everything is a function of $x$
- Recall that $\varepsilon$ is i.i.d. with zero mean
- We are treating $x$ as a fixed, non-random variable
- The dependence on $x$ is not shown to simplify notation
- The prediction error for a new, given input $x$ is defined as
$$\begin{aligned}
\operatorname{PE}(x) &= E[(y - \hat{g})^2] \\
&= E[((g - \hat{g}) + \varepsilon)^2] \\
&= E[(g - \hat{g})^2] + 2E[(g - \hat{g})\varepsilon] + E[\varepsilon^2] \\
&= \operatorname{MSE}(x) + \sigma_\varepsilon^2
\end{aligned}$$
- The cross term vanishes because $\varepsilon$ has zero mean and is independent of $\hat{g}$
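The decomposition $\operatorname{PE}(x) = \operatorname{MSE}(x) + \sigma_\varepsilon^2$ can be verified by Monte Carlo. The following sketch (Python/NumPy) fits a deliberately too-simple model, a straight line, to data from a nonlinear $g(x)$; the function $g$, the noise level, and the evaluation point x0 are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of PE(x) = MSE(x) + sigma_eps^2 at one fixed input x0.
import numpy as np

rng = np.random.default_rng(2)
g = lambda x: np.sin(2 * np.pi * x)  # hypothetical true function
sigma_eps, N, n_datasets, x0 = 0.3, 30, 10000, 0.25

g_hat_x0 = np.empty(n_datasets)
for i in range(n_datasets):
    x = rng.uniform(0, 1, N)
    y = g(x) + rng.normal(0, sigma_eps, N)
    b, a = np.polyfit(x, y, 1)   # degree-1 least-squares fit (slope, intercept)
    g_hat_x0[i] = b * x0 + a     # model prediction at x0

mse = np.mean((g(x0) - g_hat_x0) ** 2)
y_new = g(x0) + rng.normal(0, sigma_eps, n_datasets)  # fresh noise at x0
pe = np.mean((y_new - g_hat_x0) ** 2)
print(f"PE(x0) ~ {pe:.4f},  MSE(x0) + sigma_eps^2 = {mse + sigma_eps**2:.4f}")
```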
The Bias-Variance Tradeoff Derivation
$$y = g(x) + \varepsilon \qquad g = g(x) \qquad \hat{g} = \hat{g}(x) \qquad \hat{g}_e = E[\hat{g}(x)]$$
- Only $\hat{g}$ is a random function; nothing else is dependent on the data set
$$\begin{aligned}
\operatorname{MSE}(x) &= E[(g - \hat{g})^2] \\
&= E\left[\{(g - \hat{g}_e) - (\hat{g} - \hat{g}_e)\}^2\right] \\
&= E\left[(g - \hat{g}_e)^2 - 2(g - \hat{g}_e)(\hat{g} - \hat{g}_e) + (\hat{g} - \hat{g}_e)^2\right]
\end{aligned}$$

Bias-Variance Tradeoff Derivation Continued
$$\begin{aligned}
E\left[(g - \hat{g}_e)^2 - 2(g - \hat{g}_e)(\hat{g} - \hat{g}_e)\right]
&= E\left[g^2 - 2g\hat{g}_e + \hat{g}_e^2 - 2g(\hat{g} - \hat{g}_e) + 2\hat{g}_e\hat{g} - 2\hat{g}_e^2\right] \\
&= E\left[g^2 - 2g\hat{g}_e + \hat{g}_e^2 - 2g\hat{g} + 2g\hat{g}_e + 2\hat{g}_e\hat{g} - 2\hat{g}_e^2\right] \\
&= E\left[g^2 - 2g\hat{g} + 2\hat{g}_e\hat{g} - \hat{g}_e^2\right] \\
&= g^2 - 2g\,E[\hat{g}] + 2\hat{g}_e E[\hat{g}] - \hat{g}_e^2 \\
&= g^2 - 2g\hat{g}_e + \hat{g}_e^2 \\
&= (g - \hat{g}_e)^2
\end{aligned}$$
- Thus
$$\operatorname{MSE}(x) = (g - \hat{g}_e)^2 + E[(\hat{g} - \hat{g}_e)^2] = (g - E[\hat{g}])^2 + E\left[(\hat{g} - E[\hat{g}])^2\right]$$

Bias-Variance Tradeoff Comments
$$\operatorname{MSE}(x) = (g - E[\hat{g}])^2 + E\left[(\hat{g} - E[\hat{g}])^2\right] = \text{Bias}^2 + \text{Variance}$$
- Large variance: the model is sensitive to small changes in the data set
- Large bias: if the model were compared to the true function on a large number of data sets, the expected value of the model $\hat{g}(x)$ would not be close to the true function $g(x)$
- If the model is sensitive to small changes in the data, a biased model may have smaller error (MSE) than an unbiased model
- If the data is strongly collinear, biased estimation can result in more accurate models!

Bias-Variance Tradeoff Comments Continued
$$\operatorname{MSE}(x) = (g - E[\hat{g}])^2 + E\left[(\hat{g} - E[\hat{g}])^2\right] = \text{Bias}^2 + \text{Variance}$$
- Large variance, small bias
  - If the model is too flexible, it can overfit the data
  - The model will change dramatically from one data set to another
  - In this case it has high variance, but potentially low bias
- Small variance, large bias
  - If the model is not very flexible, it may not capture the true relationship between the inputs and the output
  - It will not vary as much from one data set to another
  - In this case the model has low variance, but potentially high bias
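A Monte Carlo sketch of the decomposition: for polynomial models of increasing degree fit to data from a hypothetical $g(x)$, the bias term shrinks and the variance term grows, while their sum is the MSE. The specific function, noise level, and degrees below are illustrative assumptions (Python/NumPy).

```python
# Minimal sketch: estimating Bias^2 and Variance at one input for polynomial
# models of increasing flexibility, via Monte Carlo over data sets.
import numpy as np

rng = np.random.default_rng(3)
g = lambda x: np.sin(2 * np.pi * x)   # hypothetical true function
sigma_eps, N, n_datasets, x0 = 0.3, 30, 5000, 0.25

for degree in (1, 3, 9):
    g_hat = np.empty(n_datasets)
    for i in range(n_datasets):
        x = rng.uniform(0, 1, N)
        y = g(x) + rng.normal(0, sigma_eps, N)
        coeffs = np.polyfit(x, y, degree)
        g_hat[i] = np.polyval(coeffs, x0)
    bias2 = (g(x0) - g_hat.mean()) ** 2
    variance = g_hat.var()
    print(f"degree {degree}: Bias^2 = {bias2:.5f}, Variance = {variance:.5f}, "
          f"sum (MSE) = {bias2 + variance:.5f}")
```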
Mean Square Error
- The mean square error of an estimator $\hat{\theta}$ of a parameter $\theta$ is defined as
$$\operatorname{MSE}(\hat{\theta}) \triangleq E\left[(\theta - \hat{\theta})^2\right] = \sigma_{\hat{\theta}}^2 + B(\hat{\theta})^2$$
- We will often use MSE as a global measure of estimator performance
- Note that two different estimators may have the same MSE but different bias and variance
- This criterion is convenient for building estimators (creating a problem we can solve)
- Note the rationale is due to convenience: picking MSE results in a simple bias/variance decomposition
- Other error measures generally do not have such a decomposition

Cramér-Rao Lower Bound
$$\operatorname{var}(\hat{\theta}) \ge \frac{1}{E\left[\left(\frac{\partial \ln f_x(x;\theta)}{\partial\theta}\right)^2\right]} = \frac{-1}{E\left[\frac{\partial^2 \ln f_x(x;\theta)}{\partial\theta^2}\right]}$$
- Minimum Variance Unbiased (MVU): estimators that are both unbiased and have the smallest variance of all possible estimators
- Note that these do not necessarily achieve the minimum MSE
- The Cramér-Rao Lower Bound (CRLB) shown above is a lower bound on the variance of unbiased estimators
- Log likelihood function of $\theta$: $\ln f_x(x;\theta)$
- Note that the pdf $f_x(x;\theta)$ describes the distribution of the data (stochastic process), not the parameter
- $\theta$ is not a random variable, it is a parameter that defines the distribution

Cramér-Rao Lower Bound Comments
$$\operatorname{var}(\hat{\theta}) \ge \frac{1}{E\left[\left(\frac{\partial \ln f_x(x;\theta)}{\partial\theta}\right)^2\right]} = \frac{-1}{E\left[\frac{\partial^2 \ln f_x(x;\theta)}{\partial\theta^2}\right]}$$
- Efficient estimator: an unbiased estimator that achieves the CRLB with equality
- If it exists, then the unique solution is given by
$$\frac{\partial \ln f_x(x;\theta)}{\partial\theta} = 0$$
where the pdf is evaluated at the observed outcome $x(\zeta)$
- Maximum Likelihood (ML) estimate: an estimator that satisfies the equation above
- This can be generalized to vectors of parameters
- Limited use: $f_x(x;\theta)$ is rarely known in practice

Consistency
- Consistent estimator: an estimator such that
$$\lim_{N\to\infty} \operatorname{MSE}(\hat{\theta}) = 0$$
- Implies the following as the sample size grows ($N\to\infty$):
  - The estimator becomes unbiased
  - The variance approaches zero
  - The distribution $f_{\hat{\theta}}(\theta)$ becomes an impulse centered at $\theta$
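For i.i.d. Gaussian data with known variance, the sample mean is the ML estimator of the mean and its variance $\sigma^2/N$ meets the CRLB, so it is efficient; its MSE also shrinks as N grows, illustrating consistency. A quick numerical check (Python/NumPy; parameter values are hypothetical):

```python
# Minimal sketch: the sample mean attains the CRLB sigma^2/N for i.i.d.
# Gaussian data, and its MSE shrinks with N (consistency).
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, n_sets = 1.0, 2.0, 10000
for N in (10, 100, 1000):
    mu_hat = rng.normal(mu, sigma, size=(n_sets, N)).mean(axis=1)
    crlb = sigma**2 / N  # CRLB for the mean of a Gaussian with known variance
    mse = np.mean((mu_hat - mu) ** 2)
    print(f"N={N:4d}: var(mu_hat) ~ {mu_hat.var():.5f}, "
          f"CRLB = {crlb:.5f}, MSE ~ {mse:.5f}")
```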
Confidence Intervals
- Confidence interval: an interval, $a < \theta \le b$, that has a specified probability of covering the unknown true parameter value
$$\Pr\{a < \theta \le b\} = 1 - \alpha$$
- The interval is estimated from the data; therefore it is also a pair of random variables
- Confidence level: the coverage probability of a confidence interval, $1 - \alpha$
- The confidence interval is not uniquely defined by the confidence level

Properties of the Sample Mean
$$\hat{\mu}_x = \frac{1}{N}\sum_{n=0}^{N-1} x(n) \qquad E[\hat{\mu}_x] = \mu_x \qquad \operatorname{var}(\hat{\mu}_x) = \frac{1}{N^2}\sum_{k=0}^{N-1}\sum_{l=0}^{N-1}\gamma_x(k-l) = \frac{\sigma_x^2}{N} \;\text{(i.i.d. case)}$$
- The estimator is unbiased
- It can also be shown that:
  - It has minimum variance
  - If Gaussian, it is the maximum likelihood estimator
  - If Gaussian, it attains the Cramér-Rao Lower Bound

Sample Mean Confidence Intervals
$$f_{\hat{\mu}_x}(\hat{\mu}_x) = \frac{1}{\sqrt{2\pi(\sigma_x^2/N)}} \exp\left[-\frac{1}{2}\left(\frac{\hat{\mu}_x - \mu_x}{\sigma_x/\sqrt{N}}\right)^2\right]$$
$$\Pr\left\{\mu_x - k\frac{\sigma_x}{\sqrt{N}} < \hat{\mu}_x < \mu_x + k\frac{\sigma_x}{\sqrt{N}}\right\}
= \Pr\left\{\hat{\mu}_x - k\frac{\sigma_x}{\sqrt{N}} < \mu_x < \hat{\mu}_x + k\frac{\sigma_x}{\sqrt{N}}\right\} = 1 - \alpha$$
- In general, we don't know the pdf
- If we can assume the process is Gaussian and i.i.d., we know the pdf (sampling distribution) of the estimator
- If N is large and the distribution doesn't have heavy tails, the distribution of $\hat{\mu}_x$ is Gaussian by the Central Limit Theorem (CLT)

Sample Mean Confidence Intervals Comments
$$\Pr\left\{\hat{\mu}_x - k\frac{\sigma_x}{\sqrt{N}} < \mu_x < \hat{\mu}_x + k\frac{\sigma_x}{\sqrt{N}}\right\} = 1 - \alpha$$
- In many cases the confidence intervals are accurate, even if they are only approximate
- We can choose k such that $1 - \alpha$ equals any probability we like
- In general, the user picks $\alpha$; this controls how often the confidence interval does not cover $\mu_x$
- 95% and 99% are common choices
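A minimal sketch of the z-based interval and its coverage (Python with NumPy and SciPy; scipy.stats.norm.ppf supplies k, and the parameter values are hypothetical). Over many simulated data sets, the interval should cover $\mu_x$ about $1 - \alpha$ of the time.

```python
# Minimal sketch: Gaussian (z-based) confidence interval for the mean with
# known sigma, and its empirical coverage over many data sets.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu, sigma, N, alpha = 1.0, 2.0, 50, 0.05
k = stats.norm.ppf(1 - alpha / 2)  # k ~ 1.96 for a 95% interval

covered, trials = 0, 20000
for _ in range(trials):
    x = rng.normal(mu, sigma, N)
    mu_hat = x.mean()
    half = k * sigma / np.sqrt(N)
    covered += (mu_hat - half < mu < mu_hat + half)
print(f"empirical coverage: {covered / trials:.4f} (nominal {1 - alpha})")
```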
Sample Mean Variance when Gaussian and IID
$$\Pr\left\{\hat{\mu}_x - k\frac{\sigma_x}{\sqrt{N}} < \mu_x < \hat{\mu}_x + k\frac{\sigma_x}{\sqrt{N}}\right\} = 1 - \alpha$$
- If $\sigma_x$ is unknown (usually), we must estimate it from the data
$$\hat{\sigma}_x^2 = \frac{1}{N-1}\sum_{n=0}^{N-1}\left[x(n) - \hat{\mu}_x\right]^2$$
- The corresponding z-score then has a different distribution
- If $x(n)$ is i.i.d. and Gaussian,
$$\frac{\hat{\mu}_x - \mu_x}{\hat{\sigma}_x/\sqrt{N}}$$
has a Student's t distribution with $\nu = N - 1$ degrees of freedom
- It approaches a Gaussian distribution as $\nu$ becomes large

Sample Mean Variance when Gaussian
$$E[\hat{\mu}_x] = \mu_x \qquad \operatorname{var}(\hat{\mu}_x) = \frac{1}{N}\sum_{l=-(N-1)}^{N-1}\left(1 - \frac{|l|}{N}\right)\gamma_x(l)$$
- If $x(n)$ is Gaussian but not i.i.d., the sample mean is normal with mean $\mu_x$
- The approximate confidence interval is given by a Gaussian pdf
$$\Pr\left\{\hat{\mu}_x - k\sqrt{\operatorname{var}(\hat{\mu}_x)} < \mu_x < \hat{\mu}_x + k\sqrt{\operatorname{var}(\hat{\mu}_x)}\right\} = 1 - \alpha$$
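The same coverage experiment with $\sigma_x$ estimated from the data: using the Student's t critical value with $\nu = N - 1$ keeps the coverage near $1 - \alpha$ even for small N, where the Gaussian k would undercover (Python with NumPy/SciPy; parameter values are hypothetical).

```python
# Minimal sketch: Student's t confidence interval for the mean when sigma
# is estimated from the same data, with empirical coverage.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
mu, sigma, N, alpha = 1.0, 2.0, 10, 0.05
k = stats.t.ppf(1 - alpha / 2, df=N - 1)  # t critical value, nu = N - 1

covered, trials = 0, 20000
for _ in range(trials):
    x = rng.normal(mu, sigma, N)
    mu_hat, s = x.mean(), x.std(ddof=1)   # sigma_hat with 1/(N-1)
    half = k * s / np.sqrt(N)
    covered += (mu_hat - half < mu < mu_hat + half)
print(f"empirical coverage: {covered / trials:.4f} (nominal {1 - alpha})")
```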