1 Hypothesis Testing and Model Selection


A Short Course on Bayesian Inference
(based on "An Introduction to Bayesian Analysis: Theory and Methods" by Ghosh, Delampady and Samanta)

Module 6: From Chapter 6 of GDS

1 Hypothesis Testing and Model Selection

For Bayesians, model selection and model criticism are extremely important inference problems. These problems are usually more difficult than point estimation or credible set construction.

Suppose $X|\theta$ has the density $f(x|\theta)$, with $\theta$ an unknown element of the parameter space $\Theta$. Suppose that we are interested in comparing two models $M_0$ and $M_1$, given by

$M_0$: $X$ has density $f(x|\theta)$ where $\theta \in \Theta_0$;
$M_1$: $X$ has density $f(x|\theta)$ where $\theta \in \Theta_1$.   (1)

For $i = 0, 1$, let $g_i(\theta)$ be the prior density of $\theta$, conditional on $M_i$ being the true model. Then to compare models $M_0$ and $M_1$ on the basis of a random sample $x = (x_1, \ldots, x_n)$ we use the Bayes factor

$B_{01}(x) = m_0(x)/m_1(x)$, where $m_i(x) = \int_{\Theta_i} f(x|\theta) g_i(\theta)\,d\theta$, $i = 0, 1$.   (2)

We also often use the notation $BF_{01}$ to denote this Bayes factor. Recall from Module 1 that if $\pi_0 = P^\pi(M_0) = P^\pi(\Theta_0)$ and $\pi_1 = 1 - \pi_0 = P^\pi(M_1)$, then the posterior probability of $M_0$ is

$P(M_0|x) = \left\{1 + \frac{\pi_1}{\pi_0}\, B_{01}^{-1}(x)\right\}^{-1}$.   (3)

Thus, if conditional prior densities $g_0$ and $g_1$ can be specified, we simply use the Bayes factor $B_{01}$ for model selection. If $\pi_0$ is also specified, we can use the posterior probability of $M_i$, or the posterior odds ratio of $M_0$ to $M_1$, for model selection.

Consider now a different model checking problem, that of testing for normality. In its simplest form, the problem can be stated as checking whether a given random sample $X_1, \ldots, X_n$ arose from a population having the normal distribution. In the setup given above in (1), we may write it as

$M_0$: $X$ is $N(\mu, \sigma^2)$ with arbitrary $\mu$ and $\sigma^2 > 0$;
$M_1$: $X$ does not have the normal distribution.   (4)
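Equation (3) is easy to evaluate in practice. The following minimal Python sketch (not from GDS; the numbers are purely illustrative) converts a Bayes factor and a prior probability into a posterior probability of $M_0$:

```python
def posterior_prob_M0(B01, pi0=0.5):
    """Posterior probability of M0 from the Bayes factor B01 = m0/m1 and the
    prior probability pi0 = P(M0), via equation (3):
    P(M0 | x) = {1 + (pi1/pi0) * B01^{-1}}^{-1}."""
    pi1 = 1.0 - pi0
    return 1.0 / (1.0 + (pi1 / pi0) / B01)

# With equal prior odds, B01 = 3 gives posterior probability 3/4 for M0,
# and B01 = 1 leaves the evidence balanced at 1/2.
print(posterior_prob_M0(3.0))        # 0.75
print(posterior_prob_M0(1.0, 0.5))   # 0.5
```

With $\pi_0 = 1/2$ the posterior odds equal the Bayes factor itself, which is why $B_{01}$ alone often serves as the model-selection summary.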

The above model selection problem looks quite different from (1), because $M_1$ does not constitute a parametric alternative. Here the Bayes factor or the posterior odds will not work; we will later introduce the Bayesian P-value, which does.

Laplace approximation of the Bayes factor: We approximate the Bayes factor $B_{01}$ by applying the Laplace approximation to the two marginal densities. Recall that the marginal density under model $M_i$ is $m_i(x) = \int f(x|\theta_i) g_i(\theta_i)\,d\theta_i$. If $\tilde{\theta}_i$ denotes the posterior mode of $\theta_i$, then by Taylor expansion

$\log\{f(x|\theta_i) g_i(\theta_i)\} \approx \log\{f(x|\tilde{\theta}_i) g_i(\tilde{\theta}_i)\} - \frac{1}{2}(\theta_i - \tilde{\theta}_i)^T H_{\tilde{\theta}_i} (\theta_i - \tilde{\theta}_i)$,

where $H_{\tilde{\theta}_i}$ is the negative of the second-derivative matrix of $\log\{f(x|\theta_i) g_i(\theta_i)\}$ with respect to $\theta_i$, evaluated at $\tilde{\theta}_i$. Then by the Laplace approximation we get

$m_i(x) \approx f(x|\tilde{\theta}_i)\, g_i(\tilde{\theta}_i) \int \exp\left\{-\frac{1}{2}(\theta_i - \tilde{\theta}_i)^T H_{\tilde{\theta}_i}(\theta_i - \tilde{\theta}_i)\right\} d\theta_i = f(x|\tilde{\theta}_i)\, g_i(\tilde{\theta}_i)\,(2\pi)^{p_i/2}\, |H_{\tilde{\theta}_i}|^{-1/2}$.   (5)

The Bayes factor is often reported on the log scale, and $2\log(B_{01})$ is used as an evidential measure of the support provided by the data $x$ for $M_0$ relative to $M_1$. Using the above approximation we get

$2\log(B_{01}) \approx 2\log\left\{\frac{f(x|\tilde{\theta}_0)}{f(x|\tilde{\theta}_1)}\right\} + 2\log\left\{\frac{g_0(\tilde{\theta}_0)}{g_1(\tilde{\theta}_1)}\right\} + (p_0 - p_1)\log(2\pi) + \log\left\{\frac{|H_{\tilde{\theta}_1}|}{|H_{\tilde{\theta}_0}|}\right\}$.   (6)

If $\hat{\theta}_i$ denotes the MLE of $\theta_i$ under model $M_i$, then since $\tilde{\theta}_i - \hat{\theta}_i = O(1/n)$, by ignoring all terms of order $O(n^{-1})$ we have the following alternative approximation:

$2\log(B_{01}) \approx 2\log\left\{\frac{f(x|\hat{\theta}_0)}{f(x|\hat{\theta}_1)}\right\} + 2\log\left\{\frac{g_0(\hat{\theta}_0)}{g_1(\hat{\theta}_1)}\right\} + (p_0 - p_1)\log(2\pi) + \log\left\{\frac{|H_{\hat{\theta}_1}|}{|H_{\hat{\theta}_0}|}\right\}$.   (7)

If $\hat{H}_{\hat{\theta}_i}$ denotes the per-observation observed Fisher information matrix for $\theta_i$ in model $M_i$, then using $\log(|H_{\hat{\theta}_i}|) = p_i \log(n) + \log(|\hat{H}_{\hat{\theta}_i}|)$, an approximation to (7) correct to $O(1)$

is

$2\log(B_{01}) \approx 2\log\left\{\frac{f(x|\hat{\theta}_0)}{f(x|\hat{\theta}_1)}\right\} - (p_0 - p_1)\log n$.   (8)

This is the approximate Bayes factor based on the Bayesian information criterion (BIC) due to Schwarz (1978). The term $(p_0 - p_1)\log n$ can be considered a penalty for using a more complex model. A related criterion is

$2\log(B_{01}) \approx 2\log\left\{\frac{f(x|\hat{\theta}_0)}{f(x|\hat{\theta}_1)}\right\} - 2(p_0 - p_1)$,   (9)

which is based on the Akaike information criterion (AIC), namely, $AIC = 2\log f(x|\hat{\theta}) - 2p$ for a model $f(x|\theta)$. Its penalty for using a complex model is not as drastic as that of BIC.

2 P-value and Posterior Probability of $H_0$ as Measures of Evidence Against the Null

One particular tool from classical statistics that is very widely used in the applied sciences for model checking or hypothesis testing is the P-value. The basic idea behind R. A. Fisher's original (1925) definition of the P-value (see Fisher, 1973) has a great deal of appeal: it is the probability, under a (simple) null hypothesis, of obtaining a value of a test statistic at least as extreme as that observed in the sample data. Suppose that it is desired to test

$H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$,   (10)

and that a classical significance test is available, based on a test statistic $T(X)$, large values of which are deemed to provide evidence against the null hypothesis. If data $X = x$ is observed, with corresponding $t = T(x)$, the P-value then is

$\alpha = P_{\theta_0}\{T(X) \geq T(x)\}$.

Remark 1. Fisher meant the P-value to be used as a measure of the degree of surprise in the data relative to $H_0$. This use of the P-value as a post-experimental or conditional measure of statistical evidence seems to have some intuitive justification. However, Bayesians

have raised various objections to the use of the P-value as evidence against $H_0$. The P-value appears to be too strict against $H_0$. To a Bayesian, the posterior probability of $H_0$ summarizes the evidence against $H_0$. In many common testing problems, the P-value is smaller than the posterior probability of $H_0$ by an order of magnitude. The reason for this is that the P-value ignores the likelihood of the data under the alternative, and it takes into account not only the observed deviation of the data from the null hypothesis, as measured by the test statistic, but also more extreme deviations. We give a simple example below where the P-value can be quite different from the posterior probability.

Example 1. Suppose, for known $\sigma^2$, $\bar{X}|\theta \sim N(\theta, \sigma^2/n)$. We wish to test $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$. Let $t = \sqrt{n}(\bar{x} - \theta_0)/\sigma$ be the observed value of the test statistic. It can be checked that the P-value is given by $\alpha = 2[1 - \Phi(|t|)]$, where $\Phi(\cdot)$ is the standard normal cdf. For the point null hypothesis, on the set $\{\theta \neq \theta_0\}$, let $\theta$ have the density ($g_1$) of $N(\mu, \tau^2)$. Then under $H_1$, marginally, $\bar{X} \sim N(\mu, \tau^2 + n^{-1}\sigma^2)$. If $\rho = (\sigma/\sqrt{n})/\tau$ and $\eta = (\theta_0 - \mu)/\tau$, then

$B_{01} = \frac{(2\pi n^{-1}\sigma^2)^{-1/2}\exp[-n(\bar{x} - \theta_0)^2/(2\sigma^2)]}{(2\pi(\tau^2 + n^{-1}\sigma^2))^{-1/2}\exp[-(\bar{x} - \mu)^2/(2(\tau^2 + n^{-1}\sigma^2))]}$

$= (1 + \rho^{-2})^{1/2}\exp\left[-\frac{t^2}{2} + \frac{(\rho t + \eta)^2}{2(1 + \rho^2)}\right]$,

using $\bar{x} - \mu = (\bar{x} - \theta_0) + (\theta_0 - \mu) = \tau(\rho t + \eta)$, so that $(\bar{x} - \mu)^2/(\tau^2 + n^{-1}\sigma^2) = (\rho t + \eta)^2/(1 + \rho^2)$. Completing the square in $t$ gives

$B_{01} = (1 + \rho^{-2})^{1/2}\exp\left\{-\frac{1}{2}\left[\frac{(t - \rho\eta)^2}{1 + \rho^2} - \eta^2\right]\right\}$.

Now, if we choose $\mu = \theta_0$, $\tau = \sigma$ and $\pi_0 = 1/2$, then $\eta = 0$ and $\rho^2 = 1/n$, and we get

$B_{01} = (1 + n)^{1/2}\exp\left[-\frac{t^2}{2}\,\frac{n}{n + 1}\right]$ and $P(H_0|x) = (1 + B_{01}^{-1})^{-1}$.

For various values of $t$ and $n$, the different measures of evidence, $\alpha$ = P-value, $B$ = Bayes factor $B_{01}$, and $P = P(H_0|x)$, are displayed in the table below (taken from GDS).
It may be noted that the posterior probability of $H_0$ varies between 4 and 5 times the corresponding P-value, which is an indication of how different these two measures of evidence can be.
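The three measures in Example 1 are straightforward to compute. The following Python sketch (not from GDS; the values of $t$ and $n$ are illustrative) uses the closed forms derived above with $\mu = \theta_0$, $\tau = \sigma$, $\pi_0 = 1/2$:

```python
from math import sqrt, exp
from statistics import NormalDist

def evidence_measures(t, n):
    """P-value, Bayes factor B01, and P(H0|x) from Example 1, with
    mu = theta0, tau = sigma (so rho^2 = 1/n, eta = 0) and pi0 = 1/2."""
    Phi = NormalDist().cdf
    alpha = 2.0 * (1.0 - Phi(abs(t)))                    # two-sided P-value
    B01 = sqrt(1.0 + n) * exp(-0.5 * t * t * n / (n + 1.0))
    P_H0 = 1.0 / (1.0 + 1.0 / B01)                       # posterior prob, pi0 = 1/2
    return alpha, B01, P_H0

for n in (10, 100, 1000):
    a, B, P = evidence_measures(1.96, n)
    print(f"n={n:5d}  P-value={a:.4f}  B01={B:.3f}  P(H0|x)={P:.3f}")
```

At the conventional "significant" value $t = 1.96$ the P-value stays at about $.05$ for every $n$, while $P(H_0|x)$ is far larger and grows with $n$, anticipating the Jeffreys-Lindley paradox discussed below.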

Table 1: Normal example: measures of evidence $\alpha$, $B$, $P$ for various $n$ and $t$. (Entries omitted in this transcription; see GDS.)

2.1 Interval null hypotheses and one-sided tests

Closely related to a sharp null hypothesis $H_0: \theta = \theta_0$ is an interval null hypothesis $H_0: |\theta - \theta_0| \leq \epsilon$. The conflict between the P-values and the posterior probabilities remains if $\epsilon$ is small. Clearly, the disagreement also depends on the sample size $n$.

However, the situation is somewhat different for one-sided null and alternative hypotheses. If $\theta$ is the normal mean, then with a uniform prior a direct calculation shows that the P-value for testing $H_0: \theta \leq \theta_0$ versus $H_1: \theta > \theta_0$ is equal to the posterior probability of $H_0$. In general, these two values are not the same, and one may be higher or lower than the other depending on the family of densities in the model.

2.2 Jeffreys-Lindley Paradox

Suppose $X_1, \ldots, X_n$ are iid $N(\theta, \sigma^2)$, $\sigma^2$ known, and consider testing $H_0: \theta = \theta_0$ versus $H_1: \theta \neq \theta_0$. We now show that, for a fixed prior density, the conflict between the P-value and the posterior probability of $H_0$ is enhanced as $n$ goes to infinity. This is known as the Jeffreys-Lindley paradox. Without loss of generality take $\theta_0 = 0$. Consider a uniform prior density over some interval $(-a, a)$. The posterior probability of $H_0$ given $\bar{X}$ is

$P(H_0|\bar{X}) = \frac{\pi_0 \exp[-n\bar{X}^2/(2\sigma^2)]}{K}$,   (11)

where $\pi_0$ is the specified prior probability of $H_0$ and

$K = \pi_0 \exp[-n\bar{X}^2/(2\sigma^2)] + \frac{1 - \pi_0}{2a}\int_{-a}^{a}\exp[-n(\bar{X} - \theta)^2/(2\sigma^2)]\,d\theta$.

Suppose the data are such that $\bar{X} = z_{\alpha/2}\,\sigma/\sqrt{n}$, where $z_{\alpha/2}$ is the $100(1 - \alpha/2)$% quantile of the standard normal. Then $\bar{X}$ is just significant at level $\alpha$. Also, for sufficiently large $n$, $\bar{X}$ is well within $(-a, a)$, because $\bar{X}$ tends to zero as $n$ increases. This leads to

$\int_{-a}^{a}\exp[-n(\bar{X} - \theta)^2/(2\sigma^2)]\,d\theta \approx \sigma\sqrt{2\pi/n}$

and hence, from (11),

$P(H_0|\bar{X}) \approx \frac{\pi_0\exp(-z_{\alpha/2}^2/2)}{\pi_0\exp(-z_{\alpha/2}^2/2) + (1 - \pi_0)\,\dfrac{\sigma\sqrt{2\pi/n}}{2a}}$.

Thus $P(H_0|\bar{X}) \to 1$ as $n \to \infty$, whereas the P-value is equal to $\alpha$ for all $n$. This is known as the Jeffreys-Lindley paradox. The phenomenon continues to hold with any sufficiently flat prior in place of the uniform. Indeed, P-values cannot be compared across sample sizes or across experiments. Even a frequentist tends to agree that the conventional values of the significance level $\alpha$, like $\alpha = .05$ or $.01$, are too large for large sample sizes. The Jeffreys-Lindley paradox shows that, for inference about $\theta$, P-values and Bayes factors may provide contradictory evidence and hence can lead to opposite decisions. The evidence against $H_0$ suggested by the P-value seems unrealistically high.

3 Bayesian P-value

Bayes factors and, more appropriately, posterior probabilities of hypotheses are in principle the correct tools to measure evidence for or against hypotheses. But they are often hard, or even impossible, to compute when the alternative is vaguely specified or not specified at all. Bayesian P-values have been proposed to deal with such problems.

Let $M_0$ be a target model, departure from which is of interest. If, under this model, $X$ has density $f(x|\eta)$, $\eta \in \Xi$, then for a Bayesian with prior $\pi$ on $\eta$, the prior predictive distribution

$m_\pi(x) = \int_\Xi f(x|\eta)\,\pi(\eta)\,d\eta$

is the actual predictive distribution of $X$. Therefore, if a model departure statistic $T(X)$ is available, one can define the prior predictive P-value (the tail area under the predictive distribution) as

$p = P^{m_\pi}\{T(X) \geq T(x_{obs}) \mid M_0\}$,

where $x_{obs}$ is the observed value of $X$. This quantity is heavily influenced by the prior distribution.

Example 2. Let $X_1, \ldots, X_n$ be a random sample from $N(\theta, \sigma^2)$. Suppose $H_0: \theta = 0$.

Case (a): If $\sigma^2$ is assumed to be known, equal to $\sigma_0^2$, then using the statistic $\bar{X}$ the predictive P-value is

$p = P[\sqrt{n}|\bar{X}| > \sqrt{n}|\bar{x}_{obs}| \mid \theta = 0, \sigma_0^2] = 2\Phi\left(-\frac{\sqrt{n}|\bar{x}_{obs}|}{\sigma_0}\right)$.

If $\sigma_0^2$ highly underestimates the actual model variance, then $p$ is very small and the evidence against $H_0$ is overestimated.
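The sensitivity of case (a) to the assumed variance is easy to see numerically; a minimal Python sketch (not from GDS; all numbers are illustrative):

```python
from math import sqrt
from statistics import NormalDist

def prior_pred_pvalue_known_var(xbar_obs, n, sigma0):
    """Case (a): p = 2*Phi(-sqrt(n)*|xbar_obs|/sigma0), the prior predictive
    P-value when sigma^2 is taken as known, equal to sigma0^2."""
    return 2.0 * NormalDist().cdf(-sqrt(n) * abs(xbar_obs) / sigma0)

# The same data, with sigma0 increasingly underestimating the spread:
# the "evidence" against H0 is inflated by orders of magnitude.
for sigma0 in (1.0, 0.5, 0.25):
    print(sigma0, prior_pred_pvalue_known_var(0.5, 16, sigma0))
```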

Case (b): If $\sigma^2$ is unknown and assigned the improper prior $\pi(\sigma^2) = 1/\sigma^2$, then

$m_\pi(x) = \int f_X(x|\sigma^2)\,\pi(\sigma^2)\,d\sigma^2 \propto \int \exp\left(-\frac{1}{2\sigma^2}\sum x_i^2\right)(\sigma^2)^{-n/2}\,\frac{d\sigma^2}{\sigma^2} \propto \left(\sum x_i^2\right)^{-n/2}$,

which is an improper density, thus completely disallowing computation of the prior predictive P-value. [This is not surprising, since with any improper prior the prior predictive density is improper.]

Case (c): Consider an inverse gamma prior $IG(\nu, \beta)$ with density

$\pi(\sigma^2|\nu, \beta) = \frac{\beta^\nu}{\Gamma(\nu)}(\sigma^2)^{-(\nu+1)}\exp(-\beta/\sigma^2)$,

where $\nu, \beta$ are specified positive constants. Because $T = \sqrt{n}\bar{X}$ satisfies $T|\sigma^2 \sim N(0, \sigma^2)$, under this prior the predictive density of $T$ is

$m_\pi(t) = \int f_T(t|\sigma^2)\,\pi(\sigma^2|\nu, \beta)\,d\sigma^2 \propto \int \exp\left(-\frac{1}{\sigma^2}\left(\beta + \frac{t^2}{2}\right)\right)(\sigma^2)^{-(\nu+1+1/2)}\,d\sigma^2 \propto (2\beta + t^2)^{-(2\nu+1)/2}$.

If $2\nu$ is an integer, the prior predictive distribution of $T$ is that of $T/\sqrt{\beta/\nu} \sim t_{2\nu}$. Then

$p = P^{m_\pi}(|\bar{X}| \geq |\bar{x}_{obs}| \mid M_0) = P^{m_\pi}\left(\frac{|T|}{\sqrt{\beta/\nu}} \geq \frac{\sqrt{n}|\bar{x}_{obs}|}{\sqrt{\beta/\nu}}\,\Big|\, M_0\right) = 2\left[1 - F_{2\nu}\left(\frac{\sqrt{n}|\bar{x}_{obs}|}{\sqrt{\beta/\nu}}\right)\right]$,

where $F_{2\nu}$ is the cdf of $t_{2\nu}$. For $\sqrt{n}|\bar{x}_{obs}| = 1.96$ and various values of $\nu$ and $\beta$, the corresponding values of the prior predictive P-value are displayed in Table 2.

Table 2: Normal example: prior predictive P-values for various $\nu$ and $\beta$. (Entries omitted in this transcription; see GDS.)

Further, note that $p \to 1$ as $\beta \to \infty$ for any fixed $\nu > 0$. Thus the prior predictive P-value in this example depends crucially on the values of $\beta$ and $\nu$.
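The dependence of case (c) on $(\nu, \beta)$ can be checked by simulation. The following sketch (not from GDS; the Monte Carlo scheme and the hyperparameter values are illustrative) draws $\sigma^2 \sim IG(\nu, \beta)$ and then $T|\sigma^2 \sim N(0, \sigma^2)$:

```python
import random
from math import sqrt

def prior_predictive_pvalue(t_obs, nu, beta, n_sim=100_000, seed=1):
    """Monte Carlo estimate of p = P(|T| >= |t_obs| | M0), where
    T | sigma^2 ~ N(0, sigma^2) and sigma^2 ~ IG(nu, beta)
    (equivalently, T / sqrt(beta/nu) ~ t_{2*nu})."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        # 1 / Gamma(shape=nu, scale=1/beta) is an IG(nu, beta) draw
        sigma2 = 1.0 / rng.gammavariate(nu, 1.0 / beta)
        hits += abs(rng.gauss(0.0, sqrt(sigma2))) >= abs(t_obs)
    return hits / n_sim

# sqrt(n)|xbar_obs| = 1.96: the P-value moves substantially with (nu, beta)
for nu, beta in [(1.0, 1.0), (1.0, 10.0), (5.0, 5.0)]:
    print(nu, beta, prior_predictive_pvalue(1.96, nu, beta))
```

Increasing $\beta$ inflates the prior predictive spread of $T$, which drags $p$ toward 1 exactly as the limit above indicates.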

What we learn from this example is that if the prior $\pi$ used is a poor choice, even an excellent model can come under suspicion on the basis of the prior predictive P-value. We have also seen that an improper prior always produces an improper predictive density, which is an undesirable feature. To rectify these problems, a modification has been suggested: replace $\pi$ in $m_\pi$ by $\pi(\eta|x_{obs})$ to define the posterior predictive density and the posterior predictive P-value:

$m^*(x|x_{obs}) = \int f(x|\eta)\,\pi(\eta|x_{obs})\,d\eta$,
$p = P^{m^*(\cdot|x_{obs})}(T(X) \geq T(x_{obs}))$.

Example 3. (Example 2 continued.) We consider the noninformative prior $\pi(\sigma^2) \propto 1/\sigma^2$ again. Then, as before, $T|\sigma^2 \sim N(0, \sigma^2)$, and

$\pi(\sigma^2|x_{obs}) \propto \exp\left(-\frac{1}{2\sigma^2}\sum x_i^2\right)(\sigma^2)^{-\frac{n+2}{2}}$,

leading to the posterior predictive density of $T$:

$m^*(t|x_{obs}) = \int f_T(t|\sigma^2)\,\pi(\sigma^2|x_{obs})\,d\sigma^2 \propto \int (\sigma^2)^{-1/2}\exp\left(-\frac{t^2}{2\sigma^2}\right)\exp\left(-\frac{1}{2\sigma^2}\sum x_i^2\right)(\sigma^2)^{-\frac{n+2}{2}}\,d\sigma^2 \propto \left(1 + \frac{t^2}{\sum x_i^2}\right)^{-(n+1)/2}$.

This implies that the posterior predictive distribution of $T$ is that of $T/\sqrt{\sum x_i^2/n} \sim t_n$. Thus the posterior predictive P-value is

$p = P^{m^*(\cdot|x_{obs})}(|\bar{X}| \geq |\bar{x}_{obs}| \mid M_0) = 2\left[1 - F_n\left(\frac{\sqrt{n}|\bar{x}_{obs}|}{\sqrt{\sum x_i^2/n}}\right)\right]$,

where $F_n$ is the cdf of the $t_n$ distribution. This definition of a Bayesian P-value does not seem satisfactory. Let $|\bar{x}_{obs}| \to \infty$. Note that then $p \to 2(1 - F_n(\sqrt{n}))$.
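This limiting behavior can be seen by simulation. A sketch (not from GDS; the data and Monte Carlo scheme are illustrative) draws $\sigma^2$ from its posterior and $T$ from $N(0, \sigma^2)$:

```python
import random
from math import sqrt

def posterior_predictive_pvalue(x, n_sim=100_000, seed=1):
    """Monte Carlo posterior predictive P-value for T = sqrt(n)*|Xbar|
    under M0: theta = 0 with prior pi(sigma^2) proportional to 1/sigma^2.
    The posterior is sigma^2 | x ~ IG(n/2, sum(x_i^2)/2)."""
    rng = random.Random(seed)
    n, ss = len(x), sum(xi * xi for xi in x)
    t_obs = sqrt(n) * abs(sum(x) / n)
    hits = 0
    for _ in range(n_sim):
        sigma2 = 1.0 / rng.gammavariate(n / 2.0, 2.0 / ss)  # IG(n/2, ss/2) draw
        hits += abs(rng.gauss(0.0, sqrt(sigma2))) >= t_obs
    return hits / n_sim

# Even for wildly discrepant data the P-value stays near 2(1 - F_n(sqrt(n))):
print(posterior_predictive_pvalue([100.0] * 10))
```

For this extreme data set the estimate stabilizes near one percent rather than tending to zero: however discrepant the observations, the P-value cannot fall below the $2(1 - F_n(\sqrt{n}))$ floor.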

Table 3: Values of $p_n = 2(1 - F_n(\sqrt{n}))$ for various $n$. (Entries omitted in this transcription; see GDS.)

Note that these limiting values have no serious relationship with the observations and hence cannot really be used for model checking. Bayarri and Berger (1998) attributed this behavior to the double use of the data: $x$ has been used both to compute the posterior distribution and to compute the tail-area probability of the posterior predictive distribution.

In an effort to combine the desirable features of the prior predictive and the posterior predictive P-values while eliminating the undesirable ones, Bayarri and Berger introduced the conditional predictive P-value. This quantity is based on a predictive distribution of the form $m_\pi$, but it is more heavily influenced by the model than by the prior. Noninformative priors can be used, and there is no double use of the data. In this approach an appropriate statistic $U(X)$, not involving the model departure statistic $T(X)$, is identified. The conditional predictive density $m_\pi(t|u)$ is derived, and the conditional predictive P-value is defined as

$p_c = P^{m(\cdot|u_{obs})}(T(X) \geq T(x_{obs}))$,

where $u_{obs} = U(x_{obs})$. They considered the following example.

Example 4. (Last example continued.) Here $T = \sqrt{n}\bar{X}$ is the model departure statistic for checking discrepancy of the mean in the normal model. Let $U(X) = s^2 = \sum_{i=1}^{n}(X_i - \bar{X})^2/n$. Note that $nU|\sigma^2 \sim \sigma^2\chi^2_{n-1}$. Consider the noninformative prior $\pi(\sigma^2) \propto 1/\sigma^2$. Then

$\pi(\sigma^2|s^2) \propto (\sigma^2)^{-(n-1)/2 - 1}\exp(-ns^2/(2\sigma^2))$

is an inverse gamma density with shape parameter $(n-1)/2$. It can be checked that the conditional predictive density of $T$ given $s^2_{obs}$ is

$m_\pi(t|s^2_{obs}) = \int f_T(t|\sigma^2)\,\pi(\sigma^2|s^2_{obs})\,d\sigma^2 \propto \left(n + \frac{t^2}{s^2_{obs}}\right)^{-n/2}$.

Then the conditional predictive P-value is

$p_c = 2\left[1 - F_{n-1}\left(\frac{\sqrt{n-1}\,|\bar{x}_{obs}|}{s_{obs}}\right)\right]$.

We have found a Bayesian interpretation of the classical P-value from the usual t-test. Note that $s^2_{obs}$ was used to produce the posterior distribution that eliminates $\sigma^2$, and that

$\bar{x}_{obs}$ was then used to compute the tail-area probability. In this example it is easy to identify $U(X)$. In some problems it is not as easy, and the computation of the conditional predictive density may not be straightforward. An alternative possibility is to use the partial posterior predictive P-value, defined by

$p^* = P^{m^*(\cdot)}(T(X) \geq T(x_{obs}))$,

where the predictive density $m^*$ is obtained from a partial posterior density calculated from the conditional likelihood of $X$ given $T(X) = T(x_{obs})$. The partial posterior is

$\pi^*(\eta) \propto f_{X|T}(x_{obs}|t_{obs}, \eta)\,\pi(\eta)$.

For the normal example, check that

$f_{X|\bar{X}}(x_{obs}|\bar{x}_{obs}, \sigma^2) \propto (\sigma^2)^{-(n-1)/2}\exp\left(-\frac{n}{2\sigma^2}s^2_{obs}\right)$.

Thus, for $\pi(\sigma^2) \propto 1/\sigma^2$, the partial posterior is

$\pi^*(\sigma^2) \propto (\sigma^2)^{-(n-1)/2 - 1}\exp\left(-\frac{n s^2_{obs}}{2\sigma^2}\right)$.

In this example, the partial posterior predictive P-value is therefore the same as the conditional predictive P-value.

4 Nonsubjective Bayes factors

Consider two models $M_0$ and $M_1$: under model $M_i$ the density of the data $X$ is $f_i(x|\theta_i)$, $\theta_i$ being an unknown parameter of dimension $p_i$, $i = 0, 1$. Given proper prior densities $g_i(\theta_i)$ for the parameters $\theta_i$, the Bayes factor for $M_1$ relative to $M_0$ is given by

$B_{10} = \frac{m_1(x)}{m_0(x)} = \frac{\int f_1(x|\theta_1)\,g_1(\theta_1)\,d\theta_1}{\int f_0(x|\theta_0)\,g_0(\theta_0)\,d\theta_0}$,   (12)

where $m_i(x)$ is the marginal density of $X$ under $M_i$. If the priors $g_i(\theta_i)$ cannot be subjectively specified, one tends to use a noninformative prior. There are difficulties with (12) for noninformative priors, which are typically improper. If $g_i$ is improper, it is defined only up to a positive multiplicative constant $c_i$, and $c_i g_i$ has as much validity as $g_i$. This means that $(c_1/c_0)B_{10}$ has as much validity as $B_{10}$ as a Bayes factor. Thus the Bayes factor remains indeterminate under improper priors. This indeterminacy has been the main motivation for the new objective methods. We will confine attention to the nested case, where $f_0$ and $f_1$ are of the same functional form and $f_0(x|\theta_0)$ is the same as $f_1(x|\theta_1)$ with some of the coordinates of $\theta_1$ specified.
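The indeterminacy under an improper prior can be made concrete numerically. A sketch (not from GDS; the data, the truncation of the "uniform" prior to a wide grid, and the quadrature scheme are all illustrative) computes $B_{10}$ for $M_0$: $N(0,1)$ versus $M_1$: $N(\theta, 1)$ with $g_1(\theta) = c$ and shows that $B_{10}$ scales linearly in the arbitrary constant $c$:

```python
from math import exp, pi, sqrt

def B10_uniform_prior(x, c, half_width=30.0, steps=12_001):
    """B10 = m1/m0 for M0: N(0,1) vs M1: N(theta,1) with g1(theta) = c on a
    wide grid (a stand-in for the improper uniform prior), by trapezoidal rule.
    The common (2*pi)^(-n/2) factor cancels and is dropped from m0 and m1."""
    sx2 = sum(xi * xi for xi in x)
    def lik(theta):
        return exp(-0.5 * sum((xi - theta) ** 2 for xi in x))
    h = 2.0 * half_width / (steps - 1)
    grid = [-half_width + i * h for i in range(steps)]
    m1 = c * h * (sum(lik(t) for t in grid) - 0.5 * (lik(grid[0]) + lik(grid[-1])))
    m0 = exp(-0.5 * sx2)
    return m1 / m0

x = [0.4, -0.2, 1.1, 0.3]
b1 = B10_uniform_prior(x, c=1.0)
b2 = B10_uniform_prior(x, c=10.0)
print(b1, b2, b2 / b1)   # the ratio is 10: B10 is only defined up to c
```

The numerical value with $c = 1$ agrees with the closed form $c\sqrt{2\pi/n}\,\exp(n\bar{X}^2/2)$ derived in Example 5 below, but any other $c$ is equally defensible, which is exactly the indeterminacy at issue.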
We show below that the use of a diffuse proper prior in place of an improper prior does not rectify the above-mentioned deficiency of the Bayes factor.

Example 5. (Testing a normal mean with known variance.) Suppose we observe $X = (X_1, \ldots, X_n)$. Under $M_0$ the $X_i$ are iid $N(0, 1)$, and under $M_1$ the $X_i$ are iid $N(\theta, 1)$, $\theta$ real and

unknown. With the uniform noninformative prior $g_1^N(\theta) = c$ for $\theta$ under $M_1$, the Bayes factor is

$B_{10}^N = c\sqrt{\frac{2\pi}{n}}\exp\left(\frac{n\bar{X}^2}{2}\right)$.

For a uniform prior for $\theta$ over $[-K, K]$ with $K$ large, the new Bayes factor $B_{10}^K$ satisfies

$B_{10}^K = \frac{B_{10}^N}{2Kc}$.

Thus for large $K$ the Bayes factor $B_{10}^K$ is biased against $M_1$. A similar conclusion is obtained if we use a diffuse proper prior $N(0, \tau^2)$ with $\tau^2$ large. The Bayes factor is then

$B_{10}^{norm} = (n\tau^2 + 1)^{-1/2}\exp\left[\frac{n\tau^2}{n\tau^2 + 1}\,\frac{n\bar{X}^2}{2}\right]$,

which is approximately $(n\tau^2)^{-1/2}\exp[n\bar{X}^2/2]$ for large $n\tau^2$. This can be made arbitrarily small by taking $\tau^2$ arbitrarily large, which is clearly undesirable.

A solution to this instability of the Bayes factor under a noninformative prior is to use part of the data as a training sample, dividing $X = (X_1, X_2)$; we assume independence of $X_1$ and $X_2$. The first subset $X_1$ is treated as a training sample to convert a diffuse (improper) prior into a proper posterior distribution for the parameters given $X_1$. For the prior $g_i(\theta_i)$, the (training) posterior is

$g_i(\theta_i|X_1) = \frac{f_i(X_1|\theta_i)\,g_i(\theta_i)}{\int f_i(X_1|\theta_i)\,g_i(\theta_i)\,d\theta_i}$, $i = 0, 1$.

These proper posteriors are then used as priors to compute the Bayes factor with the remaining data $X_2$. The conditional Bayes factor $B_{10}(X_1)$, conditioned on $X_1$, can be expressed as

$B_{10}(X_1) = \frac{\int f_1(X_2|\theta_1)\,g_1(\theta_1|X_1)\,d\theta_1}{\int f_0(X_2|\theta_0)\,g_0(\theta_0|X_1)\,d\theta_0} = \frac{\int f_1(X_2|\theta_1)f_1(X_1|\theta_1)\,g_1(\theta_1)\,d\theta_1\,/\,m_1(X_1)}{\int f_0(X_2|\theta_0)f_0(X_1|\theta_0)\,g_0(\theta_0)\,d\theta_0\,/\,m_0(X_1)} = \frac{m_1(X)}{m_0(X)}\cdot\frac{m_0(X_1)}{m_1(X_1)} = B_{10}\,\frac{m_0(X_1)}{m_1(X_1)}$,   (13)

where $m_i(X_1)$ is the marginal density of $X_1$ under $M_i$, $i = 0, 1$. Note that if the priors $c_i g_i$, $i = 0, 1$, are used to compute $B_{10}(X_1)$, the arbitrary constant multiplier $c_1/c_0$ of $B_{10}$ is cancelled by the factor $c_0/c_1$ in $m_0(X_1)/m_1(X_1)$, so that the indeterminacy of the Bayes factor is removed in (13).
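The instability seen in Example 5 under a diffuse proper prior can be checked numerically; a minimal sketch (not from GDS; the data values are illustrative):

```python
from math import sqrt, exp

def bayes_factor_normal_prior(xbar, n, tau2):
    """B10 for M0: N(0,1) vs M1: N(theta,1), prior theta ~ N(0, tau^2):
    B10 = (n*tau^2 + 1)^(-1/2) * exp((n*tau^2/(n*tau^2 + 1)) * n*xbar^2/2)."""
    c = n * tau2
    return exp(c / (c + 1.0) * n * xbar * xbar / 2.0) / sqrt(c + 1.0)

# The same data under an increasingly diffuse prior: B10 -> 0 regardless of x.
n, xbar = 20, 0.5
for tau2 in (1.0, 100.0, 10_000.0, 1_000_000.0):
    print(tau2, bayes_factor_normal_prior(xbar, n, tau2))
```

The exponential factor converges to the fixed value $\exp(n\bar{X}^2/2)$ while the $(n\tau^2 + 1)^{-1/2}$ factor keeps shrinking, so evidence apparently mounts against $M_1$ purely as an artifact of the prior's diffuseness.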

It follows from the preceding discussion that $X_1$ may be used as a training sample if the corresponding posteriors $g_i(\theta_i|X_1)$, $i = 0, 1$, are proper or, equivalently, if the marginal densities $m_i(X_1)$ of $X_1$ under $M_i$, $i = 0, 1$, are finite. Clearly, one should use a minimal amount of data as the training sample and keep most of the data for model comparison. Berger and Pericchi called $X_1$ a minimal training sample if $0 < m_i(X_1) < \infty$, $i = 0, 1$, while the marginals are not all finite for any proper subset of $X_1$.

Example 6. Consider testing that a normal mean equals zero, with known variance. Under the uniform noninformative prior $g_1(\theta_1) = 1$ under $M_1$, the minimal training samples are the subsamples of size 1, with

$m_0(X_i) = \frac{1}{\sqrt{2\pi}}\exp(-X_i^2/2)$ and $m_1(X_i) = \int \frac{1}{\sqrt{2\pi}}\exp(-(X_i - \theta)^2/2)\,d\theta = 1$.

4.1 The intrinsic Bayes factor and the fractional Bayes factor

We have described above the conditional Bayes factor $B_{10}(X_1)$ corresponding to an improper prior and a minimal training sample $X_1$. However, this choice depends on $X_1$, for which there may be many possibilities. To remove the arbitrariness, Berger and Pericchi suggested combining the conditional Bayes factors over all possible minimal training samples to define an intrinsic Bayes factor. If $X(l)$, $l = 1, \ldots, L$, denotes the list of all possible minimal training samples, they defined the arithmetic intrinsic Bayes factor (AIBF) as

$AIBF_{10} = B_{10}\,\frac{1}{L}\sum_{l=1}^{L}\frac{m_0(X(l))}{m_1(X(l))}$.   (14)

The geometric intrinsic Bayes factor is

$GIBF_{10} = B_{10}\left(\prod_{l=1}^{L}\frac{m_0(X(l))}{m_1(X(l))}\right)^{1/L}$.

A different solution to the model selection problem with improper priors is due to O'Hagan. He proposed using a fractional power of the likelihood to convert an improper prior into a proper posterior, and then combining this posterior with the remaining fraction of the likelihood to obtain the marginal densities under the models and hence the Bayes factor.
The resulting partial Bayes factor, called the fractional Bayes factor (FBF), is given by

$FBF_{10} = \frac{m_1(X, b)}{m_0(X, b)}$,

where $0 < b < 1$ is appropriately chosen and

$m_i(X, b) = \int f_i^{1-b}(X|\theta_i)\,\frac{f_i^b(X|\theta_i)\,g_i(\theta_i)}{\int f_i^b(X|\theta_i)\,g_i(\theta_i)\,d\theta_i}\,d\theta_i = \frac{\int f_i(X|\theta_i)\,g_i(\theta_i)\,d\theta_i}{\int f_i^b(X|\theta_i)\,g_i(\theta_i)\,d\theta_i}$.

Note that $FBF_{10}$ can also be written as

$FBF_{10} = B_{10}\,\frac{m_0^b(X)}{m_1^b(X)}$,

13 where m b i(x) = f b i (X θ i )g i (θ i )dθ i, i =, 1. To make FBF comparable with the IBF, we can take b = m/n, where m is the size of a minimal training sample. O Hagan also recommends other choices of b such as n/n or log n/n. Example 7. Consider testing the normal mean equal to zero for known variance. The Bayes factor with the noninformative prior g 1 (θ 1 ) = 1 is 2π 2 X B 1 = exp(n n 2 ). Hence, B 1 (X i ) = B 1 m (X i )/m 1 (X i ) = B 1 (1/ 2π) exp( X i 2 2 ). Thus AIBF 1 = n 1 B 1 (X i ) = n 3/2 exp(n X 2 /2) exp( X i 2 2 ), GIBF 1 = n 1/2 exp[n X 2 /2 (1/2n) Xi 2 ]. Note that for a fraction < b < 1, ( 1 ) bn m b b X (X) = exp[ i 2 ] 2π 2 ( 1 ) bn m b b (X 1(X) = i X) 2 2π exp[ ] 2π 2 bn m b 2 nb X nb = exp[ ] 2 2π. Hence the FBF is m b 1 F BF 1 = b 1/2 exp[n(1 b) X 2 /2] = n 1/2 exp[(n 1) X 2 /2], if b = 1/n. See Chapter 6, pp , of GDS for more examples. Exercise: Suppose X 1, X 2 are iid with a location-scale pdf f(x µ, σ) = 1 σ f(x µ ), < µ <, σ >. σ Show that 1 σ f(x 1 µ )f( x 2 µ 1 )dµdσ = 3 σ σ 2 x 1 x 2. Note: This result was discovered through simulations. 13


Bayes Factors for Goodness of Fit Testing

Bayes Factors for Goodness of Fit Testing Carnegie Mellon University Research Showcase @ CMU Department of Statistics Dietrich College of Humanities and Social Sciences 10-31-2003 Bayes Factors for Goodness of Fit Testing Fulvio Spezzaferri University

More information

Introduction to Bayesian Methods

Introduction to Bayesian Methods Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative

More information

Bayesian Inference: Concept and Practice

Bayesian Inference: Concept and Practice Inference: Concept and Practice fundamentals Johan A. Elkink School of Politics & International Relations University College Dublin 5 June 2017 1 2 3 Bayes theorem In order to estimate the parameters of

More information

P Values and Nuisance Parameters

P Values and Nuisance Parameters P Values and Nuisance Parameters Luc Demortier The Rockefeller University PHYSTAT-LHC Workshop on Statistical Issues for LHC Physics CERN, Geneva, June 27 29, 2007 Definition and interpretation of p values;

More information

Overall Objective Priors

Overall Objective Priors Overall Objective Priors Jim Berger, Jose Bernardo and Dongchu Sun Duke University, University of Valencia and University of Missouri Recent advances in statistical inference: theory and case studies University

More information

Default priors and model parametrization

Default priors and model parametrization 1 / 16 Default priors and model parametrization Nancy Reid O-Bayes09, June 6, 2009 Don Fraser, Elisabeta Marras, Grace Yun-Yi 2 / 16 Well-calibrated priors model f (y; θ), F(y; θ); log-likelihood l(θ)

More information

A Very Brief Summary of Bayesian Inference, and Examples

A Very Brief Summary of Bayesian Inference, and Examples A Very Brief Summary of Bayesian Inference, and Examples Trinity Term 009 Prof Gesine Reinert Our starting point are data x = x 1, x,, x n, which we view as realisations of random variables X 1, X,, X

More information

COMP90051 Statistical Machine Learning

COMP90051 Statistical Machine Learning COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide

More information

Unified Frequentist and Bayesian Testing of a Precise Hypothesis

Unified Frequentist and Bayesian Testing of a Precise Hypothesis Statistical Science 1997, Vol. 12, No. 3, 133 160 Unified Frequentist and Bayesian Testing of a Precise Hypothesis J. O. Berger, B. Boukai and Y. Wang Abstract. In this paper, we show that the conditional

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 1) Fall 2017 1 / 10 Lecture 7: Prior Types Subjective

More information

Integrated Objective Bayesian Estimation and Hypothesis Testing

Integrated Objective Bayesian Estimation and Hypothesis Testing Integrated Objective Bayesian Estimation and Hypothesis Testing José M. Bernardo Universitat de València, Spain jose.m.bernardo@uv.es 9th Valencia International Meeting on Bayesian Statistics Benidorm

More information

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006 Hypothesis Testing Part I James J. Heckman University of Chicago Econ 312 This draft, April 20, 2006 1 1 A Brief Review of Hypothesis Testing and Its Uses values and pure significance tests (R.A. Fisher)

More information

A Note on Hypothesis Testing with Random Sample Sizes and its Relationship to Bayes Factors

A Note on Hypothesis Testing with Random Sample Sizes and its Relationship to Bayes Factors Journal of Data Science 6(008), 75-87 A Note on Hypothesis Testing with Random Sample Sizes and its Relationship to Bayes Factors Scott Berry 1 and Kert Viele 1 Berry Consultants and University of Kentucky

More information

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j

Standard Errors & Confidence Intervals. N(0, I( β) 1 ), I( β) = [ 2 l(β, φ; y) β i β β= β j Standard Errors & Confidence Intervals β β asy N(0, I( β) 1 ), where I( β) = [ 2 l(β, φ; y) ] β i β β= β j We can obtain asymptotic 100(1 α)% confidence intervals for β j using: β j ± Z 1 α/2 se( β j )

More information

Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models

Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models D. Fouskakis, I. Ntzoufras and D. Draper December 1, 01 Summary: In the context of the expected-posterior prior (EPP) approach

More information

Statistical Inference

Statistical Inference Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Spring, 2006 1. DeGroot 1973 In (DeGroot 1973), Morrie DeGroot considers testing the

More information

STAT 740: Testing & Model Selection

STAT 740: Testing & Model Selection STAT 740: Testing & Model Selection Timothy Hanson Department of Statistics, University of South Carolina Stat 740: Statistical Computing 1 / 34 Testing & model choice, likelihood-based A common way to

More information

MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY

MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY ECO 513 Fall 2008 MODEL COMPARISON CHRISTOPHER A. SIMS PRINCETON UNIVERSITY SIMS@PRINCETON.EDU 1. MODEL COMPARISON AS ESTIMATING A DISCRETE PARAMETER Data Y, models 1 and 2, parameter vectors θ 1, θ 2.

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 20 Lecture 6 : Bayesian Inference

More information

Bayesian Model Comparison

Bayesian Model Comparison BS2 Statistical Inference, Lecture 11, Hilary Term 2009 February 26, 2009 Basic result An accurate approximation Asymptotic posterior distribution An integral of form I = b a e λg(y) h(y) dy where h(y)

More information

Stat260: Bayesian Modeling and Inference Lecture Date: March 10, 2010

Stat260: Bayesian Modeling and Inference Lecture Date: March 10, 2010 Stat60: Bayesian Modelin and Inference Lecture Date: March 10, 010 Bayes Factors, -priors, and Model Selection for Reression Lecturer: Michael I. Jordan Scribe: Tamara Broderick The readin for this lecture

More information

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks

ST440/540: Applied Bayesian Statistics. (9) Model selection and goodness-of-fit checks (9) Model selection and goodness-of-fit checks Objectives In this module we will study methods for model comparisons and checking for model adequacy For model comparisons there are a finite number of candidate

More information

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources

STA 732: Inference. Notes 10. Parameter Estimation from a Decision Theoretic Angle. Other resources STA 732: Inference Notes 10. Parameter Estimation from a Decision Theoretic Angle Other resources 1 Statistical rules, loss and risk We saw that a major focus of classical statistics is comparing various

More information

Bayesian Asymptotics

Bayesian Asymptotics BS2 Statistical Inference, Lecture 8, Hilary Term 2008 May 7, 2008 The univariate case The multivariate case For large λ we have the approximation I = b a e λg(y) h(y) dy = e λg(y ) h(y ) 2π λg (y ) {

More information

A union of Bayesian, frequentist and fiducial inferences by confidence distribution and artificial data sampling

A union of Bayesian, frequentist and fiducial inferences by confidence distribution and artificial data sampling A union of Bayesian, frequentist and fiducial inferences by confidence distribution and artificial data sampling Min-ge Xie Department of Statistics, Rutgers University Workshop on Higher-Order Asymptotics

More information

Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation

Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation Spatial Statistics Chapter 4 Basics of Bayesian Inference and Computation So far we have discussed types of spatial data, some basic modeling frameworks and exploratory techniques. We have not discussed

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Stable Limit Laws for Marginal Probabilities from MCMC Streams: Acceleration of Convergence

Stable Limit Laws for Marginal Probabilities from MCMC Streams: Acceleration of Convergence Stable Limit Laws for Marginal Probabilities from MCMC Streams: Acceleration of Convergence Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham NC 778-5 - Revised April,

More information

Model comparison. Christopher A. Sims Princeton University October 18, 2016

Model comparison. Christopher A. Sims Princeton University October 18, 2016 ECO 513 Fall 2008 Model comparison Christopher A. Sims Princeton University sims@princeton.edu October 18, 2016 c 2016 by Christopher A. Sims. This document may be reproduced for educational and research

More information

M. J. Bayarri and M. E. Castellanos. University of Valencia and Rey Juan Carlos University

M. J. Bayarri and M. E. Castellanos. University of Valencia and Rey Juan Carlos University 1 BAYESIAN CHECKING OF HIERARCHICAL MODELS M. J. Bayarri and M. E. Castellanos University of Valencia and Rey Juan Carlos University Abstract: Hierarchical models are increasingly used in many applications.

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 32 Lecture 14 : Variational Bayes

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology

40.530: Statistics. Professor Chen Zehua. Singapore University of Design and Technology Singapore University of Design and Technology Lecture 9: Hypothesis testing, uniformly most powerful tests. The Neyman-Pearson framework Let P be the family of distributions of concern. The Neyman-Pearson

More information

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible

More information

Harrison B. Prosper. CMS Statistics Committee

Harrison B. Prosper. CMS Statistics Committee Harrison B. Prosper Florida State University CMS Statistics Committee 08-08-08 Bayesian Methods: Theory & Practice. Harrison B. Prosper 1 h Lecture 3 Applications h Hypothesis Testing Recap h A Single

More information

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1 Chapter 4 HOMEWORK ASSIGNMENTS These homeworks may be modified as the semester progresses. It is your responsibility to keep up to date with the correctly assigned homeworks. There may be some errors in

More information

A noninformative Bayesian approach to domain estimation

A noninformative Bayesian approach to domain estimation A noninformative Bayesian approach to domain estimation Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 glen@stat.umn.edu August 2002 Revised July 2003 To appear in Journal

More information

The Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition.

The Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition. Christian P. Robert The Bayesian Choice From Decision-Theoretic Foundations to Computational Implementation Second Edition With 23 Illustrations ^Springer" Contents Preface to the Second Edition Preface

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

Lecture 6: Model Checking and Selection

Lecture 6: Model Checking and Selection Lecture 6: Model Checking and Selection Melih Kandemir melih.kandemir@iwr.uni-heidelberg.de May 27, 2014 Model selection We often have multiple modeling choices that are equally sensible: M 1,, M T. Which

More information

Inference for a Population Proportion

Inference for a Population Proportion Al Nosedal. University of Toronto. November 11, 2015 Statistical inference is drawing conclusions about an entire population based on data in a sample drawn from that population. From both frequentist

More information

A REVERSE TO THE JEFFREYS LINDLEY PARADOX

A REVERSE TO THE JEFFREYS LINDLEY PARADOX PROBABILITY AND MATHEMATICAL STATISTICS Vol. 38, Fasc. 1 (2018), pp. 243 247 doi:10.19195/0208-4147.38.1.13 A REVERSE TO THE JEFFREYS LINDLEY PARADOX BY WIEBE R. P E S T M A N (LEUVEN), FRANCIS T U E R

More information

Bayesian Inference. Chapter 2: Conjugate models

Bayesian Inference. Chapter 2: Conjugate models Bayesian Inference Chapter 2: Conjugate models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

On the Bayesianity of Pereira-Stern tests

On the Bayesianity of Pereira-Stern tests Sociedad de Estadística e Investigación Operativa Test (2001) Vol. 10, No. 2, pp. 000 000 On the Bayesianity of Pereira-Stern tests M. Regina Madruga Departamento de Estatística, Universidade Federal do

More information

Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models

Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models Introduction to Bayesian Statistics with WinBUGS Part 4 Priors and Hierarchical Models Matthew S. Johnson New York ASA Chapter Workshop CUNY Graduate Center New York, NY hspace1in December 17, 2009 December

More information

Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models

Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models Power-Expected-Posterior Priors for Variable Selection in Gaussian Linear Models Ioannis Ntzoufras, Department of Statistics, Athens University of Economics and Business, Athens, Greece; e-mail: ntzoufras@aueb.gr.

More information

Modern Methods of Statistical Learning sf2935 Auxiliary material: Exponential Family of Distributions Timo Koski. Second Quarter 2016

Modern Methods of Statistical Learning sf2935 Auxiliary material: Exponential Family of Distributions Timo Koski. Second Quarter 2016 Auxiliary material: Exponential Family of Distributions Timo Koski Second Quarter 2016 Exponential Families The family of distributions with densities (w.r.t. to a σ-finite measure µ) on X defined by R(θ)

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability

More information

Bayesian Econometrics

Bayesian Econometrics Bayesian Econometrics Christopher A. Sims Princeton University sims@princeton.edu September 20, 2016 Outline I. The difference between Bayesian and non-bayesian inference. II. Confidence sets and confidence

More information

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

More information

Part 2: One-parameter models

Part 2: One-parameter models Part 2: One-parameter models 1 Bernoulli/binomial models Return to iid Y 1,...,Y n Bin(1, ). The sampling model/likelihood is p(y 1,...,y n ) = P y i (1 ) n P y i When combined with a prior p( ), Bayes

More information

On the use of non-local prior densities in Bayesian hypothesis tests

On the use of non-local prior densities in Bayesian hypothesis tests J. R. Statist. Soc. B (2010) 72, Part 2, pp. 143 170 On the use of non-local prior densities in Bayesian hypothesis tests Valen E. Johnson M. D. Anderson Cancer Center, Houston, USA and David Rossell Institute

More information

Noninformative Priors for the Ratio of the Scale Parameters in the Inverted Exponential Distributions

Noninformative Priors for the Ratio of the Scale Parameters in the Inverted Exponential Distributions Communications for Statistical Applications and Methods 03, Vol. 0, No. 5, 387 394 DOI: http://dx.doi.org/0.535/csam.03.0.5.387 Noninformative Priors for the Ratio of the Scale Parameters in the Inverted

More information

Ch. 5 Hypothesis Testing

Ch. 5 Hypothesis Testing Ch. 5 Hypothesis Testing The current framework of hypothesis testing is largely due to the work of Neyman and Pearson in the late 1920s, early 30s, complementing Fisher s work on estimation. As in estimation,

More information

Bayes Factors for Discovery

Bayes Factors for Discovery Glen Cowan RHUL Physics 3 April, 22 Bayes Factors for Discovery The fundamental quantity one should use in the Bayesian framework to quantify the significance of a discovery is the posterior probability

More information

Lecture 10: Generalized likelihood ratio test

Lecture 10: Generalized likelihood ratio test Stat 200: Introduction to Statistical Inference Autumn 2018/19 Lecture 10: Generalized likelihood ratio test Lecturer: Art B. Owen October 25 Disclaimer: These notes have not been subjected to the usual

More information

Divergence Based Priors for Bayesian Hypothesis testing

Divergence Based Priors for Bayesian Hypothesis testing Divergence Based Priors for Bayesian Hypothesis testing M.J. Bayarri University of Valencia G. García-Donato University of Castilla-La Mancha November, 2006 Abstract Maybe the main difficulty for objective

More information

Uncertain Inference and Artificial Intelligence

Uncertain Inference and Artificial Intelligence March 3, 2011 1 Prepared for a Purdue Machine Learning Seminar Acknowledgement Prof. A. P. Dempster for intensive collaborations on the Dempster-Shafer theory. Jianchun Zhang, Ryan Martin, Duncan Ermini

More information

A Bayesian solution for a statistical auditing problem

A Bayesian solution for a statistical auditing problem A Bayesian solution for a statistical auditing problem Glen Meeden School of Statistics University of Minnesota Minneapolis, MN 55455 July 2002 Research supported in part by NSF Grant DMS 9971331 1 SUMMARY

More information

Lecture 2: Basic Concepts of Statistical Decision Theory

Lecture 2: Basic Concepts of Statistical Decision Theory EE378A Statistical Signal Processing Lecture 2-03/31/2016 Lecture 2: Basic Concepts of Statistical Decision Theory Lecturer: Jiantao Jiao, Tsachy Weissman Scribe: John Miller and Aran Nayebi In this lecture

More information