Fisher Information & Efficiency


Robert L. Wolpert
Department of Statistical Science
Duke University, Durham, NC, USA

1 Introduction

Let $f(x\mid\theta)$ be the pdf of $X$ for $\theta\in\Theta$; at times we will also consider a sample $\mathbf{x}=\{X_1,\dots,X_n\}$ of size $n\in\mathbb{N}$ with pdf $f_n(\mathbf{x}\mid\theta)=\prod f(x_i\mid\theta)$. In these notes we'll consider how well we can estimate $\theta$ or, more generally, some function $g(\theta)$, by observing $X$ or $\mathbf{x}$.

Let $\lambda(x\mid\theta)=\log f(x\mid\theta)$ be the natural logarithm of the pdf, and let $\lambda'(x\mid\theta)$, $\lambda''(x\mid\theta)$ be its first and second partial derivatives with respect to $\theta$ (not $x$). In these notes we will only consider regular distributions: those with continuously differentiable (log) pdfs whose support doesn't depend on $\theta$ and which attain a maximum in the interior of $\Theta$ (not at an edge), where $\lambda'(X\mid\theta)$ vanishes. That will include most of the distributions we consider (normal, gamma, Poisson, binomial, exponential, ...) but not the uniform.

The quantity $\lambda'(X\mid\theta)$ is called the Score (sometimes it's called the Score statistic, but really that's a misnomer, since it usually depends on the parameter $\theta$ and statistics aren't allowed to do that). For a random sample $\mathbf{x}$ of size $n$, since the logarithm of a product is the sum of the logarithms, the Score is the sum $\lambda_n'(\mathbf{x}\mid\theta)=(\partial/\partial\theta)\log f_n(\mathbf{x}\mid\theta)=\sum_{i\le n}\lambda'(x_i\mid\theta)$. Usually the MLE $\hat\theta$ is found by solving the equation $\lambda_n'(\mathbf{x}\mid\hat\theta)=0$ for the Score to vanish, but today we'll use it for other things.

For fixed $\theta$, and evaluated at the random variable $X$ (or vector $\mathbf{x}$), the quantity $Z:=\lambda'(X\mid\theta)$ (or $Z_n:=\lambda_n'(\mathbf{x}\mid\theta)$) is a random variable; let's find its mean and variance. First consider a single observation, $n=1$, and a continuously-distributed random variable $X$ (for discrete distributions, just replace the integrals below with sums). Since $1=\int f(x\mid\theta)\,dx$ for any pdf and for every $\theta\in\Theta$, we can take a derivative to find

\[
0=\frac{\partial}{\partial\theta}\int f(x\mid\theta)\,dx
 =\int \frac{\partial}{\partial\theta} f(x\mid\theta)\,dx
 =\int \frac{f'(x\mid\theta)}{f(x\mid\theta)}\,f(x\mid\theta)\,dx
 =\int \lambda'(x\mid\theta)\,f(x\mid\theta)\,dx
 =\mathbb{E}_\theta\,\lambda'(X\mid\theta),
\tag{1}
\]

so the Score always has mean zero. The same reasoning shows that, for random samples, $\mathbb{E}_\theta\,\lambda_n'(\mathbf{x}\mid\theta)=0$.
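As a quick numerical sanity check of this mean-zero property, here is a short Python sketch (an illustration added to these notes, assuming numpy is available; the rate $\theta$ and the simulation size are arbitrary choices). It uses the Exponential score $\lambda'(x\mid\theta)=1/\theta-x$ derived in Section 1.1 below.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0                              # true Exponential rate (arbitrary choice)
x = rng.exponential(scale=1/theta, size=100_000)

score = 1/theta - x                      # lambda'(x|theta) for the Ex(theta) model
print(score.mean())                      # approximately 0: the Score has mean zero
```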

The variance of the Score is denoted

\[
I(\theta)=\mathbb{E}_\theta\big[\lambda'(X\mid\theta)^2\big]
\tag{2}
\]

and is called the Fisher Information function. Differentiating (1) (using the product rule) gives us another way to compute it:

\[
0=\frac{\partial}{\partial\theta}\int \lambda'(x\mid\theta)\,f(x\mid\theta)\,dx
 =\int \lambda''(x\mid\theta)\,f(x\mid\theta)\,dx+\int \lambda'(x\mid\theta)\,f'(x\mid\theta)\,dx
\]
\[
 =\int \lambda''(x\mid\theta)\,f(x\mid\theta)\,dx+\int \lambda'(x\mid\theta)\,\frac{f'(x\mid\theta)}{f(x\mid\theta)}\,f(x\mid\theta)\,dx
 =\mathbb{E}_\theta\big[\lambda''(X\mid\theta)\big]+\mathbb{E}_\theta\big[\lambda'(X\mid\theta)^2\big]
 =\mathbb{E}_\theta\big[\lambda''(X\mid\theta)\big]+I(\theta)
\]

by (2), so

\[
I(\theta)=-\mathbb{E}_\theta\big[\lambda''(X\mid\theta)\big].
\tag{3}
\]

Since $\lambda_n'(\mathbf{x}\mid\theta)=\sum\lambda'(x_i\mid\theta)$ is the sum of $n$ iid RVs, the variance $I_n(\theta)=\mathbb{E}_\theta[\lambda_n'(\mathbf{x}\mid\theta)^2]=nI(\theta)$ for a random sample of size $n$ is just $n$ times the Fisher Information for a single observation.

1.1 Examples

Normal: For the No$(\theta,\sigma^2)$ distribution,
\[
\lambda(x\mid\theta)=-\tfrac12\log(2\pi\sigma^2)-\tfrac{1}{2\sigma^2}(x-\theta)^2,\qquad
\lambda'(x\mid\theta)=\frac{x-\theta}{\sigma^2},\qquad
\lambda''(x\mid\theta)=-\frac{1}{\sigma^2},
\]
\[
I(\theta)=\mathbb{E}_\theta\Big[\frac{(X-\theta)^2}{\sigma^4}\Big]=\frac{1}{\sigma^2},
\]
the precision (inverse variance). The same holds for a random sample of size $n$: $I_n(\theta)=n/\sigma^2$. Thus the variance ($\sigma^2/n$) of $\hat\theta_n=\bar x_n$ is exactly $1/I_n(\theta)$.

Poisson: For the Po$(\theta)$,
\[
\lambda(x\mid\theta)=x\log\theta-\log x!-\theta,\qquad
\lambda'(x\mid\theta)=\frac{x}{\theta}-1,\qquad
\lambda''(x\mid\theta)=-\frac{x}{\theta^2},
\]
\[
I(\theta)=\mathbb{E}_\theta\Big[\frac{(X-\theta)^2}{\theta^2}\Big]=\frac{1}{\theta},
\]
so again $\hat\theta_n=\bar x_n$ has variance $1/I_n(\theta)=\theta/n$.
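The two expressions (2) and (3) for $I(\theta)$ can be checked against each other by simulation. The sketch below (an added illustration, assuming numpy; the Poisson mean is an arbitrary choice) uses $\lambda'(x\mid\theta)=x/\theta-1$ and $\lambda''(x\mid\theta)=-x/\theta^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 3.0                               # true Poisson mean (arbitrary choice)
x = rng.poisson(lam=theta, size=200_000)

score     = x/theta - 1                   # lambda'(x|theta)
curvature = -x/theta**2                   # lambda''(x|theta)

print(np.mean(score**2))                  # definition (2): approx 1/theta
print(-np.mean(curvature))                # identity   (3): approx 1/theta
print(1/theta)                            # exact Fisher Information
```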

Bernoulli: For the Bernoulli distribution Bi$(1,\theta)$,
\[
\lambda(x\mid\theta)=x\log\theta+(1-x)\log(1-\theta),\qquad
\lambda'(x\mid\theta)=\frac{x}{\theta}-\frac{1-x}{1-\theta},\qquad
\lambda''(x\mid\theta)=-\frac{x}{\theta^2}-\frac{1-x}{(1-\theta)^2},
\]
\[
I(\theta)=\mathbb{E}_\theta\Big[\frac{X(1-\theta)^2+(1-X)\theta^2}{\theta^2(1-\theta)^2}\Big]=\frac{1}{\theta(1-\theta)},
\]
so again $\hat\theta_n=\bar x_n$ has variance $1/I_n(\theta)=\theta(1-\theta)/n$.

Exponential: For $X\sim$ Ex$(\theta)$,
\[
\lambda(x\mid\theta)=\log\theta-x\theta,\qquad
\lambda'(x\mid\theta)=\frac{1}{\theta}-x,\qquad
\lambda''(x\mid\theta)=-\frac{1}{\theta^2},
\]
\[
I(\theta)=\mathbb{E}_\theta\Big[\Big(\frac{1-X\theta}{\theta}\Big)^2\Big]=\frac{1}{\theta^2},
\]
so $I_n(\theta)=n/\theta^2$. This time the variance of $\hat\theta_n=1/\bar x_n$, namely $\theta^2 n^2/[(n-1)^2(n-2)]$, is bigger than $1/I_n(\theta)=\theta^2/n$ (you can compute this by noticing that $\bar x_n\sim$ Ga$(n,n\theta)$ has a Gamma distribution, and computing arbitrary moments $\mathbb{E}[Y^p]=\beta^{-p}\Gamma(\alpha+p)/\Gamma(\alpha)$ for any $Y\sim$ Ga$(\alpha,\beta)$, $p>-\alpha$). It turns out that this lower bound always holds.
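To see the Exponential case concretely, the following simulation sketch (an added illustration; the values of $\theta$, $n$, and the number of replications are arbitrary choices) compares the Monte Carlo variance of $\hat\theta_n=1/\bar x_n$ with the exact formula $\theta^2 n^2/[(n-1)^2(n-2)]$ and with the bound $1/I_n(\theta)=\theta^2/n$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 2.0, 10, 200_000            # rate, sample size, replications (arbitrary)

x = rng.exponential(scale=1/theta, size=(reps, n))
theta_hat = 1 / x.mean(axis=1)               # MLE 1/xbar_n in each replication

print(theta_hat.var())                               # Monte Carlo variance of the MLE
print(theta**2 * n**2 / ((n - 1)**2 * (n - 2)))      # exact variance of 1/xbar_n
print(theta**2 / n)                                  # information bound 1/I_n(theta)
```

The first two numbers agree, and both exceed the third, as claimed.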

1.2 The Information Inequality

Let $T(X)$ be any statistic with finite variance, and denote its mean by $m(\theta)=\mathbb{E}_\theta T(X)$. By the Cauchy-Schwarz inequality, the square of the covariance of any two random variables is never more than the product of their variances (this is just another way of saying the correlation is bounded by $\pm1$). We're going to apply that idea to the two random variables $T(X)$ and $\lambda'(X\mid\theta)$ (whose variance is $V_\theta[\lambda'(X\mid\theta)]=I(\theta)$):

\[
V_\theta[T(X)]\;V_\theta[\lambda'(X\mid\theta)]
\ge\Big(\mathbb{E}_\theta\big\{[T(X)-m(\theta)]\,[\lambda'(X\mid\theta)-0]\big\}\Big)^2
=\Big(\int T(x)\,\lambda'(x\mid\theta)\,f(x\mid\theta)\,dx\Big)^2
=\Big(\int T(x)\,f'(x\mid\theta)\,dx\Big)^2
=\big[m'(\theta)\big]^2,
\]

so

\[
V_\theta[T(X)]\ge\frac{[m'(\theta)]^2}{I(\theta)}
\qquad\text{or, for samples of size }n,\qquad
V_\theta[T_n(\mathbf{x})]\ge\frac{[m'(\theta)]^2}{nI(\theta)}.
\]

For vector parameters $\theta\in\Theta\subset\mathbb{R}^d$ the Fisher Information is a matrix

\[
I(\theta)=\mathbb{E}_\theta\big[\nabla\lambda(X\mid\theta)\,\nabla\lambda(X\mid\theta)'\big]
=-\mathbb{E}_\theta\big[\nabla^2\lambda(X\mid\theta)\big],
\]

where $\nabla f(\theta)$ denotes the gradient of a real-valued function $f:\Theta\to\mathbb{R}$, a vector whose components are the partial derivatives $\partial f(\theta)/\partial\theta_i$; where $x'$ denotes the $1\times d$ transpose of the $d\times1$ column vector $x$; and where $\nabla^2 f(\theta)$ denotes the Hessian, the $d\times d$ matrix of mixed second partial derivatives.

If $T(X)$ was intended as an estimator for some function $g(\theta)$, the MSE is

\[
\mathbb{E}_\theta\big\{[T_n(\mathbf{x})-g(\theta)]^2\big\}
=V_\theta[T_n(\mathbf{x})]+[m(\theta)-g(\theta)]^2
\ge\frac{[m'(\theta)]^2}{nI(\theta)}+[m(\theta)-g(\theta)]^2
=\frac{[\beta'(\theta)+g'(\theta)]^2}{nI(\theta)}+\beta(\theta)^2,
\]

where $\beta(\theta)\equiv m(\theta)-g(\theta)$ denotes the bias. In particular, for estimating $g(\theta)=\theta$ itself,

\[
\mathbb{E}_\theta\big\{[T_n(\mathbf{x})-\theta]^2\big\}\ge\frac{[\beta'(\theta)+1]^2}{nI(\theta)}+\beta(\theta)^2
\]

and, for unbiased estimators, the MSE is always at least

\[
\mathbb{E}_\theta\big\{[T_n(\mathbf{x})-\theta]^2\big\}\ge\frac{1}{nI(\theta)}.
\]

These lower bounds were long thought to have been discovered independently in the 1940s by the statisticians Harald Cramér and Calyampudi Rao, and so the result is often called the Cramér-Rao lower bound; but ever since Erich Lehmann brought to everybody's attention its earlier discovery by Maurice Fréchet in the early 1940s, it has also come to be called the Information Inequality.
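To make the biased form of the bound concrete, here is a small sketch (an added illustration; the shrinkage estimator $T_n=c\,\bar x_n$ of a Normal mean with known $\sigma^2$, and all numerical values, are arbitrary choices). Its bias is $\beta(\theta)=(c-1)\theta$, so the bound reads MSE $\ge c^2\sigma^2/n+(c-1)^2\theta^2$:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n, c, reps = 1.5, 2.0, 20, 0.8, 200_000   # arbitrary choices

x = rng.normal(loc=theta, scale=sigma, size=(reps, n))
T = c * x.mean(axis=1)                        # biased shrinkage estimator c * xbar_n

mse   = np.mean((T - theta)**2)                          # Monte Carlo MSE
bound = c**2 * sigma**2 / n + (c - 1)**2 * theta**2      # [beta'+1]^2/(n I) + beta^2
print(mse, bound)
```

Here the two numbers agree: $c\,\bar x_n$ is an affine function of the Score, which is exactly the sharpness criterion discussed next.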

We saw in the examples that the bound is exactly met by the MLEs for the mean in the normal and Poisson examples, but the inequality is strict for the MLE of the rate parameter in an exponential (or gamma) distribution. It turns out there is a simple criterion for when the bound will be sharp, i.e., for when an estimator will exactly attain this lower bound. The bound arose from the inequality $\rho^2\le1$ for the correlation $\rho$ of $T(X)$ and $\lambda'(X\mid\theta)$; this inequality will be an equality precisely when $\rho=\pm1$, i.e., when $T(X)$ can be written as an affine function $T(X)=u(\theta)\,\lambda'(X\mid\theta)+v(\theta)$ of the Score. The coefficients $u(\theta)$ and $v(\theta)$ can depend on $\theta$, but not on $X$; any $\theta$-dependence has to cancel so that $T(X)$ won't depend on $\theta$ (because it is a statistic). In the Normal and Poisson examples the statistic $T$ was $X$, which could indeed be written as an affine function of $\lambda'(X\mid\theta)=(X-\theta)/\sigma^2$ for the Normal or $\lambda'(X\mid\theta)=(X-\theta)/\theta$ for the Poisson, while in the Exponential case $1/X$ cannot be written as an affine function of $\lambda'(X\mid\theta)=(1/\theta-X)$.

1.2.1 Multivariate Case

A multivariate version of the Information Inequality exists as well. If $\Theta\subset\mathbb{R}^k$ for some $k\in\mathbb{N}$, and if $T:\mathcal{X}\to\mathbb{R}^n$ is an $n$-dimensional statistic for some $n\in\mathbb{N}$ for data $X\sim f(x\mid\theta)$ taking values in a space $\mathcal{X}$ of arbitrary dimension, define the mean function $m:\mathbb{R}^k\to\mathbb{R}^n$ by $m(\theta):=\mathbb{E}_\theta T(X)$ and its $n\times k$ Jacobian matrix by $J_{ij}:=\partial m_i(\theta)/\partial\theta_j$. Then the multivariate Information Inequality asserts that

\[
\mathrm{Cov}_\theta[T(X)]\succeq J\,I(\theta)^{-1}J',
\]

where $I(\theta):=\mathrm{Cov}_\theta[\nabla_\theta\log f(X\mid\theta)]$ is the Fisher Information matrix, where the notation $A\succeq B$ for $n\times n$ matrices $A,B$ means that $[A-B]$ is positive semi-definite, and where $C'$ denotes the $k\times n$ transpose of an $n\times k$ matrix $C$. This gives lower bounds on the variance of $z'T(X)$ for all vectors $z\in\mathbb{R}^n$ and, in particular, lower bounds for the variance of the components $T_i(X)$.

Examples

Normal Mean & Variance: If both the mean $\mu$ and precision $\tau=1/\sigma^2$ are unknown for iid normal variates $X_i\sim$ No$(\mu,1/\tau)$, the Fisher Information for $\theta=(\mu,\tau)$ is

\[
I(\theta)=-\mathbb{E}\begin{bmatrix}\partial^2_{\mu}l&\partial^2_{\mu\tau}l\\[2pt] \partial^2_{\tau\mu}l&\partial^2_{\tau}l\end{bmatrix}
=-\mathbb{E}\begin{bmatrix}-\tau&(X-\mu)\\[2pt] (X-\mu)&-1/2\tau^2\end{bmatrix}
=\begin{bmatrix}\tau&0\\[2pt] 0&1/2\tau^2\end{bmatrix},
\]

where $l(\theta)=\tfrac12[\log(\tau/2\pi)-\tau(X-\mu)^2]$ is the log likelihood for a single observation.

Multivariate Normal Mean: If the mean vector $\mu\in\mathbb{R}^k$ is unknown but the covariance matrix $\Sigma_{ij}=\mathbb{E}(X_i-\mu_i)(X_j-\mu_j)$ is known for a multivariate normal RV $X\sim$ No$(\mu,\Sigma)$, the $(k\times k)$ Fisher Information matrix for $\mu$ is

\[
I_{ij}(\mu)=-\mathbb{E}\,\frac{\partial^2}{\partial\mu_i\,\partial\mu_j}\Big\{-\tfrac12\log\big|2\pi\Sigma\big|-\tfrac12(X-\mu)'\Sigma^{-1}(X-\mu)\Big\}=\big(\Sigma^{-1}\big)_{ij},
\]

so (as in the one-dimensional case) the Fisher Information is just the precision (now a matrix).

Gamma: If both the shape $\alpha$ and rate $\lambda$ are unknown for iid gamma variates $X_i\sim$ Ga$(\alpha,\lambda)$, the Fisher Information for $\theta=(\alpha,\lambda)$ is

\[
I(\theta)=-\mathbb{E}\begin{bmatrix}\partial^2_{\alpha}l&\partial^2_{\alpha\lambda}l\\[2pt] \partial^2_{\lambda\alpha}l&\partial^2_{\lambda}l\end{bmatrix}
=-\mathbb{E}\begin{bmatrix}-\psi'(\alpha)&\lambda^{-1}\\[2pt] \lambda^{-1}&-\alpha\lambda^{-2}\end{bmatrix}
=\begin{bmatrix}\psi'(\alpha)&-\lambda^{-1}\\[2pt] -\lambda^{-1}&\alpha\lambda^{-2}\end{bmatrix},
\]

where $l(\theta)=\alpha\log\lambda+(\alpha-1)\log X-\lambda X-\log\Gamma(\alpha)$ is the log likelihood for a single observation, and where $\psi'(z):=[\log\Gamma(z)]''$ is the trigamma function (Abramowitz and Stegun, 1964, 6.4), the derivative of the digamma function $\psi(z):=[\log\Gamma(z)]'$ (Abramowitz and Stegun, 1964, 6.3).
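As a numerical illustration of the Gamma case (an addition to the notes, using scipy's digamma and polygamma functions for $\psi$ and $\psi'$; the shape and rate values are arbitrary), the sketch below compares the closed-form information matrix with the Monte Carlo covariance of the score vector $(\partial l/\partial\alpha,\,\partial l/\partial\lambda)$:

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(4)
alpha, lam, reps = 3.0, 2.0, 400_000          # arbitrary shape and rate

x = rng.gamma(shape=alpha, scale=1/lam, size=reps)

s_alpha = np.log(lam) + np.log(x) - digamma(alpha)   # d l / d alpha
s_lam   = alpha/lam - x                              # d l / d lambda

I_mc    = np.cov(np.vstack([s_alpha, s_lam]))        # Monte Carlo Cov of the score vector
I_exact = np.array([[polygamma(1, alpha), -1/lam],
                    [-1/lam,              alpha/lam**2]])
print(I_mc)
print(I_exact)
```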

1.3 Regularity

The Information Inequality requires some regularity conditions for it to apply:

- The Fisher Information exists; equivalently, the log likelihood $l(\theta):=\log f(x\mid\theta)$ has partial derivatives with respect to $\theta$ for every $x\in\mathcal{X}$ and $\theta\in\Theta$, and they are in $L^2\big(\mathcal{X},f(x\mid\theta)\,dx\big)$ for every $\theta\in\Theta$.

- Integration (wrt $x$) and differentiation (wrt $\theta$) commute in the expression
\[
\frac{\partial}{\partial\theta}\int T(x)\,f(x\mid\theta)\,dx=\int T(x)\,\frac{\partial}{\partial\theta}f(x\mid\theta)\,dx.
\]

The latter condition will hold whenever the support $\{x:f(x\mid\theta)>0\}$ doesn't depend on $\theta$ and $\log f(x\mid\theta)$ has two continuous derivatives wrt $\theta$ everywhere. These conditions both fail for the case of $\{X_i\}\overset{iid}{\sim}$ Un$(0,\theta)$, the uniform distribution on an interval $[0,\theta]$, and so does the conclusion of the Information Inequality. In this case the MLE is $\hat\theta_n=X_{(n)}:=\max\{X_i:1\le i\le n\}$, the sample maximum, whose mean squared error

\[
\mathbb{E}_\theta\big|\hat\theta_n-\theta\big|^2=\frac{2\theta^2}{(n+1)(n+2)}
\]

tends to zero at rate $n^{-2}$ as $n\to\infty$, while the Information Inequality bounds that rate below by a multiple of $n^{-1}$ for problems satisfying the Regularity Conditions.

2 Efficiency

An estimator $\delta(X)$ of $g(\theta)$ is called efficient if it satisfies the Information Inequality exactly; otherwise its (absolute) efficiency is defined to be

\[
\mathrm{Eff}(\delta)=\frac{\dfrac{[\beta'(\theta)+g'(\theta)]^2}{I(\theta)}+\beta(\theta)^2}{\mathbb{E}_\theta\{[\delta(X)-g(\theta)]^2\}}
\]

or, if the bias $\beta(\theta)\equiv[\mathbb{E}_\theta\delta(X)-g(\theta)]$ vanishes,

\[
=\frac{[g'(\theta)]^2}{I(\theta)\,V_\theta[\delta(X)]}.
\]

It is asymptotically efficient if the efficiency for a sample of size $n$ converges to one as $n\to\infty$. This happens for the MLE $\hat\theta_n=1/\bar x_n$ of the Exponential rate above, for example, whose bias is $\beta_n(\theta)=\mathbb{E}[1/\bar x_n-\theta]=\dfrac{\theta}{n-1}$ and whose absolute efficiency is

\[
\mathrm{Eff}(\hat\theta_n)
=\frac{\Big[\dfrac{1}{n-1}+1\Big]^2\dfrac{\theta^2}{n}+\dfrac{\theta^2}{(n-1)^2}}
      {\dfrac{\theta^2 n^2}{(n-1)^2(n-2)}+\dfrac{\theta^2}{(n-1)^2}}
=\frac{(n+1)(n-2)}{(n+2)(n-1)}.
\]

This increases to one as $n\to\infty$, so $\hat\theta_n$ is asymptotically efficient.
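Evaluating the exact efficiency formula above for a few sample sizes (a small added illustration) shows the convergence to one:

```python
# Absolute efficiency of the Exponential-rate MLE 1/xbar_n, from the formula above
def eff(n):
    return (n + 1) * (n - 2) / ((n + 2) * (n - 1))

for n in (5, 10, 50, 500):
    print(n, eff(n))      # 0.643, 0.815, 0.961, 0.996: increasing toward 1
```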

3 Asymptotic Relative Efficiency: ARE

If one estimator $\delta_1$ of a quantity $g(\theta)$ has MSE $\mathbb{E}_\theta|\delta_1(\mathbf{x})-g(\theta)|^2\approx c_1/n$ for large $n$, while another $\delta_2$ has MSE $\mathbb{E}_\theta|\delta_2(\mathbf{x})-g(\theta)|^2\approx c_2/n$ with $c_1<c_2$, then the first will need a smaller sample size $n_1=(c_1/c_2)\,n$ to achieve the same MSE as the second would achieve with a sample of size $n$. The ratio $(c_2/c_1)$ is called the asymptotic relative efficiency (or ARE) of $\delta_1$ wrt $\delta_2$. For example, if $c_2=2c_1$, then $\delta_1$ needs $c_1/c_2=0.5$ times the sample size, and is $(c_2/c_1)=2$ times more efficient than $\delta_2$.

We won't go into it much in this course, but it's interesting to know that for estimating the mean $\theta$ of the normal distribution No$(\theta,\sigma^2)$ the sample mean $\hat\theta=\bar x_n$ achieves the Information Inequality lower bound of $\mathbb{E}_\theta|\bar x_n-\theta|^2=\sigma^2/n=1/I_n(\theta)$, while the sample median $M(\mathbf{x})$ has approximate MSE$^1$ $\mathbb{E}_\theta|M(\mathbf{x})-\theta|^2\approx\pi\sigma^2/2n$, so the ARE of the median to the mean is

\[
\frac{\sigma^2/n}{\pi\sigma^2/2(n+2)}\approx\frac{2}{\pi}\approx0.6366,
\]

so the sample median will require a sample size about $\pi/2\approx1.57$ times larger than the sample mean would need for the same MSE. The median does offer a significant advantage over the sample mean, however: it is robust against model misspecification, e.g., it is relatively unaffected by a few percent (or even up to half) of errors or contamination in the data. Ask me if you'd like to know more about that.

$^1$ You can show this by considering $\Phi\big(M(\frac{\mathbf{x}-\theta}{\sigma})\big)$, which is the median of $n$ iid uniform random variables and so has exactly a Be$(\alpha,\beta)$ distribution with $\alpha=\beta=\frac{n+1}{2}$ for odd $n$, hence variance $\frac{1}{4(n+2)}$. Then use the Delta method to relate the variance of $M(\frac{\mathbf{x}-\theta}{\sigma})$ to that of $\Phi\big(M(\frac{\mathbf{x}-\theta}{\sigma})\big)$: $V\big[\Phi\big(M(\frac{\mathbf{x}-\theta}{\sigma})\big)\big]\approx[\Phi'(0)]^2\,V\big[M(\frac{\mathbf{x}-\theta}{\sigma})\big]=\frac{1}{2\pi\sigma^2}V[M(\mathbf{x})]$, so $\frac{1}{4(n+2)}\approx\frac{1}{2\pi\sigma^2}V[M(\mathbf{x})]$, or $V[M(\mathbf{x})]\approx\frac{\pi\sigma^2}{2(n+2)}$.
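A short simulation (an added illustration; the odd sample size and number of replications are arbitrary choices) bears out the $\approx\pi/2$ factor for a Normal sample:

```python
import numpy as np

rng = np.random.default_rng(5)
theta, sigma, n, reps = 0.0, 1.0, 101, 50_000     # arbitrary choices, n odd

x = rng.normal(loc=theta, scale=sigma, size=(reps, n))
mse_mean   = np.mean((x.mean(axis=1)       - theta)**2)   # about sigma^2 / n
mse_median = np.mean((np.median(x, axis=1) - theta)**2)   # about pi sigma^2 / (2(n+2))

print(mse_median / mse_mean)      # roughly pi/2, about 1.57
print(np.pi / 2)
```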

4 Change of Variables for Fisher Information

The Fisher Information function $\theta\mapsto I(\theta)$ depends on the particular way the model is parametrized. For example, the Bernoulli distribution can be parametrized in terms of the success probability $p$, the log odds $\eta=\log\frac{p}{1-p}$, or in angular form with $\theta=\arcsin(\sqrt p)$. The Fisher Information functions for these various choices will differ, but in a very specific way. Consider a statistical model that can be parametrized in either of two ways, $X\sim f(x\mid\theta)=g(x\mid\eta)$, with $\theta=\phi(\eta)$ and $\eta=\psi(\theta)$ for one-dimensional parameters $\theta$ and $\eta$ related by invertible, differentiable 1:1 transformations. Then the Fisher Information functions $I^\theta$ and $I^\eta$ in these two parametrizations are related by

\[
I^\theta(\theta)=\mathbb{E}\Big\{\Big(\frac{\partial}{\partial\theta}\log f(X\mid\theta)\Big)^2\Big\}
=\mathbb{E}\Big\{\Big(\frac{\partial}{\partial\theta}\log g(X\mid\psi(\theta))\Big)^2\Big\}
=\mathbb{E}\Big\{\Big(\frac{\partial}{\partial\eta}\log g(X\mid\eta)\,\frac{\partial\eta}{\partial\theta}\Big)^2\Big\}
=\mathbb{E}\Big\{\Big(\frac{\partial}{\partial\eta}\log g(X\mid\eta)\Big)^2\Big\}\Big(\frac{\partial\eta}{\partial\theta}\Big)^2
=I^\eta(\eta)\,\big(J_{\eta\mid\theta}\big)^2,
\tag{4}
\]

the Fisher Information $I^\eta(\eta)$ in the $\eta$ parametrization times the square of the Jacobian $J_{\eta\mid\theta}:=\partial\eta/\partial\theta$ for changing variables. For example, the Fisher Information for a Bernoulli random variable in the usual success-probability parametrization was shown in Section 1.1 to be $I^p(p)=1/p(1-p)$. In the logistic parametrization it would be

\[
I^\eta(\eta)=\frac{e^\eta}{(1+e^\eta)^2}=I^p\Big(\frac{e^\eta}{1+e^\eta}\Big)\big(J_{p\mid\eta}\big)^2,
\]

for Jacobian $J_{p\mid\eta}=\partial p/\partial\eta=e^\eta/(1+e^\eta)^2$ and inverse transformation $p=e^\eta/(1+e^\eta)$, while in the arcsine parametrization the Fisher Information is

\[
I^\theta(\theta)=I^p\big(\sin^2\theta\big)\big(J_{p\mid\theta}\big)^2=4,
\]

a constant, for Jacobian $J_{p\mid\theta}=\partial p/\partial\theta=2\sin\theta\cos\theta$ and inverse transformation $p=\sin^2\theta$.
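Relation (4) for the Bernoulli parametrizations can be checked directly; the following sketch (an added illustration with an arbitrary value of $p$) confirms that $I^p(p)$ times the squared Jacobian reproduces the logistic and arcsine informations:

```python
import numpy as np

p = 0.3                                   # arbitrary success probability
I_p = 1 / (p * (1 - p))                   # Fisher Information in the p parametrization

# Logistic parametrization: eta = log(p/(1-p)), with dp/deta = e^eta/(1+e^eta)^2 = p(1-p)
eta = np.log(p / (1 - p))
J_p_eta = p * (1 - p)
print(I_p * J_p_eta**2, np.exp(eta) / (1 + np.exp(eta))**2)   # both equal p(1-p)

# Arcsine parametrization: theta = arcsin(sqrt(p)), with dp/dtheta = 2 sin(theta) cos(theta)
th = np.arcsin(np.sqrt(p))
J_p_th = 2 * np.sin(th) * np.cos(th)
print(I_p * J_p_th**2)                    # equals 4 for every p
```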

4.1 Jeffreys Rule Prior

From (4) it follows that the unnormalized prior distributions $\pi_J^\theta(\theta)=\big(I^\theta(\theta)\big)^{1/2}$ and $\pi_J^\eta(\eta)=\big(I^\eta(\eta)\big)^{1/2}$ are related by $\pi_J^\theta(\theta)=\pi_J^\eta(\eta)\,|\partial\eta/\partial\theta|$, exactly the way they should be related for the change of variables $\eta\to\theta$. It was this invariance property that led Harold Jeffreys (1946) to propose $\pi_J(\theta)$ as a default objective choice of prior distribution. Much later José Bernardo (1979) showed that this is also the Reference prior that maximizes the entropic distance from the prior to the posterior, i.e., the information to be learned from an experiment; see (Berger et al., 2009, 2015) for a more recent and broader view. In estimation problems with one-dimensional parameters a Bayesian analysis using $\pi_J(\theta)\propto I(\theta)^{1/2}$ is widely regarded as a suitable objective Bayesian approach.

4.2 Examples using Jeffreys Rule Prior

Normal: For the No$(\theta,\sigma^2)$ distribution with known $\sigma^2$, the Fisher Information is $I(\theta)=1/\sigma^2$, a constant (for known $\sigma^2$), so $\pi_J(\theta)\propto1$ is the improper uniform distribution on $\mathbb{R}$ and the posterior distribution for a sample of size $n$ is $\pi_J(\theta\mid\mathbf{x})=$ No$(\bar X_n,\sigma^2/n)$, with posterior mean $\hat\theta_J=\bar X_n$, the same as the MLE.

Poisson: For the Po$(\theta)$, the Fisher Information is $I(\theta)=1/\theta$, so $\pi_J(\theta)\propto\theta^{-1/2}$ is the improper conjugate Ga$(1/2,0)$ distribution and the posterior for a sample $\mathbf{x}$ of size $n$ is $\pi_J(\theta\mid\mathbf{x})=$ Ga$\big(\tfrac12+\sum X_i,\,n\big)$, with posterior mean $\hat\theta_J=\bar X_n+1/2n$, asymptotically the same as the MLE.

Bernoulli: For the Bi$(1,\theta)$, the Fisher Information is $I(\theta)=1/\theta(1-\theta)$, so $\pi_J(\theta)\propto\theta^{-1/2}(1-\theta)^{-1/2}$ is the conjugate Be$(1/2,1/2)$ distribution and the posterior for a sample $\mathbf{x}$ of size $n$ is $\pi_J(\theta\mid\mathbf{x})=$ Be$\big(\tfrac12+\sum X_i,\,\tfrac12+n-\sum X_i\big)$, with posterior mean $\hat\theta_J=\big(\sum X_i+\tfrac12\big)/(n+1)$. In the logistic parametrization the Jeffreys Rule prior is $\pi_J(\eta)\propto1/(e^{\eta/2}+e^{-\eta/2})\propto\operatorname{sech}(\eta/2)$, the hyperbolic secant, while the Jeffreys Rule prior is uniform, $\pi_J(\theta)\propto1$, in the arcsine parametrization; but each of these induces the Be$(\tfrac12,\tfrac12)$ for $p$ under a change of variables.

Exponential: For the Ex$(\theta)$, the Fisher Information is $I(\theta)=1/\theta^2$, so the Jeffreys Rule prior is the scale-invariant improper $\pi_J(\theta)\propto1/\theta$ on $\mathbb{R}_+$, and the posterior density for a sample $\mathbf{x}$ of size $n$ is $\pi_J(\theta\mid\mathbf{x})=$ Ga$\big(n,\sum X_i\big)$, with posterior mean $\hat\theta_J=1/\bar X_n$ equal to the MLE.
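For instance, a minimal sketch of the Bernoulli case (an added illustration using scipy.stats; the sample size $n=20$ and success count $s=7$ are hypothetical numbers): with the Jeffreys Be$(\tfrac12,\tfrac12)$ prior, the posterior is Be$(\tfrac12+s,\,\tfrac12+n-s)$.

```python
from scipy.stats import beta

n, s = 20, 7                          # hypothetical sample size and number of successes
post = beta(0.5 + s, 0.5 + n - s)     # Jeffreys-rule posterior for theta

print(post.mean())                    # posterior mean (s + 1/2)/(n + 1) = 7.5/21
print(post.interval(0.95))            # equal-tailed 95% posterior credible interval
```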

4.3 Multidimensional Parameters

When $\theta$ and $\eta$ are $d$-dimensional a similar change-of-variables expression holds, but now $I^\theta$, $I^\eta$ and the Jacobian $J_{\eta\mid\theta}$ are all $d\times d$ matrices:

\[
I^\theta_{ij}(\theta)=\mathbb{E}\Big\{\Big(\frac{\partial}{\partial\theta_i}\log f(X\mid\theta)\Big)\Big(\frac{\partial}{\partial\theta_j}\log f(X\mid\theta)\Big)\Big\}
=\mathbb{E}\Big\{\Big(\frac{\partial}{\partial\theta_i}\log g(X\mid\psi(\theta))\Big)\Big(\frac{\partial}{\partial\theta_j}\log g(X\mid\psi(\theta))\Big)\Big\}
\]
\[
=\mathbb{E}\Big\{\Big(\sum_k\frac{\partial}{\partial\eta_k}\log g(X\mid\eta)\,\frac{\partial\eta_k}{\partial\theta_i}\Big)\Big(\sum_l\frac{\partial}{\partial\eta_l}\log g(X\mid\eta)\,\frac{\partial\eta_l}{\partial\theta_j}\Big)\Big\},
\]
so
\[
I^\theta(\theta)=J_{\eta\mid\theta}'\,I^\eta(\eta)\,J_{\eta\mid\theta}.
\]

Jeffreys noted that now the prior distribution $\pi_J(\theta)\propto\big(\det I(\theta)\big)^{1/2}$, proportional to the square root of the determinant of the Fisher Information matrix, is invariant under changes of variables; but both he and later authors noted that $\pi_J(\theta)$ has some undesirable features in $d\ge2$ dimensions, including in the important example of No$(\mu,\sigma^2)$ with both mean and variance unknown. Reference priors (Berger and Bernardo, 1992a,b) are currently considered the best choice for objective Bayesian analysis in multi-parameter problems, but they are challenging to compute and to work with. The tensor product of independent one-dimensional Jeffreys priors $\pi_J(\theta_1)\,\pi_J(\theta_2)\cdots\pi_J(\theta_d)$ is frequently recommended as an acceptable alternative.

References

Abramowitz, M. and Stegun, I. A., eds. (1964), Handbook of Mathematical Functions With Formulas, Graphs, and Mathematical Tables, Applied Mathematics Series, volume 55, Washington, D.C.: National Bureau of Standards; reprinted in paperback by Dover (1974); on-line at http://www.math.sfu.ca/~cbm/aands/.

Berger, J. O. and Bernardo, J. M. (1992a), "On the development of the Reference Prior method," in Bayesian Statistics 4, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, Oxford, UK: Oxford University Press, pp. 35-49.

Berger, J. O. and Bernardo, J. M. (1992b), "Ordered group reference priors with application to a multinomial problem," Biometrika, 79, 25-37.

Berger, J. O., Bernardo, J. M., and Sun, D. (2009), "The Formal Definition of Reference Priors," Annals of Statistics, 37, 905-938.

Berger, J. O., Bernardo, J. M., and Sun, D. (2015), "Overall Objective Priors," Bayesian Analysis, 10, 189-221, doi:10.1214/14-BA915.

Bernardo, J. M. (1979), "Reference posterior distributions for Bayesian inference (with discussion)," Journal of the Royal Statistical Society, Ser. B: Statistical Methodology, 41, 113-147.

Jeffreys, H. (1946), "An Invariant Form for the Prior Probability in Estimation Problems," Proceedings of the Royal Society of London, Series A, Mathematical and Physical Sciences, 186, 453-461, doi:10.1098/rspa.1946.0056.

Last edited: February 11, 2016