Frequentist_Bayesian_Example

An example to illustrate frequentist and Bayesian approaches

This is a trivial example that illustrates the fundamentally different points of view of the frequentist and Bayesian approaches.

Consider a data set of $N$ measurements: $\{x_i,\ i = 1, \ldots, N\}$. Assume that the $x_i$ are each drawn from a random variable $X$ that is independently and identically distributed (i.i.d.) with some probability density function $f_X(x)$. The sample mean is

$$\overline{x} = \frac{1}{N} \sum_{i=1}^{N} x_i .$$

This might be something as simple as the average of the heights of everyone in the room. Often we aren't really interested in the average of the particular entries, but rather what it tells us about some "true" mean of a large population of people. Or we might want to compare heights of one sample with another. E.g. are astronomers taller on average than, say, engineers? (Hypothesis testing.)

Frequentist view:

The sample mean $\overline{x}$ is considered to be the outcome of a particular realization of the data values. If we had a different set of $N$ people we would get a different $\overline{x}$, but one that is statistically similar. The underlying notion is that there is an infinite ensemble of realizations, and if we repeated an experiment = "obtain $N$ height values" enough times (possibly infinitely many), we would learn about the ensemble average. The name 'frequentist' is given because the frequency of occurrence of $\overline{x}$ among realizations is an estimate of the ensemble average (with caveats).

The ensemble is described by the probability density function (PDF) $f_X(x)$ that is normalized to unity:

$$\int dx\, f_X(x) = 1 .$$

The cumulative distribution function (CDF) is the integral

$$F_X(x) = \int_{-\infty}^{x} dx'\, f_X(x')$$

and ranges between 0 and 1.

The ensemble average of any of the $x_i$ is

$$\langle x \rangle = \int dx\, x\, f_X(x) .$$

In the following we will also need the second moment

$$\langle x^2 \rangle = \int dx\, x^2\, f_X(x) ,$$

from which we have the variance

$$\mathrm{Var}(x) = \langle x^2 \rangle - \langle x \rangle^2 .$$

How does the sample mean relate to the ensemble average?
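Before working that out analytically, the moment definitions above can be checked with a short NumPy sketch, and the question just posed previewed numerically. The variable names and the choice of a Gaussian with true mean 1.2 and $\sigma = 1$ are ours (they match the numbers used in the code cells later in this notebook), not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma = 1.2, 1.0

# Estimate the ensemble moments from many i.i.d. draws
x = rng.normal(true_mean, sigma, size=200_000)
mean = x.mean()                       # estimate of <x>
second_moment = (x ** 2).mean()       # estimate of <x^2>
variance = second_moment - mean ** 2  # Var(x) = <x^2> - <x>^2

# An "ensemble" of sample means, each computed from N = 10 measurements:
# their average tracks <x>, and their scatter shrinks with N
N = 10
xbars = rng.normal(true_mean, sigma, size=(50_000, N)).mean(axis=1)
print(mean, variance)
print(xbars.mean(), xbars.std())
```

The last line hints at the result derived next: the sample means cluster around the ensemble average with a spread much smaller than $\sigma$.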
Using those definitions we can calculate the ensemble mean and variance of the sample mean $\overline{x}$. For the mean (you can show this),

$$\langle \overline{x} \rangle = \frac{1}{N} \sum_{i=1}^{N} \langle x_i \rangle = \langle x \rangle .$$

We also want to know the variance of $\overline{x}$. Writing the variance of an individual data point as

$$\mathrm{Var}(x) \equiv \sigma_x^2 = \int dx\, (x - \langle x \rangle)^2 f_X(x) ,$$

it can be shown that

$$\mathrm{Var}(\overline{x}) \equiv \sigma_{\overline{x}}^2 = \frac{\sigma_x^2}{N} .$$

What this says is that the standard deviation of the sample mean is $1/\sqrt{N}$ smaller than the standard deviation of an individual data point. We have rediscovered the ubiquitous "$1/\sqrt{N}$" law!

Bayesian view:

A Bayesian says, basically, you only have one data set (the particular $N$ data points) so live with it and figure out what your knowledge is about the ensemble average. That is, the Bayesian approach deals with probability as a statement of what you know about a parameter, not as a frequency of occurrence in repeated experiments. This may seem like a subtle difference. But in practice, while the frequentist approach provides point estimates like $\overline{x}$, the Bayesian approach gives a PDF for the "true mean." This same difference applies to much more complicated situations of model fitting and hypothesis testing.

Thus, now we use the sample mean $\overline{x}$ to infer knowledge about the true mean $\mu$.

The fundamental equation for Bayesian inference is based on (not surprisingly) Bayes' theorem, which can be derived from conditional probabilities. Consider events A and B. The conditional probability that B occurs given that A occurs is

$$P(B|A) = \frac{P(A \cap B)}{P(A)} .$$

We also have (inverting A and B)

$$P(A|B) = \frac{P(A \cap B)}{P(B)} ,$$

which means

$$P(B|A)\,P(A) = P(A|B)\,P(B)$$

or

$$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)} .$$

OK, now let's get back to the sample mean and the true mean. We want to know $\mu$ given that we have data that give us $\overline{x}$. So we make the following assignments:

$$B = \mu, \qquad A = \text{'data'} = \overline{x} .$$
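The conditional-probability identities above can be verified on a toy discrete example. The joint distribution below is arbitrary (the numbers are ours, chosen only to sum to 1); the point is that Bayes' theorem holds mechanically once the marginals are formed:

```python
import numpy as np

# Toy joint distribution over two-valued events A (rows) and B (columns)
P_AB = np.array([[0.10, 0.25],
                 [0.30, 0.35]])

P_A = P_AB.sum(axis=1)  # marginal P(A)
P_B = P_AB.sum(axis=0)  # marginal P(B)

# Conditional probabilities from the definitions in the text
P_B_given_A = P_AB / P_A[:, None]   # P(B|A) = P(A and B) / P(A)
P_A_given_B = P_AB / P_B[None, :]   # P(A|B) = P(A and B) / P(B)

# Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
bayes_rhs = P_A_given_B * P_B[None, :] / P_A[:, None]
print(np.allclose(P_B_given_A, bayes_rhs))
```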
So now we have

$$P(\mu|\overline{x}) = \frac{P(\mu)\,P(\overline{x}|\mu)}{P(\overline{x})} .$$

To be useful we need to say what we mean by the various probabilities:

The left-hand side is the posterior probability of $\mu$ given the data.

$P(\mu)$ is the prior probability of $\mu$ (i.e. what we knew about it before acquiring any data; maybe we knew nothing, or maybe we know that peoples' heights are bracketed and so there are constraints on $\mu$ that can be made).

$P(\overline{x}|\mu)$ is the probability of having gotten the data given some value of the parameter $\mu$. It is an assumption in the setup that all the data derive from the same PDF with true mean $\mu$.

The denominator of the right-hand side is the probability of the data given all possible values of $\mu$. That's a bit confusing. What we really need is for the probabilities to all fall between 0 and 1, so the denominator in this context is really a normalization.

Now we extend our point of view so that the entries in Bayes' theorem can be probability density functions. I prefer to use notation like $f_X(x)$ or $f(\mu|\overline{x})$ for PDFs, but the Bayesian literature typically uses $P$ regardless of whether it means a probability or a probability density function. So for the problem at hand we could use PDFs as

$$f(\mu|\overline{x}) = \frac{f(\mu)\,f(\overline{x}|\mu)}{\int d\mu\, f(\mu)\,f(\overline{x}|\mu)} .$$

Usually the PDF of the data on the right-hand side is called the likelihood function, so we will rename it as $\mathcal{L}(\mu) = f(\overline{x}|\mu)$, giving

$$f(\mu|\overline{x}) = \frac{f(\mu)\,\mathcal{L}(\mu)}{\int d\mu\, f(\mu)\,\mathcal{L}(\mu)} .$$

Thus we have a plausible expression that says the posterior PDF of the true mean is given by its PDF prior to acquiring data multiplied by the likelihood function, which includes data that presumably (hopefully!) increase our knowledge about $\mu$.

A particular case is where each data point is distributed with a Gaussian PDF

$$f_X(x) = (2\pi\sigma^2)^{-1/2}\, e^{-(x-\mu)^2/2\sigma^2} .$$

How do we construct the likelihood function? With $N$ independent data points the likelihood function is given by the product (since the data points are independent by assumption)
$$\mathcal{L}(\mu) = \prod_{i=1}^{N} f(x_i|\mu) = \prod_{i=1}^{N} (2\pi\sigma^2)^{-1/2}\, e^{-(x_i-\mu)^2/2\sigma^2} = (2\pi\sigma^2)^{-N/2}\, e^{-\sum_{i=1}^{N}(x_i-\mu)^2/2\sigma^2} .$$

We can manipulate the exponent of the last expression by using

$$x_i - \mu = (x_i - \overline{x}) + (\overline{x} - \mu) ,$$

which then gives the likelihood as

$$\mathcal{L}(\mu) = (2\pi\sigma^2)^{-N/2}\, e^{-N\hat{\sigma}^2/2\sigma^2}\, e^{-N(\overline{x}-\mu)^2/2\sigma^2} ,$$

where the sample variance is

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \overline{x})^2 .$$

Only the factor involving $\mu$ matters, because normalization of the right-hand side of the posterior PDF causes the other factors to cancel. We then have

$$f(\mu|\overline{x}) \propto f(\mu)\, e^{-N(\mu-\overline{x})^2/2\sigma^2} .$$

Note that the data appear in the likelihood function only via $\overline{x}$. Note also that the posterior PDF gives us a functional form for the PDF of $\mu$, as opposed to the point estimate $\overline{x}$ from the frequentist approach.

We can obtain a point estimate from the posterior PDF by finding the maximum of $f(\mu|\overline{x})$. For a flat prior, where we really don't know anything about $\mu$ before acquiring data, the maximum of the posterior PDF is at $\mu = \overline{x}$. But the posterior also tells us that there is uncertainty about $\mu$, determined by $\sigma$ and the number of data points: $\sigma_\mu = \sigma/\sqrt{N}$.

The maximum likelihood estimate for $\mu$ is simply the value where $\mathcal{L}(\mu)$ is maximized. This is just $\overline{x}$.

In [71]:

```python
%matplotlib inline
from numpy import *
import scipy
import matplotlib
import matplotlib.pyplot as plt
import astropy
from scipy import constants as spconstants
from scipy.special import gamma

randn = random.randn
```

In [72]:

```python
N = 10
mu = 1.2
sigma = 1
muvec = arange(0., 3, 0.01)
xvec = randn(N) + mu
xbar = xvec.mean()
posterior_flat_prior = exp(-N*(xbar - muvec)**2/(2*sigma**2))
```
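The algebra above can be checked numerically: the full $N$-term product likelihood and the single factor $e^{-N(\overline{x}-\mu)^2/2\sigma^2}$ should agree up to a $\mu$-independent constant. This is a standalone sketch (variable names and the random seed are ours, not from the notebook's own cells):

```python
import numpy as np

rng = np.random.default_rng(2)
N, mu_true, sigma = 10, 1.2, 1.0
x = rng.normal(mu_true, sigma, size=N)
xbar = x.mean()

muvec = np.arange(0.0, 3.0, 0.01)

# Full likelihood: product of N Gaussian factors, evaluated on the mu grid
full = np.prod(np.exp(-(x[None, :] - muvec[:, None])**2 / (2 * sigma**2)), axis=1)

# Factorized form: only exp(-N (xbar - mu)^2 / 2 sigma^2) depends on mu
factored = np.exp(-N * (xbar - muvec)**2 / (2 * sigma**2))

# After scaling each curve to unit peak, the two should be identical,
# because they differ only by the mu-independent exp(-N sigma_hat^2 / 2 sigma^2)
full /= full.max()
factored /= factored.max()
print(np.allclose(full, factored))
```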
In [73]:

```python
print(N)
print(xbar)
```

10
1.10537446678

In [74]:

```python
plt.plot(muvec, posterior_flat_prior)
plt.plot((xbar, xbar), (0., 1.), '--', label=r'$ \overline{x} $')
plt.plot((mu, mu), (0., 1.), '--', label=r'$\rm \mu = true \ mean $')
plt.xlabel(r'$\mu $', fontsize=18)
plt.ylabel(r'$\rm \propto \ posterior \ PDF \ of \ \mu $', fontsize=18)
plt.title('N = %d samples' % (N))
plt.legend(loc=1)
plt.show()
```
Note that:

- For small $N$ the sample mean and true mean differ substantially.
- In the frequentist view we expect the typical difference to be $\sigma/\sqrt{N}$. For $N = 10$, this is about a 30% error.
- In the Bayesian approach the width of the posterior PDF reflects this error.

For either approach one can infer the same thing about $\mu$:

Frequentist: $\mu = \overline{x} \pm \sigma_{\overline{x}} = \overline{x} \pm \sigma/\sqrt{N}$.

Bayesian: our knowledge of $\mu$ is contained in the posterior PDF, which can be integrated to give the CDF, from which we can establish a confidence interval for $\mu$ such as $\mu = \overline{x}\,^{+\delta\mu_+}_{-\delta\mu_-}$.

Extensions to higher dimensions

This is a simple one-dimensional case (one parameter, $\mu$) where we have pretended that we know $\sigma$ (for the individual $x_i$). More realistically, both parameters would be unknown. In that case the likelihood function is

$$\mathcal{L}(\mu, \sigma) = (2\pi\sigma^2)^{-N/2}\, e^{-N\hat{\sigma}^2/2\sigma^2}\, e^{-N(\overline{x}-\mu)^2/2\sigma^2}$$

and the posterior PDF for both parameters is

$$f(\mu, \sigma\,|\,\overline{x}, \hat{\sigma}) \propto f(\mu, \sigma)\, \mathcal{L}(\mu, \sigma) .$$

Now we have a two-dimensional PDF from which to make our conclusions.

Real-world problems can extend to hundreds of parameters. Navigating the posterior PDF to make conclusions is a big challenge. That is why methods like simulated annealing and Markov chain Monte Carlo (MCMC) have been developed.
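The Bayesian recipe above — integrate the posterior to get a CDF, then read off an interval — can be sketched as follows. The choice of an equal-tail 68% interval, the grid, and the variable names are ours; for the symmetric flat-prior posterior the half-width should come out close to $\sigma/\sqrt{N}$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, mu_true, sigma = 10, 1.2, 1.0
x = rng.normal(mu_true, sigma, size=N)
xbar = x.mean()

# Flat-prior posterior on a fine mu grid, normalized to unit area
muvec = np.arange(0.0, 3.0, 0.001)
dmu = muvec[1] - muvec[0]
post = np.exp(-N * (xbar - muvec)**2 / (2 * sigma**2))
post /= post.sum() * dmu

# Crude CDF by cumulative summation
cdf = np.cumsum(post) * dmu

# Equal-tail 68% credible interval: mu values where the CDF crosses 16% and 84%
lo = muvec[np.searchsorted(cdf, 0.16)]
hi = muvec[np.searchsorted(cdf, 0.84)]
print(lo, xbar, hi)
```

For a finer answer one would interpolate the CDF rather than take the nearest grid point, but the sketch shows the idea: the interval endpoints $\delta\mu_-$ and $\delta\mu_+$ come directly from the posterior CDF.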