STAT 499/962 Topics in Statistics: Bayesian Inference and Decision Theory
Jan 2018, Handout 01
Nasser Sadeghkhani, a.sadeghkhani@queensu.ca

There are two main schools of statistical inference: (1) the frequentist (or classical) school, and (2) the Bayesian school. Most of the methods you have seen so far are likely frequentist. It is important to understand both approaches.

Frequentist vs. Bayesian Methods

In frequentist inference, probabilities are interpreted as long-run frequencies, and the goal is to create procedures with long-run frequency guarantees. In Bayesian inference, probabilities are interpreted as subjective degrees of belief, and the goal is to state and analyze your beliefs. Some of the differences between the Bayesian and frequentist (non-Bayesian) approaches are discussed below.
To illustrate the difference, consider the following example.

Example 0.1. Assume X_i, i = 1, ..., n, are a random sample from N(θ, 1). We know that a 95% confidence interval for θ is given by

\[
P\left(\theta \in \left[\bar{X} - 1.96/\sqrt{n},\ \bar{X} + 1.96/\sqrt{n}\right]\right) = 0.95. \tag{0.1}
\]

Here the unknown parameter θ is a fixed quantity, and the interval is random because it is a function of the data. Equation (0.1) means that the interval [X̄ − 1.96/√n, X̄ + 1.96/√n] will trap the true value θ with probability 0.95.

In contrast, the Bayesian treats probability as belief, not frequency. The unknown parameter θ is given a prior distribution, say π(θ), representing our subjective beliefs about θ. After observing the X_i's, we update our beliefs and compute the posterior distribution π(θ | x_1, ..., x_n) (we will see how later on). One can then calculate and report

\[
P\left(\theta \in \left[\bar{x} - 1.96/\sqrt{n},\ \bar{x} + 1.96/\sqrt{n}\right] \,\middle|\, x_1, \ldots, x_n\right) = 0.95. \tag{0.2}
\]

Note that the probability in equation (0.2) is a degree-of-belief statement about the unknown parameter θ given the observed data, and it is not the same as the statement in equation (0.1). In particular, if we repeated the experiment many times, these intervals need not trap the true value 95 percent of the time.^1

Example 0.2. Let θ be the probability of a particular coin landing on heads, and suppose we want to test the hypotheses (with α = 0.05)

\[
H_0: \theta = \tfrac{1}{2} \quad \text{vs.} \quad H_1: \theta > \tfrac{1}{2},
\]

given that the following sequence of flips has been observed: {H, H, H, H, H, T}.^2

To perform a frequentist hypothesis test, we must define a random variable to describe the data. The proper way to do this depends on exactly which of the following two experiments was actually performed:

(a) Suppose the experiment was "flip six times and record the results." In this case, the random variable X counts the number of heads, so X ∼ Bin(6, θ). Here x = 5, and the p-value is P(X ≥ 5 | θ = 1/2) ≈ 0.11; since this is not less than α = 0.05, we do not reject H_0.

^1 We will see that an interval satisfying equation (0.2) is called a credible set (interval).
^2 H: heads, T: tails.
(b) In contrast, suppose the experiment was "flip until we get tails." In this case, X counts the number of flips until the first tail occurs, so X ∼ Geo(1 − θ), and the p-value is P(X ≥ 6 | θ = 1/2) ≈ 0.031; since this is less than α = 0.05, we reject H_0.

The conclusions are different! The result of the hypothesis test depends on whether we would have stopped flipping if we had gotten a tail sooner. Yet despite the differing conclusions, the likelihood function for the observed value of x is the same for both experiments (a) and (b), up to a constant:^3

\[
P(x \mid \theta) \propto \theta^{5} (1 - \theta).
\]

In the Bayesian approach, the data enter the analysis only through this likelihood, so we are guaranteed the same answer regardless of which experiment was performed.

Bayesian methods are widespread in statistics, especially in some applied areas, often for computational rather than philosophical reasons. Furthermore, many modern techniques in machine learning, data science, and neural networks build on Bayes' idea.

Bayes' theorem

The word "Bayesian" dates back to the 18th century and the English Reverend Thomas Bayes, who, along with Pierre-Simon Laplace, was among the first thinkers to consider the laws of chance and randomness in a quantitative, scientific way. Both Bayes and Laplace were aware of a relation that is now known as Bayes' Theorem: for a parameter θ with prior density P(θ) and data y with likelihood P(y | θ),

\[
P(\theta \mid y) = \frac{P(y \mid \theta)\, P(\theta)}{\int P(y \mid \theta)\, P(\theta)\, d\theta}.
\]

^3 It is often said that both have the same kernel.
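As a quick numerical check of the two p-values in Example 0.2, the following Python sketch (my own illustration, not part of the original handout) computes both tail probabilities under the null θ = 1/2:

```python
from math import comb

# Numerical check of the two p-values in Example 0.2, for the observed
# sequence {H, H, H, H, H, T} under the null hypothesis theta = 1/2.
theta0 = 0.5

# (a) Binomial experiment: flip 6 times, X = number of heads, observed x = 5.
#     p-value = P(X >= 5 | theta = 1/2).
p_binom = sum(comb(6, k) * theta0**k * (1 - theta0)**(6 - k) for k in (5, 6))

# (b) Geometric experiment: flip until the first tail, X = number of flips,
#     observed x = 6.  P(X >= 6 | theta = 1/2) is the probability that the
#     first five flips are all heads: theta^5.
p_geom = theta0**5

print(round(p_binom, 3))  # 0.109: not less than 0.05, do not reject H0
print(round(p_geom, 3))   # 0.031: less than 0.05, reject H0
```

The two experiments give opposite test conclusions from the same flips, even though both likelihoods are proportional to θ^5(1 − θ).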
Therefore, the Bayesian method comprises the following principal steps:

(1) Prior: Obtain the prior density P(θ) (alternative notation: π(θ)), which expresses our knowledge about θ prior to observing the data.

(2) Likelihood: Obtain the likelihood function P(y | θ) (or L(θ; y)). This step simply describes the process giving rise to the data y in terms of θ.

(3) Posterior: Apply Bayes' theorem to derive the posterior density P(θ | y), which expresses all that is known about θ after observing the data y.

(4) Inference: Derive appropriate inference statements from the posterior distribution, e.g. point estimates, interval estimates, and probabilities of specified hypotheses.

Miscellaneous applications

Bayesian statistics is applied in a large variety of fields, including:

- Economics (econometrics), to make decisions that optimize benefit under uncertainty.
- Biostatistics, where it enables experts to use their domain knowledge in the inference.
- Machine learning, which often uses nonparametric Bayesian models that adaptively become more complex as more data become available. (We focus on parametric models in this course.)

Advantages and Disadvantages of Being Bayesian

Some advantages:
- Bayesian logic and interpretation are simple; scientific questions can often be easily framed as inferential questions.
- Bayesian inference is simple in principle and provides a single recipe for coherent inference, all based on the posterior.
- Prior information can be used, allowing one to combine various sources of information, including constraints.
- Bayesian inference naturally deals with conditioning, marginalization, and nuisance parameters.
- Parameter uncertainty is naturally accounted for.
- Bayesian inference meshes naturally with decision theory.
- Modern computational techniques allow models to be fit under a Bayesian approach that cannot be fit in other ways.
- Bayesian results often have good frequentist properties, and frequentist inference is sometimes a special case of Bayesian results under a particular prior.
- Complicated hierarchical models can be constructed naturally in a Bayesian framework.
- Bayesian inference naturally penalizes complex models.
- Bayesian inference can deal with multiple testing inherently, if set up properly as a joint inference problem.

Some disadvantages:

- Computing the posterior, while simple in theory, is often difficult and time consuming in practice.
- Bayesian inference is model based, and some classical methods and models may not generalize (partial likelihood, nonparametric testing, robust estimation, marginal models).
- Sensitivity to prior selection: the posterior may be heavily influenced by the prior (informative prior + small data size).
- High computational cost.
- Simulation provides slightly different answers from run to run.
- No guarantee of Markov chain Monte Carlo (MCMC) convergence.

In brief, Bayesian statistics may be preferable to frequentist statistics when a researcher wants to combine knowledge modeling (information from experts, or pre-existing information) with knowledge discovery (data, evidence) to support decision making (analytics, simulation, diagnosis, and optimization) and risk management.

We now try to motivate the use of priors on parameters, and indeed to motivate the very use of parameters.

Definition 0.1. (Infinite exchangeability) We say that a sequence of random variables y_1, y_2, ... is infinitely exchangeable if, for any n, the joint probability p(y_1, ..., y_n) is invariant to permutations of the indices. That is, for any permutation π,

\[
p(y_1, \ldots, y_n) = p(y_{\pi_1}, \ldots, y_{\pi_n}).
\]

A key assumption of many statistical analyses is that the random variables being studied are independent and identically distributed (iid). Note that iid random variables are always infinitely exchangeable; however, the converse is not necessarily true. For example, let y_1, y_2, ... be iid, and let y_0 be a non-trivial random variable independent of the rest. Then y_0 + y_1, y_0 + y_2, ... is infinitely exchangeable but not iid. The strength of infinite exchangeability lies in the following theorem.

Theorem 0.1. (De Finetti) A sequence of random variables y_1, y_2, ... is infinitely exchangeable iff, for all n,

\[
p(y_1, \ldots, y_n) = \int \prod_{i=1}^{n} P(y_i \mid \theta)\, P(d\theta)
\]

for some measure P on θ.

Clearly, since the product ∏_{i=1}^n P(y_i | θ) is invariant to reordering, any sequence distribution that can be written as ∫ ∏_{i=1}^n P(y_i | θ) P(dθ) for all n must be (infinitely) exchangeable. The other direction, though, is much deeper. It says that if we have exchangeable data, then:

- There must exist a parameter θ.
- There must exist a likelihood P(y | θ).
- There must exist a distribution P on θ.

Thus, the theorem provides an answer to the questions of why we should use parameters and why we should put priors on parameters.
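Returning to the exchangeable-but-not-iid example above: with y_0, y_1, y_2 iid N(0, 1), the variables z_i = y_0 + y_i each have variance 2 and covariance Var(y_0) = 1, so their correlation is 1/2 — identically distributed but not independent. A small simulation (my own sketch, not from the handout) makes this concrete:

```python
import random

random.seed(0)

# z1 = y0 + y1 and z2 = y0 + y2 share the common component y0, so the
# pair (z1, z2) is exchangeable (its joint law is symmetric under swapping)
# but the two variables are correlated, hence not independent.
N = 100_000
z1, z2 = [], []
for _ in range(N):
    y0 = random.gauss(0, 1)
    z1.append(y0 + random.gauss(0, 1))
    z2.append(y0 + random.gauss(0, 1))

m1, m2 = sum(z1) / N, sum(z2) / N
cov = sum((a - m1) * (b - m2) for a, b in zip(z1, z2)) / N
var1 = sum((a - m1) ** 2 for a in z1) / N
var2 = sum((b - m2) ** 2 for b in z2) / N
corr = cov / (var1 * var2) ** 0.5
print(round(corr, 2))  # close to the theoretical correlation 0.5
```

De Finetti's theorem explains this structure: conditional on the shared component (here, y_0 plays the role of θ), the z_i are iid.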
Example 0.3. (Bayes, 1764) A billiard ball W rolls on a line of length one, with uniform probability of stopping anywhere; it stops at θ, i.e. θ ∼ U(0, 1). A second ball O is then rolled n times under the same conditions, and Y denotes the number of times O stopped to the left of W. What is the posterior of θ given y?

\[
\pi(\theta \mid y) \propto \pi(\theta)\, P(y \mid \theta) \propto \theta^{y} (1 - \theta)^{n - y},
\]

which is the kernel of a Beta(y + 1, n − y + 1) density. Therefore θ | y ∼ Beta(y + 1, n − y + 1) and E(θ | y) = (y + 1)/(n + 2). It is also easy to show that the maximum a posteriori (MAP)^4 and maximum likelihood (ML) estimators coincide: θ̂_MAP = θ̂_ML = y/n.

Example 0.4. If Y ∼ Bin(n, θ) and θ ∼ Beta(α, β) (α = β = 1 is the particular case of Example 0.3), then θ | y ∼ Beta(y + α, n − y + β).

Remark 0.1. The Bayesian approach enjoys a specific kind of coherence: not only does the order in which iid observations are collected not matter, but also updating the prior one observation at a time, or with all observations together, yields the same posterior. In other words,

\[
\pi(\theta \mid y_1, \ldots, y_n)
= \frac{P(y_n \mid \theta)\, \pi(\theta \mid y_1, \ldots, y_{n-1})}
       {\int P(y_n \mid \theta)\, \pi(\theta \mid y_1, \ldots, y_{n-1})\, d\theta}
= \frac{P(y_n \mid \theta)\, P(y_{n-1} \mid \theta)\, \pi(\theta \mid y_1, \ldots, y_{n-2})}
       {\int P(y_n \mid \theta)\, P(y_{n-1} \mid \theta)\, \pi(\theta \mid y_1, \ldots, y_{n-2})\, d\theta}
= \cdots
= \frac{P(y_n \mid \theta)\, P(y_{n-1} \mid \theta) \cdots P(y_1 \mid \theta)\, \pi(\theta)}
       {\int P(y_n \mid \theta)\, P(y_{n-1} \mid \theta) \cdots P(y_1 \mid \theta)\, \pi(\theta)\, d\theta}.
\]

^4 θ̂_MAP = argmax_θ π(θ | y).

The End.
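Postscript: the coherence property in Remark 0.1 is easy to verify numerically in the conjugate Beta-Binomial setting of Example 0.4 — updating the prior one observation at a time lands on the same posterior as a single batch update. A minimal Python sketch (my own illustration; the flip sequence is the one from Example 0.2):

```python
# Beta-Binomial conjugate updating: a Beta(a, b) prior combined with one
# Bernoulli observation y in {0, 1} gives the posterior Beta(a + y, b + 1 - y).
def update(a, b, y):
    return a + y, b + 1 - y

flips = [1, 1, 1, 1, 1, 0]  # the sequence H, H, H, H, H, T from Example 0.2

# Sequential: fold in one observation at a time, starting from Beta(1, 1).
a, b = 1, 1
for y in flips:
    a, b = update(a, b, y)

# Batch: add all successes and failures to the prior in one step.
n, s = len(flips), sum(flips)
a_batch, b_batch = 1 + s, 1 + (n - s)

print((a, b) == (a_batch, b_batch))  # True: same posterior either way
print(a / (a + b))                   # 0.75, the posterior mean (s + 1) / (n + 2)
```

Both routes give θ | y ∼ Beta(6, 2), matching Example 0.3's formula E(θ | y) = (y + 1)/(n + 2) with y = 5, n = 6.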