DS-GA 1002 Lecture notes 11, Fall 2016: Bayesian statistics


In the frequentist paradigm we model the data as realizations of a distribution that depends on deterministic parameters. In contrast, in Bayesian statistics these parameters are modeled as random variables, which allows us to quantify our uncertainty about them.

1 Learning Bayesian models

We consider the problem of learning a parametrized distribution from a set of data. In the Bayesian framework, we model the data $x \in \mathbb{R}^n$ as a realization of a random vector $X$, which depends on a vector of parameters $\Theta$ that is also random. This requires two modeling choices:

1. The prior distribution is the distribution of $\Theta$, which encodes our uncertainty about the model before seeing the data.

2. The likelihood is the conditional distribution of $X$ given $\Theta$, which specifies how the data depend on the parameters. In contrast to the frequentist framework, the likelihood is not interpreted as a deterministic function of the parameters.

Our goal when learning a Bayesian model is to compute the posterior distribution of the parameters $\Theta$ given $X$. Evaluating this posterior distribution at the realization $x$ allows us to update our uncertainty about $\Theta$ using the data. The following example illustrates Bayesian inference applied to fitting the parameter of a Bernoulli random variable from iid realizations.

Example 1.1 (Bernoulli distribution). Let $x$ be a vector of data that we wish to model as iid samples from a Bernoulli distribution. Since we are taking a Bayesian approach, we choose a prior distribution for the parameter of the Bernoulli. We will consider two different Bayesian estimators $\Theta_1$ and $\Theta_2$:

1. $\Theta_1$ represents a conservative estimator in terms of prior information. We assign a uniform pdf to the parameter, so any value in the unit interval has the same probability density:
$$f_{\Theta_1}(\theta) = \begin{cases} 1 & \text{for } 0 \le \theta \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

Figure 1: Posterior distributions of the parameter of a Bernoulli for two different priors and for different data realizations. The panels show the prior distributions and the posteriors for increasing amounts of data (different values of $n_0$ and $n_1$), and mark the posterior mean under the uniform prior, the posterior mean under the skewed prior, and the ML estimator.

2. $\Theta_2$ is an estimator that assumes that the parameter is closer to 1 than to 0. We could use it, for instance, to capture the suspicion that a coin is biased towards heads. We choose a skewed pdf that increases linearly from zero to one:
$$f_{\Theta_2}(\theta) = \begin{cases} 2\theta & \text{for } 0 \le \theta \le 1, \\ 0 & \text{otherwise.} \end{cases}$$

By the iid assumption, the likelihood, which is just the conditional pmf of the data given the parameter of the Bernoulli, equals
$$p_{X \mid \Theta}(x \mid \theta) = \theta^{n_1} (1-\theta)^{n_0},$$
where $n_1$ is the number of ones in the data and $n_0$ the number of zeros (see Example 1.3 in Lecture Notes 9). The posterior pdfs corresponding to the two priors are consequently equal to
$$f_{\Theta_1 \mid X}(\theta \mid x) = \frac{f_{\Theta_1}(\theta)\, p_{X \mid \Theta_1}(x \mid \theta)}{p_X(x)} = \frac{f_{\Theta_1}(\theta)\, p_{X \mid \Theta_1}(x \mid \theta)}{\int_0^1 f_{\Theta_1}(u)\, p_{X \mid \Theta_1}(x \mid u)\, du} = \frac{\theta^{n_1}(1-\theta)^{n_0}}{\int_0^1 u^{n_1}(1-u)^{n_0}\, du} = \frac{\theta^{n_1}(1-\theta)^{n_0}}{\beta(n_1+1,\, n_0+1)},$$
$$f_{\Theta_2 \mid X}(\theta \mid x) = \frac{f_{\Theta_2}(\theta)\, p_{X \mid \Theta_2}(x \mid \theta)}{p_X(x)} = \frac{\theta^{n_1+1}(1-\theta)^{n_0}}{\int_0^1 u^{n_1+1}(1-u)^{n_0}\, du} = \frac{\theta^{n_1+1}(1-\theta)^{n_0}}{\beta(n_1+2,\, n_0+1)},$$
where
$$\beta(a, b) := \int_0^1 u^{a-1}(1-u)^{b-1}\, du$$
is a special function called the beta function or Euler integral of the first kind, which is tabulated.

Figure 1 shows the posterior distributions for different values of $n_1$ and $n_0$. It also shows the maximum-likelihood estimator of the parameter, which is just $n_1/(n_0+n_1)$ (see Example 1.3 in Lecture Notes 9). For a small number of flips, the posterior pdf of $\Theta_2$ is skewed to the right with respect to that of $\Theta_1$, reflecting the prior belief that the parameter is closer to 1. However, for a large number of flips both posterior densities are very close.
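The following short numerical sketch (not part of the original notes) evaluates the two posterior densities derived above on a grid; the counts $n_0$, $n_1$ and the grid resolution are arbitrary illustrative choices, and the normalizing constants are computed with scipy.special.beta.

```python
# Illustrative sketch: evaluate the two posteriors of Example 1.1 on a grid.
# The counts n1, n0 are hypothetical; any nonnegative integers work.
import numpy as np
from scipy.special import beta as beta_fn

n1, n0 = 3, 1                                   # ones and zeros observed in the data
theta = np.linspace(0.0, 1.0, 1001)

# Posterior under the uniform prior (Theta_1).
post_uniform = theta**n1 * (1 - theta)**n0 / beta_fn(n1 + 1, n0 + 1)
# Posterior under the skewed prior f(theta) = 2*theta (Theta_2).
post_skewed = theta**(n1 + 1) * (1 - theta)**n0 / beta_fn(n1 + 2, n0 + 1)

print("ML estimate:", n1 / (n0 + n1))
# Both densities should integrate to approximately 1.
print("normalization check:", np.trapz(post_uniform, theta), np.trapz(post_skewed, theta))
```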

2 Conjugate priors

Both posterior distributions in Example 1.1 are beta distributions.

Definition 2.1 (Beta distribution). The pdf of a beta distribution with parameters $a$ and $b$ is defined as
$$f_\beta(\theta;\, a, b) := \begin{cases} \dfrac{\theta^{a-1}(1-\theta)^{b-1}}{\beta(a,b)} & \text{if } 0 \le \theta \le 1, \\ 0 & \text{otherwise,} \end{cases}$$
where
$$\beta(a, b) := \int_0^1 u^{a-1}(1-u)^{b-1}\, du.$$

In fact, the uniform prior of $\Theta_1$ in Example 1.1 is also a beta distribution, with parameters $a = 1$ and $b = 1$. This is also the case for the skewed prior of $\Theta_2$: it is a beta distribution with parameters $a = 2$ and $b = 1$. Since the prior and the posterior belong to the same family, we can view the posterior as an updated estimate of the parameters of the distribution. When the prior and posterior are guaranteed to belong to the same family of distributions for a particular likelihood, the distributions are called conjugate priors.

Definition 2.2 (Conjugate priors). A conjugate family of distributions for a certain likelihood satisfies the following property: if the prior belongs to the family, then the posterior also belongs to the family.

Beta distributions are conjugate priors when the likelihood is binomial.

Theorem 2.3 (The beta distribution is conjugate to the binomial likelihood). If the prior distribution of $\Theta$ is a beta distribution with parameters $a$ and $b$, and conditioned on $\Theta = \theta$ the data $X$ are binomial with parameters $n$ and $\theta$, then the posterior distribution of $\Theta$ given $X = x$ is a beta distribution with parameters $x + a$ and $n - x + b$.

Proof.
$$f_{\Theta \mid X}(\theta \mid x) = \frac{f_\Theta(\theta)\, p_{X \mid \Theta}(x \mid \theta)}{p_X(x)} = \frac{f_\Theta(\theta)\, p_{X \mid \Theta}(x \mid \theta)}{\int_0^1 f_\Theta(u)\, p_{X \mid \Theta}(x \mid u)\, du} = \frac{\theta^{a-1}(1-\theta)^{b-1}\binom{n}{x}\theta^{x}(1-\theta)^{n-x}}{\int_0^1 u^{a-1}(1-u)^{b-1}\binom{n}{x}u^{x}(1-u)^{n-x}\, du} = \frac{\theta^{x+a-1}(1-\theta)^{n-x+b-1}}{\int_0^1 u^{x+a-1}(1-u)^{n-x+b-1}\, du} = f_\beta(\theta;\, x+a,\, n-x+b).$$

Note that the posteriors obtained in Example 1.1 follow immediately from the theorem.

Example 2.4 (Poll in New Mexico). In a poll in New Mexico with 449 participants, 227 people intend to vote for Clinton and 202 for Trump (the data are from a real poll [1], but for simplicity we are ignoring the other candidates and the people that were undecided). Our aim is to use a Bayesian framework to predict the outcome of the election in New Mexico using these data.

We model the fraction of people that vote for Trump as a random variable $\Theta$. We assume that the $n$ people in the poll are chosen uniformly at random with replacement from the population, so given $\Theta = \theta$ the number of Trump voters is binomial with parameters $n$ and $\theta$. We don't have any additional information about the possible value of $\Theta$, so we assume it is uniform, or equivalently a beta distribution with parameters $a := 1$ and $b := 1$. By Theorem 2.3 the posterior distribution of $\Theta$ given the data that we observe is a beta distribution with parameters $a := 203$ and $b := 228$. The corresponding probability that $\Theta \ge 0.5$ is 11.4%, which is our estimate for the probability that Trump wins in New Mexico.

Figure 2: Posterior distribution of the fraction of Trump voters in New Mexico conditioned on the poll data in Example 2.4; the probability of the region above 0.5 is 11.4%.

[1] The poll results are taken from
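As a sanity check, here is a minimal sketch (not part of the original notes) of the conjugate update in Theorem 2.3 applied to the poll; it assumes the counts stated in Example 2.4 and uses scipy.stats.beta for the posterior tail probability.

```python
# Illustrative sketch: beta-binomial update for Example 2.4.
from scipy.stats import beta

a, b = 1, 1                          # uniform prior, Beta(1, 1)
x, n = 202, 429                      # Trump respondents, decided respondents (202 + 227)
posterior = beta(x + a, n - x + b)   # Beta(203, 228) by Theorem 2.3

# Estimated probability that Trump wins: P(Theta >= 0.5) under the posterior.
print(posterior.sf(0.5))             # approximately 0.114, i.e. about 11.4%
```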

3 Bayesian estimators

As we have seen, the Bayesian approach to learning probabilistic models yields the posterior distribution of the parameters of interest, as opposed to a single estimate. In this section we describe two possible estimators that are derived from the posterior distribution.

3.1 Minimum mean-square-error estimation

The mean of the posterior distribution is the conditional expectation of the parameters given the data. Choosing the posterior mean as an estimator for the parameters $\Theta$ has a strong theoretical justification: it is guaranteed to achieve the minimum mean square error (MSE) among all possible estimators.

Theorem 3.1 (The posterior mean minimizes the MSE). The posterior mean is the minimum mean-square-error (MMSE) estimate of the parameter $\Theta$ given the data $X$. To be more precise, let us define
$$\theta_{\mathrm{MMSE}}(x) := \mathrm{E}\left( \Theta \mid X = x \right).$$
For any arbitrary estimator $\theta_{\mathrm{other}}(x)$,
$$\mathrm{E}\left( \left( \theta_{\mathrm{other}}(X) - \Theta \right)^2 \right) \ge \mathrm{E}\left( \left( \theta_{\mathrm{MMSE}}(X) - \Theta \right)^2 \right).$$

Proof. We begin by computing the MSE of the arbitrary estimator conditioned on $X = x$ in terms of the conditional expectation of $\Theta$ given $X$:
$$\mathrm{E}\left( \left( \theta_{\mathrm{other}}(X) - \Theta \right)^2 \mid X = x \right) = \mathrm{E}\left( \left( \theta_{\mathrm{other}}(X) - \theta_{\mathrm{MMSE}}(X) + \theta_{\mathrm{MMSE}}(X) - \Theta \right)^2 \mid X = x \right)$$
$$= \left( \theta_{\mathrm{other}}(x) - \theta_{\mathrm{MMSE}}(x) \right)^2 + \mathrm{E}\left( \left( \theta_{\mathrm{MMSE}}(X) - \Theta \right)^2 \mid X = x \right) + 2 \left( \theta_{\mathrm{other}}(x) - \theta_{\mathrm{MMSE}}(x) \right) \left( \theta_{\mathrm{MMSE}}(x) - \mathrm{E}\left( \Theta \mid X = x \right) \right)$$
$$= \left( \theta_{\mathrm{other}}(x) - \theta_{\mathrm{MMSE}}(x) \right)^2 + \mathrm{E}\left( \left( \theta_{\mathrm{MMSE}}(X) - \Theta \right)^2 \mid X = x \right),$$
where the cross term vanishes because $\theta_{\mathrm{MMSE}}(x) = \mathrm{E}(\Theta \mid X = x)$. By iterated expectation,
$$\mathrm{E}\left( \left( \theta_{\mathrm{other}}(X) - \Theta \right)^2 \right) = \mathrm{E}\left( \mathrm{E}\left( \left( \theta_{\mathrm{other}}(X) - \Theta \right)^2 \mid X \right) \right)$$
$$= \mathrm{E}\left( \left( \theta_{\mathrm{other}}(X) - \theta_{\mathrm{MMSE}}(X) \right)^2 \right) + \mathrm{E}\left( \mathrm{E}\left( \left( \theta_{\mathrm{MMSE}}(X) - \Theta \right)^2 \mid X \right) \right)$$
$$= \mathrm{E}\left( \left( \theta_{\mathrm{other}}(X) - \theta_{\mathrm{MMSE}}(X) \right)^2 \right) + \mathrm{E}\left( \left( \theta_{\mathrm{MMSE}}(X) - \Theta \right)^2 \right)$$
$$\ge \mathrm{E}\left( \left( \theta_{\mathrm{MMSE}}(X) - \Theta \right)^2 \right),$$
since the expectation of a nonnegative quantity is nonnegative.

Example 3.2 (Bernoulli distribution, continued). In order to obtain point estimates for the parameter in Example 1.1 we compute the posterior means:
$$\mathrm{E}\left( \Theta_1 \mid X = x \right) = \int_0^1 \theta\, f_{\Theta_1 \mid X}(\theta \mid x)\, d\theta = \frac{\int_0^1 \theta^{n_1+1}(1-\theta)^{n_0}\, d\theta}{\beta(n_1+1,\, n_0+1)} = \frac{\beta(n_1+2,\, n_0+1)}{\beta(n_1+1,\, n_0+1)},$$
$$\mathrm{E}\left( \Theta_2 \mid X = x \right) = \int_0^1 \theta\, f_{\Theta_2 \mid X}(\theta \mid x)\, d\theta = \frac{\beta(n_1+3,\, n_0+1)}{\beta(n_1+2,\, n_0+1)}.$$
Figure 1 shows the posterior means for different values of $n_0$ and $n_1$.
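A brief sketch (not part of the original notes) of these posterior means, using scipy.special.beta; the counts are illustrative. The ratios of beta functions simplify to $(n_1+1)/(n_0+n_1+2)$ for the uniform prior and $(n_1+2)/(n_0+n_1+3)$ for the skewed prior, which the code uses as a cross-check.

```python
# Illustrative sketch: MMSE (posterior mean) estimates of Example 3.2.
from scipy.special import beta as beta_fn

n1, n0 = 3, 1                        # hypothetical counts of ones and zeros

mean_uniform = beta_fn(n1 + 2, n0 + 1) / beta_fn(n1 + 1, n0 + 1)
mean_skewed = beta_fn(n1 + 3, n0 + 1) / beta_fn(n1 + 2, n0 + 1)

print(mean_uniform, (n1 + 1) / (n0 + n1 + 2))   # the two values should agree
print(mean_skewed, (n1 + 2) / (n0 + n1 + 3))    # the two values should agree
print("ML estimate:", n1 / (n0 + n1))
```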

3.2 Maximum-a-posteriori estimation

An alternative to the posterior mean is the posterior mode, which is the maximum of the pdf or the pmf of the posterior distribution.

Definition 3.3 (Maximum-a-posteriori estimator). The maximum-a-posteriori (MAP) estimator of a parameter $\Theta$ given data $x$, modeled as a realization of a random vector $X$, is
$$\theta_{\mathrm{MAP}}(x) := \arg\max_\theta\, p_{\Theta \mid X}(\theta \mid x)$$
if $\Theta$ is modeled as a discrete random variable, and
$$\theta_{\mathrm{MAP}}(x) := \arg\max_\theta\, f_{\Theta \mid X}(\theta \mid x)$$
if it is modeled as a continuous random variable.

In Figure 1 the ML estimator coincides with the mode (maximum value) of the posterior distribution when the prior is uniform. This is not a coincidence: under a uniform prior the MAP and ML estimates are the same.

Lemma 3.4. The maximum-likelihood estimator of a parameter $\Theta$ is the mode (maximum value) of the pdf of the posterior distribution given the data $X$ if its prior distribution is uniform.

Proof. We prove the result when the model for the data and the parameters is continuous; if either or both of them are discrete, the proof is identical (in that case the ML estimator is the mode of the pmf of the posterior). If the prior distribution of the parameters is uniform, then $f_\Theta(\theta)$ is constant for any $\theta$, which implies
$$\arg\max_\theta\, f_{\Theta \mid X}(\theta \mid x) = \arg\max_\theta\, \frac{f_{X \mid \Theta}(x \mid \theta)\, f_\Theta(\theta)}{\int f_\Theta(u)\, f_{X \mid \Theta}(x \mid u)\, du} = \arg\max_\theta\, f_{X \mid \Theta}(x \mid \theta) = \arg\max_\theta\, L_x(\theta),$$
since the remaining terms do not depend on $\theta$.
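The lemma is easy to check numerically. The sketch below (not part of the original notes) evaluates the posterior of the Bernoulli parameter under a uniform prior on a fine grid and compares its maximizer with the ML estimate; the counts and grid resolution are arbitrary choices.

```python
# Illustrative sketch: with a uniform prior, the posterior mode matches the ML estimate.
import numpy as np

n1, n0 = 7, 3                                           # hypothetical counts
theta = np.linspace(0.0, 1.0, 100001)

# Uniform prior: the posterior is proportional to the likelihood,
# so normalization does not affect the argmax.
posterior_unnormalized = theta**n1 * (1 - theta)**n0

print("MAP estimate:", theta[np.argmax(posterior_unnormalized)])   # ~ n1 / (n0 + n1)
print("ML  estimate:", n1 / (n0 + n1))
```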

Note that uniform priors are only well defined in situations where the parameter is restricted to a bounded set.

We now describe a situation in which the MAP estimator is optimal. If the parameter $\Theta$ can only take a discrete set of values, then the MAP estimator minimizes the probability of making the wrong choice.

Theorem 3.5 (MAP estimator minimizes the probability of error). Let $\Theta$ be a discrete random vector and let $X$ be a random vector modeling the data. We define
$$\theta_{\mathrm{MAP}}(x) := \arg\max_\theta\, p_{\Theta \mid X}(\theta \mid x).$$
For any arbitrary estimator $\theta_{\mathrm{other}}(x)$,
$$\mathrm{P}\left( \theta_{\mathrm{other}}(X) \ne \Theta \right) \ge \mathrm{P}\left( \theta_{\mathrm{MAP}}(X) \ne \Theta \right).$$
In words, the MAP estimator minimizes the probability of error.

Proof. We assume that $X$ is a continuous random vector, but the same argument applies if it is discrete. We have
$$\mathrm{P}\left( \Theta = \theta_{\mathrm{other}}(X) \right) = \int_x f_X(x)\, \mathrm{P}\left( \Theta = \theta_{\mathrm{other}}(x) \mid X = x \right) dx = \int_x f_X(x)\, p_{\Theta \mid X}\left( \theta_{\mathrm{other}}(x) \mid x \right) dx \le \int_x f_X(x)\, p_{\Theta \mid X}\left( \theta_{\mathrm{MAP}}(x) \mid x \right) dx = \mathrm{P}\left( \Theta = \theta_{\mathrm{MAP}}(X) \right),$$
where the inequality follows from the definition of the MAP estimator as the mode of the posterior.

Example 3.6 (Sending bits). We consider a very simple model for a communication channel in which we aim to send a signal $\Theta$ consisting of a single bit. Our prior knowledge indicates that the signal is equal to one with probability 1/4:
$$p_\Theta(1) = \frac{1}{4}, \qquad p_\Theta(0) = \frac{3}{4}.$$
Due to the presence of noise in the channel, we send the signal $n$ times. At the receiver we observe
$$X_i = \Theta + Z_i, \qquad 1 \le i \le n,$$
where $Z_1, \ldots, Z_n$ are iid standard Gaussian random variables.

Modeling perturbations as Gaussian is a popular choice in communications. It is justified by the central limit theorem, under the assumption that the noise is a combination of many small effects that are approximately independent. We will now compute and compare the ML and MAP estimators of $\Theta$ given the observations.

The likelihood is equal to
$$L_x(\theta) = \prod_{i=1}^n f_{X_i \mid \Theta}(x_i \mid \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}}\, e^{-(x_i - \theta)^2/2}.$$
It is easier to deal with the log-likelihood function,
$$\log L_x(\theta) = -\sum_{i=1}^n \frac{(x_i - \theta)^2}{2} - \frac{n}{2} \log(2\pi).$$
Since $\Theta$ only takes two values, we can compare directly. We choose $\theta_{\mathrm{ML}}(x) = 1$ if
$$\log L_x(1) = -\sum_{i=1}^n \frac{x_i^2}{2} + \sum_{i=1}^n x_i - \frac{n}{2} - \frac{n}{2} \log(2\pi) \ge -\sum_{i=1}^n \frac{x_i^2}{2} - \frac{n}{2} \log(2\pi) = \log L_x(0).$$
Equivalently,
$$\theta_{\mathrm{ML}}(x) = \begin{cases} 1 & \text{if } \frac{1}{n}\sum_{i=1}^n x_i > \frac{1}{2}, \\ 0 & \text{otherwise.} \end{cases}$$
The rule makes a lot of sense: if the sample mean of the data is closer to 1 than to 0, then our estimate is equal to 1. By the law of total probability, the probability of error of this estimator is equal to
$$\mathrm{P}\left( \Theta \ne \theta_{\mathrm{ML}}(X) \right) = \mathrm{P}\left( \theta_{\mathrm{ML}}(X) \ne \Theta \mid \Theta = 0 \right) \mathrm{P}(\Theta = 0) + \mathrm{P}\left( \theta_{\mathrm{ML}}(X) \ne \Theta \mid \Theta = 1 \right) \mathrm{P}(\Theta = 1)$$
$$= \mathrm{P}\left( \frac{1}{n}\sum_{i=1}^n X_i > \frac{1}{2} \,\Big|\, \Theta = 0 \right) \mathrm{P}(\Theta = 0) + \mathrm{P}\left( \frac{1}{n}\sum_{i=1}^n X_i < \frac{1}{2} \,\Big|\, \Theta = 1 \right) \mathrm{P}(\Theta = 1) = Q\left( \frac{\sqrt{n}}{2} \right),$$
where the last equality follows from the fact that if we condition on $\Theta = \theta$ the sample mean is Gaussian with mean $\theta$ and variance $1/n$ (see Lecture Notes 6).
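A small Monte Carlo sketch (not part of the original notes) of this decision rule: it draws $\Theta$ from the prior, simulates the channel, and compares the empirical error rate with $Q(\sqrt{n}/2)$, where $Q$ denotes the standard Gaussian tail function (norm.sf). The values of $n$ and the number of trials are arbitrary.

```python
# Illustrative sketch: empirical error of the ML rule vs Q(sqrt(n)/2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, trials = 10, 200_000

theta = rng.random(trials) < 0.25                       # Theta = 1 with probability 1/4
x = theta[:, None] + rng.standard_normal((trials, n))   # X_i = Theta + Z_i
decision = x.mean(axis=1) > 0.5                         # ML rule: threshold the sample mean at 1/2

print("empirical error:", np.mean(decision != theta))
print("Q(sqrt(n)/2):   ", norm.sf(np.sqrt(n) / 2))
```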

To compute the MAP estimate we must find the maximum of the posterior distribution of $\Theta$ given the observed data. Equivalently, we find the maximum of the logarithm of the posterior pmf (this is equivalent because the logarithm is a monotone function):
$$\log p_{\Theta \mid X}(\theta \mid x) = \log \frac{\prod_{i=1}^n f_{X_i \mid \Theta}(x_i \mid \theta)\, p_\Theta(\theta)}{f_X(x)} = \sum_{i=1}^n \log f_{X_i \mid \Theta}(x_i \mid \theta) + \log p_\Theta(\theta) - \log f_X(x)$$
$$= -\sum_{i=1}^n \frac{(x_i - \theta)^2}{2} - \frac{n}{2} \log(2\pi) + \log p_\Theta(\theta) - \log f_X(x).$$
We compare the value of this function for the two possible values of $\Theta$: 0 and 1. We choose $\theta_{\mathrm{MAP}}(x) = 1$ if
$$\log p_{\Theta \mid X}(1 \mid x) + \log f_X(x) = -\sum_{i=1}^n \frac{x_i^2}{2} + \sum_{i=1}^n x_i - \frac{n}{2} - \frac{n}{2} \log(2\pi) - \log 4$$
$$\ge -\sum_{i=1}^n \frac{x_i^2}{2} - \frac{n}{2} \log(2\pi) - \log 4 + \log 3 = \log p_{\Theta \mid X}(0 \mid x) + \log f_X(x).$$
Equivalently,
$$\theta_{\mathrm{MAP}}(x) = \begin{cases} 1 & \text{if } \frac{1}{n}\sum_{i=1}^n x_i > \frac{1}{2} + \frac{\log 3}{n}, \\ 0 & \text{otherwise.} \end{cases}$$
The MAP estimate shifts the threshold with respect to the ML estimate to take into account that $\Theta$ is more prone to equal zero. However, the correction term tends to zero as we gather more evidence, so if a lot of data is available the two estimators will be very similar. The probability of error of the MAP estimator is equal to
$$\mathrm{P}\left( \Theta \ne \theta_{\mathrm{MAP}}(X) \right) = \mathrm{P}\left( \theta_{\mathrm{MAP}}(X) \ne \Theta \mid \Theta = 0 \right) \mathrm{P}(\Theta = 0) + \mathrm{P}\left( \theta_{\mathrm{MAP}}(X) \ne \Theta \mid \Theta = 1 \right) \mathrm{P}(\Theta = 1)$$
$$= \mathrm{P}\left( \frac{1}{n}\sum_{i=1}^n X_i > \frac{1}{2} + \frac{\log 3}{n} \,\Big|\, \Theta = 0 \right) \frac{3}{4} + \mathrm{P}\left( \frac{1}{n}\sum_{i=1}^n X_i < \frac{1}{2} + \frac{\log 3}{n} \,\Big|\, \Theta = 1 \right) \frac{1}{4}$$
$$= \frac{3}{4}\, Q\left( \frac{\sqrt{n}}{2} + \frac{\log 3}{\sqrt{n}} \right) + \frac{1}{4}\, Q\left( \frac{\sqrt{n}}{2} - \frac{\log 3}{\sqrt{n}} \right).$$

Figure 3: Probability of error of the ML and MAP estimators in Example 3.6 for different values of $n$.

We compare the probability of error of the ML and MAP estimators in Figure 3. MAP estimation results in better performance, but the difference becomes small as $n$ increases.
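The comparison in Figure 3 can be reproduced from the closed-form expressions derived above; the sketch below (not part of the original notes) evaluates both error probabilities for a range of $n$, where the range is an arbitrary choice.

```python
# Illustrative sketch: error probabilities of the ML and MAP estimators vs n.
import numpy as np
from scipy.stats import norm

Q = norm.sf                                   # standard Gaussian tail function
n = np.arange(1, 21)

err_ml = Q(np.sqrt(n) / 2)
err_map = 0.75 * Q(np.sqrt(n) / 2 + np.log(3) / np.sqrt(n)) \
          + 0.25 * Q(np.sqrt(n) / 2 - np.log(3) / np.sqrt(n))

for ni, e_ml, e_map in zip(n, err_ml, err_map):
    print(f"n = {ni:2d}   ML: {e_ml:.4f}   MAP: {e_map:.4f}")
```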
