Bayesian statistics. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda
2 Frequentist vs Bayesian statistics In frequentist statistics the data are modeled as realizations from a distribution that depends on deterministic parameters. In Bayesian statistics the parameters themselves are modeled as random variables. This allows us to quantify our prior uncertainty and to incorporate additional information.
3 Learning Bayesian models Conjugate priors Bayesian estimators
4 Prior distribution and likelihood The data x ∈ R^n are a realization of a random vector X, which depends on a vector of parameters Θ. Modeling choices: Prior distribution: distribution of Θ encoding our uncertainty about the model before seeing the data. Likelihood: conditional distribution of X given Θ.
5 Posterior distribution The posterior distribution is the conditional distribution of Θ given X. Evaluating the posterior at the data x allows us to update our uncertainty about Θ using the data.
6 Bernoulli distribution Goal: estimate a Bernoulli parameter from iid data. We consider two different Bayesian estimators Θ1 and Θ2: 1. Θ1 is a conservative estimator with a uniform prior pdf, f_{Θ1}(θ) = 1 for 0 ≤ θ ≤ 1 and 0 otherwise. 2. Θ2 has a prior pdf skewed towards 1, f_{Θ2}(θ) = 2θ for 0 ≤ θ ≤ 1 and 0 otherwise.
7 Prior distributions
8 Bernoulli distribution: likelihood The data are assumed to be iid, so the likelihood is p_{X|Θ}(x | θ) = θ^{n1} (1 − θ)^{n0}, where n0 is the number of zeros and n1 the number of ones.
10 Bernoulli distribution: posterior distribution
f_{Θ1|X}(θ | x) = f_{Θ1}(θ) p_{X|Θ1}(x | θ) / p_X(x)
= f_{Θ1}(θ) p_{X|Θ1}(x | θ) / ∫_u f_{Θ1}(u) p_{X|Θ1}(x | u) du
= θ^{n1} (1 − θ)^{n0} / ∫_u u^{n1} (1 − u)^{n0} du
= θ^{n1} (1 − θ)^{n0} / β(n1 + 1, n0 + 1)
where β(a, b) := ∫_u u^{a−1} (1 − u)^{b−1} du
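As a quick numerical sanity check (a sketch, not part of the lecture), the normalizing constant of the unnormalized posterior θ^{n1} (1 − θ)^{n0} can be compared against the closed-form beta function β(n1 + 1, n0 + 1); the counts n1 = 4, n0 = 7 are arbitrary illustrative values:

```python
import math

# Check that integrating theta^n1 (1 - theta)^n0 over [0, 1] matches the
# beta function beta(a, b) = Gamma(a) Gamma(b) / Gamma(a + b) with
# a = n1 + 1, b = n0 + 1, as used to normalize the posterior.

def beta_function(a, b):
    # Computed through log-gamma for numerical stability
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def normalizing_constant(n1, n0, steps=200_000):
    # Midpoint-rule integral of theta^n1 (1 - theta)^n0 over [0, 1]
    h = 1.0 / steps
    return sum(((k + 0.5) * h) ** n1 * (1 - (k + 0.5) * h) ** n0
               for k in range(steps)) * h

n1, n0 = 4, 7  # hypothetical counts of ones and zeros
numeric = normalizing_constant(n1, n0)
closed_form = beta_function(n1 + 1, n0 + 1)
print(numeric, closed_form)
```

The two printed values should agree to many decimal places, confirming that the posterior under the uniform prior is exactly a beta density.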
15 Bernoulli distribution: posterior distribution
f_{Θ2|X}(θ | x) = f_{Θ2}(θ) p_{X|Θ2}(x | θ) / p_X(x)
= θ^{n1+1} (1 − θ)^{n0} / ∫_u u^{n1+1} (1 − u)^{n0} du
= θ^{n1+1} (1 − θ)^{n0} / β(n1 + 2, n0 + 1)
where β(a, b) := ∫_u u^{a−1} (1 − u)^{b−1} du
19 Bernoulli distribution: [plots of the posterior pdfs for n0 = 1, n0 = 3 and n0 = 91 (the n1 values did not survive transcription); legend: posterior mean (uniform prior), posterior mean (skewed prior), ML estimator]
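The three estimators on the plot slides have simple closed forms: the posterior mean of a Beta(a, b) variable is a / (a + b), so the uniform-prior posterior Beta(n1 + 1, n0 + 1) gives (n1 + 1) / (n + 2), the skewed-prior posterior Beta(n1 + 2, n0 + 1) gives (n1 + 2) / (n + 3), and the ML estimate is n1 / n. Since the n1 values on the slides did not survive transcription, the counts below are hypothetical:

```python
# Compare the two posterior-mean estimates with the ML estimate of the
# Bernoulli parameter; the (n0, n1) pairs are illustrative placeholders.

def estimates(n0, n1):
    n = n0 + n1
    return {
        "posterior mean (uniform prior)": (n1 + 1) / (n + 2),  # Beta(n1+1, n0+1)
        "posterior mean (skewed prior)":  (n1 + 2) / (n + 3),  # Beta(n1+2, n0+1)
        "ML estimator": n1 / n,
    }

for n0, n1 in [(1, 3), (3, 9), (91, 9)]:  # hypothetical counts
    print(n0, n1, estimates(n0, n1))
```

For small samples the prior visibly pulls the estimates away from n1 / n; as n grows all three converge, which is the behavior the plots illustrate.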
22 Learning Bayesian models Conjugate priors Bayesian estimators
23 Beta distribution The pdf of a beta distribution with parameters a and b is defined as f_β(θ; a, b) := θ^{a−1} (1 − θ)^{b−1} / β(a, b) for 0 ≤ θ ≤ 1 and 0 otherwise, where β(a, b) := ∫_u u^{a−1} (1 − u)^{b−1} du
24 Learning a Bernoulli distribution The first prior is beta with parameters a = 1 and b = 1. The second prior is beta with parameters a = 2 and b = 1. The posteriors are beta with parameters a = n1 + 1, b = n0 + 1 and a = n1 + 2, b = n0 + 1, respectively.
25 Conjugate priors A conjugate family of distributions for a certain likelihood satisfies the following property: If the prior belongs to the family, the posterior also belongs to the family Beta distributions are conjugate priors when the likelihood is binomial
26 The beta distribution is conjugate to the binomial likelihood Θ is beta with parameters a and b; X is binomial with parameters n and Θ.
f_{Θ|X}(θ | x) = f_Θ(θ) p_{X|Θ}(x | θ) / p_X(x)
= f_Θ(θ) p_{X|Θ}(x | θ) / ∫_u f_Θ(u) p_{X|Θ}(x | u) du
= θ^{a−1} (1 − θ)^{b−1} C(n, x) θ^x (1 − θ)^{n−x} / ∫_u u^{a−1} (1 − u)^{b−1} C(n, x) u^x (1 − u)^{n−x} du
= θ^{x+a−1} (1 − θ)^{n−x+b−1} / ∫_u u^{x+a−1} (1 − u)^{n−x+b−1} du
= f_β(θ; x + a, n − x + b)
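This conjugacy can be verified numerically (a sketch with illustrative parameter values): normalizing the product prior × likelihood on a grid should reproduce the Beta(x + a, n − x + b) pdf.

```python
import math

# Numerical check of beta/binomial conjugacy: the normalized product
# f_Theta(theta) * p_{X|Theta}(x|theta) should equal the Beta(x+a, n-x+b) pdf.

def beta_fn(a, b):
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))

def beta_pdf(theta, a, b):
    return theta ** (a - 1) * (1 - theta) ** (b - 1) / beta_fn(a, b)

a, b, n, x = 2.0, 3.0, 20, 7  # illustrative prior parameters and data
grid = [(k + 0.5) / 10_000 for k in range(10_000)]
unnorm = [beta_pdf(t, a, b) * math.comb(n, x) * t ** x * (1 - t) ** (n - x)
          for t in grid]
z = sum(unnorm) / len(grid)   # numerical normalizing constant p_X(x)
posterior = [u / z for u in unnorm]
target = [beta_pdf(t, x + a, n - x + b) for t in grid]
max_err = max(abs(p - q) for p, q in zip(posterior, target))
print(max_err)
```

The maximum pointwise discrepancy is tiny, limited only by the quadrature, so the posterior is indeed in the beta family.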
32 Poll in New Mexico 449 participants: 227 intend to vote for Clinton and 202 for Trump. What is the probability that Trump wins in New Mexico? Assumptions: the fraction of Trump voters is modeled as a random variable Θ; poll participants are selected uniformly at random with replacement; the number of Trump voters in the poll is binomial with parameters n = 449 and p = Θ
33 Poll in New Mexico The prior is uniform, so beta with parameters a = 1 and b = 1. The likelihood is binomial. The posterior is beta with parameters a = 202 + 1 = 203 and b = 449 − 202 + 1 = 248. The probability that Trump wins in New Mexico is the probability that Θ given the data is greater than 0.5.
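The winning probability can be estimated by Monte Carlo from the posterior (a sketch under the stated model: uniform prior, binomial likelihood with n = 449 and x = 202 Trump voters, hence posterior Beta(203, 248); this is not the lecture's own code):

```python
import random

# Monte Carlo estimate of P(Theta > 0.5 | data) under the Beta(203, 248)
# posterior implied by the binomial model with a uniform prior.

random.seed(0)
a, b = 202 + 1, 449 - 202 + 1       # posterior parameters
draws = 200_000
p_trump_wins = sum(random.betavariate(a, b) > 0.5
                   for _ in range(draws)) / draws
print(p_trump_wins)
```

The posterior mean is 203 / 451 ≈ 0.45, so most of the posterior mass lies below 0.5 and the estimated winning probability is small.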
34 Poll in New Mexico [plot of the posterior pdf; of the probabilities shown, only the value 11.4% survived transcription]
35 Learning Bayesian models Conjugate priors Bayesian estimators
36 Bayesian estimators What estimator should we use? Two main options: the posterior mean and the posterior mode
37 Posterior mean The posterior mean θ_MMSE(x) := E(Θ | X = x) is the minimum mean-square-error (MMSE) estimate: for any arbitrary estimator θ_other(x), E((θ_other(X) − Θ)²) ≥ E((θ_MMSE(X) − Θ)²)
38 Posterior mean
E((θ_other(X) − Θ)² | X = x)
= E((θ_other(X) − θ_MMSE(X) + θ_MMSE(X) − Θ)² | X = x)
= (θ_other(x) − θ_MMSE(x))² + E((θ_MMSE(X) − Θ)² | X = x) + 2 (θ_other(x) − θ_MMSE(x)) (θ_MMSE(x) − E(Θ | X = x))
= (θ_other(x) − θ_MMSE(x))² + E((θ_MMSE(X) − Θ)² | X = x)
where the cross term vanishes because θ_MMSE(x) = E(Θ | X = x)
42 Posterior mean By iterated expectation,
E((θ_other(X) − Θ)²) = E(E((θ_other(X) − Θ)² | X))
= E((θ_other(X) − θ_MMSE(X))²) + E(E((θ_MMSE(X) − Θ)² | X))
= E((θ_other(X) − θ_MMSE(X))²) + E((θ_MMSE(X) − Θ)²)
≥ E((θ_MMSE(X) − Θ)²)
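The MMSE property can be illustrated by simulation (a sketch with a uniform prior): drawing Θ ~ Uniform(0, 1) and X ~ Binomial(n, Θ), the posterior mean (x + 1) / (n + 2) should achieve a lower mean-square error than the ML estimate x / n.

```python
import random

# Monte Carlo comparison of the posterior mean with the ML estimator under
# a uniform prior on Theta and a Binomial(n, Theta) observation.

random.seed(1)
n, trials = 10, 100_000
se_mmse = se_ml = 0.0
for _ in range(trials):
    theta = random.random()                              # Theta ~ Uniform(0, 1)
    x = sum(random.random() < theta for _ in range(n))   # X ~ Binomial(n, Theta)
    se_mmse += ((x + 1) / (n + 2) - theta) ** 2          # posterior mean error
    se_ml += (x / n - theta) ** 2                        # ML error
print(se_mmse / trials, se_ml / trials)
```

In this setting the exact Bayes risks are 1 / (6(n + 2)) for the posterior mean and 1 / (6n) for the ML estimate, so the simulated gap matches the theorem.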
46 Bernoulli distribution: [posterior plots revisited for n0 = 1, n0 = 3 and n0 = 91 (the n1 values did not survive transcription), marking the posterior mean (uniform prior), the posterior mean (skewed prior) and the ML estimator]
49 Posterior mode The maximum-a-posteriori (MAP) estimator is the mode of the posterior distribution: θ_MAP(x) := arg max_θ p_{Θ|X}(θ | x) if Θ is discrete, and θ_MAP(x) := arg max_θ f_{Θ|X}(θ | x) if Θ is continuous
50 Maximum-likelihood estimator If the prior is uniform, the ML estimator coincides with the MAP estimator:
arg max_θ f_{Θ|X}(θ | x) = arg max_θ f_Θ(θ) f_{X|Θ}(x | θ) / ∫_u f_Θ(u) f_{X|Θ}(x | u) du
= arg max_θ f_{X|Θ}(x | θ)
= arg max_θ L_x(θ)
Note that uniform priors are only well defined over bounded domains
55 Probability of error If Θ is discrete, the MAP estimator minimizes the probability of error: for any arbitrary estimator θ_other(x), P(θ_other(X) ≠ Θ) ≥ P(θ_MAP(X) ≠ Θ)
56 Probability of error
P(Θ = θ_other(X)) = ∫_x f_X(x) P(Θ = θ_other(x) | X = x) dx
= ∫_x f_X(x) p_{Θ|X}(θ_other(x) | x) dx
≤ ∫_x f_X(x) p_{Θ|X}(θ_MAP(x) | x) dx
= P(Θ = θ_MAP(X))
61 Sending bits Model for a communication channel: the signal Θ encodes a single bit. Prior knowledge indicates that a 0 is 3 times more likely than a 1: p_Θ(1) = 1/4, p_Θ(0) = 3/4. The channel is noisy, so we send the signal n times. At the receiver we observe X_i = Θ + Z_i, 1 ≤ i ≤ n, where the Z_i are iid standard Gaussians
62 Sending bits: ML estimator The likelihood is L_x(θ) = ∏_{i=1}^n f_{X_i|Θ}(x_i | θ) = ∏_{i=1}^n (1/√(2π)) e^{−(x_i − θ)²/2}, so the log-likelihood is log L_x(θ) = −Σ_{i=1}^n (x_i − θ)²/2 − (n/2) log 2π
63 Sending bits: ML estimator θ_ML(x) = 1 if
log L_x(1) = −Σ_{i=1}^n (x_i² − 2 x_i + 1)/2 − (n/2) log 2π ≥ −Σ_{i=1}^n x_i²/2 − (n/2) log 2π = log L_x(0)
Equivalently, θ_ML(x) = 1 if (1/n) Σ_{i=1}^n x_i > 1/2, and 0 otherwise
64 Sending bits: ML estimator The probability of error is
P(Θ ≠ θ_ML(X)) = P(Θ ≠ θ_ML(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_ML(X) | Θ = 1) P(Θ = 1)
= P((1/n) Σ_{i=1}^n X_i > 1/2 | Θ = 0) P(Θ = 0) + P((1/n) Σ_{i=1}^n X_i < 1/2 | Θ = 1) P(Θ = 1)
= Q(√n / 2)
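A simulation sketch can confirm the closed form: decide 1 when the sample mean exceeds 1/2, and compare the empirical error rate with Q(√n / 2), where Q(t) is the standard Gaussian tail probability (computable from the complementary error function).

```python
import math
import random

# Simulate the channel and the ML decision rule; the error rate should
# match Q(sqrt(n) / 2) regardless of the prior, since both conditional
# error probabilities equal Q(sqrt(n) / 2).

def Q(t):
    # Standard Gaussian tail: Q(t) = P(N(0,1) > t)
    return 0.5 * math.erfc(t / math.sqrt(2))

random.seed(2)
n, trials = 9, 200_000
errors = 0
for _ in range(trials):
    theta = 0 if random.random() < 0.75 else 1          # p(0) = 3/4, p(1) = 1/4
    mean = sum(theta + random.gauss(0, 1) for _ in range(n)) / n
    decision = 1 if mean > 0.5 else 0                   # ML threshold at 1/2
    errors += decision != theta
print(errors / trials, Q(math.sqrt(n) / 2))
```

The two printed numbers agree up to Monte Carlo noise.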
68 Sending bits: MAP estimator The logarithm of the posterior is
log p_{Θ|X}(θ | x) = log( ∏_{i=1}^n f_{X_i|Θ}(x_i | θ) p_Θ(θ) / f_X(x) )
= Σ_{i=1}^n log f_{X_i|Θ}(x_i | θ) + log p_Θ(θ) − log f_X(x)
= −Σ_{i=1}^n (x_i² − 2 x_i θ + θ²)/2 − (n/2) log 2π + log p_Θ(θ) − log f_X(x)
72 Sending bits: MAP estimator θ_MAP(x) = 1 if
log p_{Θ|X}(1 | x) + log f_X(x) = −Σ_{i=1}^n (x_i² − 2 x_i + 1)/2 − (n/2) log 2π − log 4
≥ −Σ_{i=1}^n x_i²/2 − (n/2) log 2π − log 4 + log 3 = log p_{Θ|X}(0 | x) + log f_X(x)
Equivalently, θ_MAP(x) = 1 if (1/n) Σ_{i=1}^n x_i > 1/2 + log 3 / n, and 0 otherwise
73 Sending bits: MAP estimator The probability of error is
P(Θ ≠ θ_MAP(X)) = P(Θ ≠ θ_MAP(X) | Θ = 0) P(Θ = 0) + P(Θ ≠ θ_MAP(X) | Θ = 1) P(Θ = 1)
= P((1/n) Σ_{i=1}^n X_i > 1/2 + log 3 / n | Θ = 0) P(Θ = 0) + P((1/n) Σ_{i=1}^n X_i < 1/2 + log 3 / n | Θ = 1) P(Θ = 1)
= (3/4) Q(√n / 2 + log 3 / √n) + (1/4) Q(√n / 2 − log 3 / √n)
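Comparing the two rules by simulation (a sketch with an illustrative n) shows the effect of the shifted threshold 1/2 + log 3 / n: the MAP rule trades a few extra errors when Θ = 1 for many fewer when Θ = 0, which is the more likely bit, so its overall error rate is lower and matches the closed form above.

```python
import math
import random

# Simulate ML (threshold 1/2) against MAP (threshold 1/2 + log(3)/n) under
# the prior p(0) = 3/4, p(1) = 1/4, and check the MAP error formula.

def Q(t):
    return 0.5 * math.erfc(t / math.sqrt(2))

random.seed(3)
n, trials = 9, 200_000
t_map = 0.5 + math.log(3) / n
err_ml = err_map = 0
for _ in range(trials):
    theta = 0 if random.random() < 0.75 else 1
    mean = sum(theta + random.gauss(0, 1) for _ in range(n)) / n
    err_ml += (1 if mean > 0.5 else 0) != theta
    err_map += (1 if mean > t_map else 0) != theta
closed_form = (0.75 * Q(math.sqrt(n) / 2 + math.log(3) / math.sqrt(n))
               + 0.25 * Q(math.sqrt(n) / 2 - math.log(3) / math.sqrt(n)))
print(err_ml / trials, err_map / trials, closed_form)
```

The simulated MAP error rate tracks the closed form and stays below the ML error rate, as the plot on the next slide illustrates across n.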
77 Sending bits: probability of error [plot of the probability of error of the ML and MAP estimators as a function of n]
More informationCS 361: Probability & Statistics
October 17, 2017 CS 361: Probability & Statistics Inference Maximum likelihood: drawbacks A couple of things might trip up max likelihood estimation: 1) Finding the maximum of some functions can be quite
More informationMultiple regression. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar
Multiple regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Multiple regression 1 / 36 Previous two lectures Linear and logistic
More informationCS 361: Probability & Statistics
March 14, 2018 CS 361: Probability & Statistics Inference The prior From Bayes rule, we know that we can express our function of interest as Likelihood Prior Posterior The right hand side contains the
More informationUse of the likelihood principle in physics. Statistics II
Use of the likelihood principle in physics Statistics II 1 2 3 + Bayesians vs Frequentists 4 Why ML does work? hypothesis observation 5 6 7 8 9 10 11 ) 12 13 14 15 16 Fit of Histograms corresponds This
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationRandom Processes. DS GA 1002 Probability and Statistics for Data Science.
Random Processes DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Aim Modeling quantities that evolve in time (or space)
More informationParameter Estimation. Industrial AI Lab.
Parameter Estimation Industrial AI Lab. Generative Model X Y w y = ω T x + ε ε~n(0, σ 2 ) σ 2 2 Maximum Likelihood Estimation (MLE) Estimate parameters θ ω, σ 2 given a generative model Given observed
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationCOMP90051 Statistical Machine Learning
COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide
More informationExpectation Maximization
Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationAarti Singh. Lecture 2, January 13, Reading: Bishop: Chap 1,2. Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell
Machine Learning 0-70/5 70/5-78, 78, Spring 00 Probability 0 Aarti Singh Lecture, January 3, 00 f(x) µ x Reading: Bishop: Chap, Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell Announcements Homework
More informationSTAT J535: Chapter 5: Classes of Bayesian Priors
STAT J535: Chapter 5: Classes of Bayesian Priors David B. Hitchcock E-Mail: hitchcock@stat.sc.edu Spring 2012 The Bayesian Prior A prior distribution must be specified in a Bayesian analysis. The choice
More informationCOS513 LECTURE 8 STATISTICAL CONCEPTS
COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions
More informationLinear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation
More informationBayesian Statistics Part III: Building Bayes Theorem Part IV: Prior Specification
Bayesian Statistics Part III: Building Bayes Theorem Part IV: Prior Specification Michael Anderson, PhD Hélène Carabin, DVM, PhD Department of Biostatistics and Epidemiology The University of Oklahoma
More informationVector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.
Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013
Bayesian Methods Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2013 1 What about prior n Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you
More informationCOMP 551 Applied Machine Learning Lecture 19: Bayesian Inference
COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted
More informationINTRODUCTION TO BAYESIAN INFERENCE PART 2 CHRIS BISHOP
INTRODUCTION TO BAYESIAN INFERENCE PART 2 CHRIS BISHOP Personal Healthcare Revolution Electronic health records (CFH) Personal genomics (DeCode, Navigenics, 23andMe) X-prize: first $10k human genome technology
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationExpectation Propagation Algorithm
Expectation Propagation Algorithm 1 Shuang Wang School of Electrical and Computer Engineering University of Oklahoma, Tulsa, OK, 74135 Email: {shuangwang}@ou.edu This note contains three parts. First,
More informationIntroduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf
1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample
More informationParametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory
Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007
More informationProbabilistic modeling. The slides are closely adapted from Subhransu Maji s slides
Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework
More information