Robust Bayesian Regression
Readings: Hoff Chapter 9; West (JRSS B, 1984); Fúquene, Pérez & Pericchi (2015).

Duke University, November 17, 2016
Body Fat Data: Intervals with All Data

Response: % body fat; predictor: waist (abdomen) circumference.

[Figure: 95% confidence and prediction intervals for bodyfat.lm (all data) and for bodyfat.lm2, each plotting observed values, the fitted line, and confidence and prediction bands against Abdomen, with xbar marked.]

Which analysis do we use? With case 39 or not, or something different?
Cook's Distance

[Figure: Residuals vs Leverage plot, showing standardized residuals against leverage with Cook's distance contours.]
Options for Handling Influential Cases

- Are there scientific grounds for eliminating the case?
- Test if the case has a different mean than the population.
- Report results with and without the case.
- Model averaging to account for model uncertainty?
- Full model: $Y = X\beta + I_n \delta + \epsilon$
- $2^n$ submodels: $\gamma_i = 0 \Leftrightarrow \delta_i = 0$. If $\gamma_i = 1$ then case $i$ has a different mean (mean-shift outliers).
Mean Shift = Variance Inflation

Model: $Y = X\beta + I_n \delta + \epsilon$ with prior

$\delta_i \mid \gamma_i \sim N(0, V\sigma^2\gamma_i), \qquad \gamma_i \sim \mathrm{Ber}(\pi)$

Then $\epsilon_i$ given $\sigma^2$ is independent of $\delta_i$, and

$\epsilon_i + \delta_i \mid \sigma^2 \sim \begin{cases} N(0, \sigma^2) & \text{w.p. } 1-\pi \\ N(0, \sigma^2(1+V)) & \text{w.p. } \pi \end{cases}$

i.e. the model $Y = X\beta + \epsilon$ with variance inflation $V + 1 = K = 7$ in the paper by Hoeting et al. (package BMA).
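The mean-shift/variance-inflation equivalence is easy to check by simulation. The following Python sketch (variable names and settings are illustrative, with $V + 1 = 7$ as above) compares the sample variance of $\epsilon_i + \delta_i$ with the variance of the two-component mixture:

```python
import numpy as np

# Mean-shift model: eps_i + delta_i, where delta_i ~ N(0, V*sigma^2) w.p. pi
# and delta_i = 0 otherwise.  Marginally this is the two-component mixture
# N(0, sigma^2) w.p. (1 - pi)  and  N(0, sigma^2*(1 + V)) w.p. pi.
rng = np.random.default_rng(42)
n, sigma2, V, pi = 1_000_000, 1.0, 6.0, 0.02   # V + 1 = K = 7

gamma = rng.random(n) < pi                      # outlier indicators
eps = rng.normal(0.0, np.sqrt(sigma2), n)
delta = np.where(gamma, rng.normal(0.0, np.sqrt(V * sigma2), n), 0.0)
shifted = eps + delta                           # mean-shift errors

theoretical_var = (1 - pi) * sigma2 + pi * sigma2 * (1 + V)
print(shifted.var(), theoretical_var)           # close for large n
```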
Simultaneous Outlier and Variable Selection

MC3.REG(all.y = bodyfat$bodyfat, all.x = as.matrix(bodyfat$abdom),
        num.its = 10000, outliers = TRUE)

Model parameters: PI = 0.02, K = 7, nu = 2.58, lambda = 0.28, phi = ...

[Output: the best 5 models by cumulative posterior probability, with 'x' marking whether the predictor (all.x 1) and each individual outlier indicator are included in each model; the numerical posterior probabilities did not survive transcription.]
Change Error Assumptions

$Y_i \overset{ind}{\sim} t(\nu, \alpha + \beta x_i, 1/\phi)$

$L(\alpha, \beta, \phi) \propto \prod_{i=1}^n \phi^{1/2} \left(1 + \frac{\phi (y_i - \alpha - \beta x_i)^2}{\nu}\right)^{-\frac{\nu+1}{2}}$

Use prior $p(\alpha, \beta, \phi) \propto 1/\phi$.

Posterior distribution:

$p(\alpha, \beta, \phi \mid Y) \propto \phi^{n/2 - 1} \prod_{i=1}^n \left(1 + \frac{\phi (y_i - \alpha - \beta x_i)^2}{\nu}\right)^{-\frac{\nu+1}{2}}$
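This posterior is easy to evaluate numerically. The Python sketch below (function names are my own; scipy is used only as a cross-check) computes the log posterior up to an additive constant and verifies it against the Student-t density:

```python
import numpy as np
from scipy import stats

# Log posterior (up to an additive constant) for the t-error model with
# prior p(alpha, beta, phi) proportional to 1/phi.
def log_posterior(alpha, beta, phi, y, x, nu=9.0):
    e2 = (y - alpha - beta * x) ** 2
    return (len(y) / 2 - 1) * np.log(phi) \
        - (nu + 1) / 2 * np.sum(np.log1p(phi * e2 / nu))

# Cross-check against scipy's t density: the two computations should differ
# only by a constant that does not depend on (alpha, beta, phi).
rng = np.random.default_rng(3)
y = rng.normal(2.0, 1.0, 50)
x = rng.uniform(0, 1, 50)

def via_scipy(alpha, beta, phi, nu=9.0):
    scale = 1.0 / np.sqrt(phi)   # t(nu, mean, 1/phi): scale = phi^(-1/2)
    return stats.t(df=nu, loc=alpha + beta * x, scale=scale).logpdf(y).sum() \
        - np.log(phi)            # add the log prior

d1 = log_posterior(1.0, 0.5, 2.0, y, x) - via_scipy(1.0, 0.5, 2.0)
d2 = log_posterior(0.0, 1.0, 0.3, y, x) - via_scipy(0.0, 1.0, 0.3)
print(abs(d1 - d2))   # ~0: same function up to an additive constant
```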
Bounded Influence — West 1984 (and references within)

Treat $\sigma^2$ as given; the influence of individual observations on the posterior distribution of $\beta$ in the model with $E[Y_i] = x_i^T \beta$ is investigated through the score function

$\frac{d}{d\beta} \log p(\beta \mid Y) = \frac{d}{d\beta} \log p(\beta) + \sum_{i=1}^n x_i\, g(y_i - x_i^T \beta)$

where

$g(\epsilon) = -\frac{d \log p(\epsilon)}{d\epsilon}$

is the influence function of the error distribution (unimodal, continuous, differentiable, symmetric).

An outlying observation $y_j$ is accommodated if the posterior distribution $p(\beta \mid Y)$ converges to $p(\beta \mid Y_{(j)})$ for all $\beta$ as $|y_j| \to \infty$. This requires error models with influence functions that go to zero, such as the Student t (O'Hagan, 1979).
Choice of df

The score function for a t with $\alpha$ degrees of freedom has turning points at $\pm\sqrt{\alpha}$.

[Figure: the influence function g(eps, 9) plotted against eps.]

$g'(\epsilon)$ is negative when $\epsilon^2 > \alpha$ (standardized errors): the contribution of such an observation to the information matrix is negative, and the observation is doubtful.

Suggestion: take $\alpha = 8$ or $\alpha = 9$ to reject errors larger than about $\sqrt{8} \approx 2.8$ or $3$ sd.
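For the standardized t density $p(\epsilon) \propto (1 + \epsilon^2/\nu)^{-(\nu+1)/2}$, the influence function works out to $g(\epsilon) = (\nu+1)\epsilon/(\nu + \epsilon^2)$. A small Python check (names are illustrative) confirms the turning point at $\sqrt{\nu}$ and the redescending behaviour:

```python
import numpy as np

# Influence function of a standardized Student-t error with nu df:
# g(eps) = (nu + 1) * eps / (nu + eps**2), from g = -d/d(eps) log p(eps).
def g(eps, nu):
    return (nu + 1) * eps / (nu + eps**2)

nu = 9.0
eps = np.linspace(0, 20, 20001)
turning = eps[np.argmax(g(eps, nu))]
print(turning)          # maximum at sqrt(nu) = 3: influence redescends beyond it
print(g(100.0, nu))     # large standardized errors have vanishing influence
```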
Scale-Mixtures of Normal Representation

$Z_i \overset{iid}{\sim} t(\nu, 0, \sigma^2)$

is equivalent to

$Z_i \mid \lambda_i \overset{ind}{\sim} N(0, \sigma^2/\lambda_i), \qquad \lambda_i \overset{iid}{\sim} G(\nu/2, \nu/2)$

Integrate out the latent $\lambda$'s to obtain the marginal distribution.
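The representation can be verified by simulation. This Python sketch (settings are illustrative) draws from the scale mixture and compares the result with a $t_\nu$ distribution via the Kolmogorov–Smirnov statistic:

```python
import numpy as np
from scipy import stats

# Scale-mixture draw: lambda_i ~ Gamma(nu/2, rate nu/2), then
# Z_i | lambda_i ~ N(0, sigma^2 / lambda_i); marginally Z_i / sigma ~ t_nu.
rng = np.random.default_rng(0)
nu, sigma, n = 9.0, 1.0, 200_000

lam = rng.gamma(shape=nu / 2, scale=2 / nu, size=n)   # rate nu/2 -> scale 2/nu
z = rng.normal(0.0, sigma / np.sqrt(lam))

# Compare the simulated sample with the t_nu cdf
ks = stats.kstest(z / sigma, stats.t(df=nu).cdf)
print(ks.statistic)   # small: consistent with a t_9 distribution
```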
Latent Variable Model

$Y_i \mid \alpha, \beta, \phi, \lambda \overset{ind}{\sim} N\left(\alpha + \beta x_i,\ \frac{1}{\phi \lambda_i}\right), \qquad \lambda_i \overset{iid}{\sim} G(\nu/2, \nu/2), \qquad p(\alpha, \beta, \phi) \propto 1/\phi$

Joint posterior distribution:

$p(\alpha, \beta, \phi, \lambda_1, \ldots, \lambda_n \mid Y) \propto \phi^{n/2} \exp\left\{-\frac{\phi}{2} \sum_i \lambda_i (y_i - \alpha - \beta x_i)^2\right\} \times \phi^{-1} \times \prod_{i=1}^n \lambda_i^{1/2}\, \lambda_i^{\nu/2 - 1} \exp(-\lambda_i \nu/2)$
Programs

- BUGS: Bayesian inference Using Gibbs Sampling
  - WinBUGS is the Windows implementation
    - can be called from R with the R2WinBUGS package
    - can be run on any Intel-based computer using VMware or wine
  - OpenBUGS: open-source version of WinBUGS
    - LinBUGS is the Linux implementation of OpenBUGS
- JAGS: Just Another Gibbs Sampler is an alternative program that uses the (almost) same model description as BUGS (Linux, Mac OS X, Windows)
  - can be called from R using library(R2jags) or library(rjags)
- Both include more than just Gibbs sampling
JAGS

- Model
- Data
- Initial values (optional)

May do this through ordinary text files, or use the functions in R2jags to specify the model, data, and initial values, then call jags.
Model Specification via R2jags

rr.model = function() {
  for (i in 1:n) {
    mu[i] <- alpha0 + alpha1*(X[i] - Xbar)
    lambda[i] ~ dgamma(9/2, 9/2)
    prec[i] <- phi*lambda[i]
    Y[i] ~ dnorm(mu[i], prec[i])
  }
  phi ~ dgamma(1.0E-6, 1.0E-6)
  alpha0 ~ dnorm(0, 1.0E-6)
  alpha1 ~ dnorm(0, 1.0E-6)
}
Notes on Models

- Distributions of stochastic nodes are specified using ~
- Assignment of deterministic nodes uses <- (NOT =)
- JAGS allows expressions as arguments in distributions (WinBUGS does not)
- Normal distributions are parameterized using precisions, so dnorm(0, 1.0E-6) is a N(0, 10^6)
- Uses a for-loop structure as in R for the model description, but is coded in C++, so it is fast!
Data

A list or rectangular data structure for all data and summaries of data used in the model:

bf.data = list(Y = bodyfat$bodyfat, X = bodyfat$Abdomen)
bf.data$n = length(bf.data$Y)
bf.data$Xbar = mean(bf.data$X)
Specifying which Parameters to Save

The parameters to be monitored and returned to R are specified with the variable parameters:

parameters = c("beta0", "beta1", "sigma", "mu34", "y34", "lambda[39]")

- All of the above (except lambda) are calculated from the other parameters (see the R code for their definitions).
- lambda[39] saves only the 39th case of λ.
- To save a whole vector (for example, all λ's), just give the vector name.
Running jags from R

bf.sim = jags(bf.data, inits = NULL, par = parameters,
              model = rr.model, n.chains = 2, n.iter = 5000)
Output

[Table: posterior mean, sd, and 2.5%, 50%, and 97.5% quantiles for beta0, beta1, sigma, mu34, y34, and lambda[39]; the numerical values did not survive transcription.]

95% HPD interval for expected body fat: (14.5, 15.8)
95% HPD interval for body fat: (5.1, 25.3)
Comparison

- 95% probability interval for β is (0.60, 0.71) with $t_9$ errors
- 95% confidence interval for β is (0.58, 0.69) (all data, normal model)
- 95% confidence interval for β is (0.61, 0.73) (normal model without case 39)
- The robust results are intermediate, without having to remove any observations
- Case 39 is down-weighted by $\lambda_{39}$
Full Conditional for $\lambda_j$

$p(\lambda_j \mid \text{rest}, Y) \propto p(\alpha, \beta, \phi, \lambda_1, \ldots, \lambda_n \mid Y)$

$\propto \phi^{n/2-1} \exp\left\{-\frac{\phi}{2} \sum_i \lambda_i (y_i - \alpha - \beta x_i)^2\right\} \prod_{i=1}^n \lambda_i^{\frac{\nu+1}{2} - 1} \exp\left(-\lambda_i \frac{\nu}{2}\right)$

Ignore all terms except those that involve $\lambda_j$:

$\lambda_j \mid \text{rest}, Y \sim G\left(\frac{\nu+1}{2},\ \frac{\phi (y_j - \alpha - \beta x_j)^2 + \nu}{2}\right)$
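The full conditional above, together with a standard weighted-least-squares update for $(\alpha, \beta)$ and a gamma update for $\phi$, gives a complete Gibbs sampler for the latent variable model. The Python sketch below is illustrative rather than the course code; all names and the synthetic data are made up:

```python
import numpy as np

# Gibbs sampler sketch for Y_i ~ N(alpha + beta*x_i, 1/(phi*lambda_i)),
# lambda_i ~ G(nu/2, nu/2), p(alpha, beta, phi) proportional to 1/phi.
def gibbs_t_regression(y, x, nu=9.0, n_iter=2000, seed=1):
    rng = np.random.default_rng(seed)
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    lam, phi, draws = np.ones(n), 1.0, []
    for it in range(n_iter):
        # (alpha, beta) | rest: weighted least-squares posterior
        XtLX = X.T @ (lam[:, None] * X)
        mean = np.linalg.solve(XtLX, X.T @ (lam * y))
        ab = rng.multivariate_normal(mean, np.linalg.inv(XtLX) / phi)
        resid2 = (y - X @ ab) ** 2
        # phi | rest ~ Gamma(n/2, rate sum(lambda_i * e_i^2)/2)
        phi = rng.gamma(n / 2, 1.0 / (0.5 * np.sum(lam * resid2)))
        # lambda_j | rest ~ Gamma((nu+1)/2, rate (phi*e_j^2 + nu)/2)
        lam = rng.gamma((nu + 1) / 2, 1.0 / (0.5 * (phi * resid2 + nu)))
        draws.append(ab)
    return np.array(draws)

# Synthetic check: recover the slope despite one gross outlier
rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 100)
y = 1.0 + 0.65 * x + rng.normal(0, 0.5, 100)
y[0] += 20.0                       # contaminate one case
draws = gibbs_t_regression(y, x)
print(draws[500:, 1].mean())       # posterior mean slope, near 0.65
```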
Weights

- Under the prior, $E[\lambda_i] = 1$.
- Under the posterior, cases with large residuals are down-weighted (approximately those with squared standardized residual bigger than $\nu$).

[Figure: posterior density of $\lambda_{39}$.]
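Since $\lambda_j \mid \text{rest} \sim G\big(\frac{\nu+1}{2}, \frac{\phi e_j^2 + \nu}{2}\big)$, the conditional posterior mean weight is $E[\lambda_j \mid \text{rest}] = (\nu+1)/(\nu + \phi e_j^2)$. In standardized units ($\phi e_j^2 = e^2$) this is easy to tabulate; the function name below is my own:

```python
# Conditional posterior mean weight for a case with standardized residual e_std,
# from lambda_j | rest ~ Gamma((nu+1)/2, (phi*e_j^2 + nu)/2):
#   E[lambda_j | rest] = (nu + 1) / (nu + e_std^2)
def posterior_weight(e_std, nu=9.0):
    return (nu + 1) / (nu + e_std**2)

print(posterior_weight(0.0))   # 10/9: a well-fit case keeps (slightly more than) full weight
print(posterior_weight(5.0))   # 10/34: a 5-sd case is heavily down-weighted
```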
Prior Distributions on Parameters

As a general recommendation, the prior distribution should have heavier tails than the likelihood:

- with $t_9$ errors, use a $t_\alpha$ prior with $\alpha < 9$
- also represent it via a scale mixture of normals
- the horseshoe, double Pareto, and Cauchy priors all have heavier tails

See the stack-loss code.
Measurement error as missing data: the case of epidemiologic assays Roderick J. Little Outline Discuss two related calibration topics where classical methods are deficient (A) Limit of quantification methods
More informationModule 11: Linear Regression. Rebecca C. Steorts
Module 11: Linear Regression Rebecca C. Steorts Announcements Today is the last class Homework 7 has been extended to Thursday, April 20, 11 PM. There will be no lab tomorrow. There will be office hours
More informationA discussion on multiple regression models
A discussion on multiple regression models In our previous discussion of simple linear regression, we focused on a model in which one independent or explanatory variable X was used to predict the value
More informationInference for a Population Proportion
Al Nosedal. University of Toronto. November 11, 2015 Statistical inference is drawing conclusions about an entire population based on data in a sample drawn from that population. From both frequentist
More information10. Exchangeability and hierarchical models Objective. Recommended reading
10. Exchangeability and hierarchical models Objective Introduce exchangeability and its relation to Bayesian hierarchical models. Show how to fit such models using fully and empirical Bayesian methods.
More informationLatent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent
Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary
More informationMath 423/533: The Main Theoretical Topics
Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)
More informationCollaborative Statistics: Symbols and their Meanings
OpenStax-CNX module: m16302 1 Collaborative Statistics: Symbols and their Meanings Susan Dean Barbara Illowsky, Ph.D. This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution
More informationPart 6: Multivariate Normal and Linear Models
Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of
More information9 Correlation and Regression
9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the
More informationBayesian Inference. Chapter 4: Regression and Hierarchical Models
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Advanced Statistics and Data Mining Summer School
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationACCOUNTING FOR INPUT-MODEL AND INPUT-PARAMETER UNCERTAINTIES IN SIMULATION. <www.ie.ncsu.edu/jwilson> May 22, 2006
ACCOUNTING FOR INPUT-MODEL AND INPUT-PARAMETER UNCERTAINTIES IN SIMULATION Slide 1 Faker Zouaoui Sabre Holdings James R. Wilson NC State University May, 006 Slide From American
More informationContents. 1 Introduction: what is overdispersion? 2 Recognising (and testing for) overdispersion. 1 Introduction: what is overdispersion?
Overdispersion, and how to deal with it in R and JAGS (requires R-packages AER, coda, lme4, R2jags, DHARMa/devtools) Carsten F. Dormann 07 December, 2016 Contents 1 Introduction: what is overdispersion?
More information36-463/663: Multilevel & Hierarchical Models
36-463/663: Multilevel & Hierarchical Models (P)review: in-class midterm Brian Junker 132E Baker Hall brian@stat.cmu.edu 1 In-class midterm Closed book, closed notes, closed electronics (otherwise I have
More informationGibbs Sampling in Endogenous Variables Models
Gibbs Sampling in Endogenous Variables Models Econ 690 Purdue University Outline 1 Motivation 2 Identification Issues 3 Posterior Simulation #1 4 Posterior Simulation #2 Motivation In this lecture we take
More informationModule 4: Bayesian Methods Lecture 5: Linear regression
1/28 The linear regression model Module 4: Bayesian Methods Lecture 5: Linear regression Peter Hoff Departments of Statistics and Biostatistics University of Washington 2/28 The linear regression model
More informationLecture 18: Simple Linear Regression
Lecture 18: Simple Linear Regression BIOS 553 Department of Biostatistics University of Michigan Fall 2004 The Correlation Coefficient: r The correlation coefficient (r) is a number that measures the strength
More informationExample using R: Heart Valves Study
Example using R: Heart Valves Study Goal: Show that the thrombogenicity rate (TR) is less than two times the objective performance criterion R and WinBUGS Examples p. 1/27 Example using R: Heart Valves
More informationModel Choice. Hoff Chapter 9. Dec 8, 2010
Model Choice Hoff Chapter 9 Dec 8, 2010 Topics Variable Selection / Model Choice Stepwise Methods Model Selection Criteria Model Averaging Variable Selection Reasons for reducing the number of variables
More information9. Linear Regression and Correlation
9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,
More information9.1 Analysis of Variance models
9.1 Analysis of Variance models Assume a continuous response variable y with categorical explanatory variable x (i.e. factor). Assume x has an inuence on expected values of y, hence there would be dierent
More informationContents. 1 Review of Residuals. 2 Detecting Outliers. 3 Influential Observations. 4 Multicollinearity and its Effects
Contents 1 Review of Residuals 2 Detecting Outliers 3 Influential Observations 4 Multicollinearity and its Effects W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 32 Model Diagnostics:
More informationBayesian Inference. Chapter 4: Regression and Hierarchical Models
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative
More informationA Bayesian Treatment of Linear Gaussian Regression
A Bayesian Treatment of Linear Gaussian Regression Frank Wood December 3, 2009 Bayesian Approach to Classical Linear Regression In classical linear regression we have the following model y β, σ 2, X N(Xβ,
More informationBayesian Econometrics - Computer section
Bayesian Econometrics - Computer section Leandro Magnusson Department of Economics Brown University Leandro Magnusson@brown.edu http://www.econ.brown.edu/students/leandro Magnusson/ April 26, 2006 Preliminary
More informationHorseshoe, Lasso and Related Shrinkage Methods
Readings Chapter 15 Christensen Merlise Clyde October 15, 2015 Bayesian Lasso Park & Casella (JASA 2008) and Hans (Biometrika 2010) propose Bayesian versions of the Lasso Bayesian Lasso Park & Casella
More informationIntroduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016
Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An
More informationApproximating high-dimensional posteriors with nuisance parameters via integrated rotated Gaussian approximation (IRGA)
Approximating high-dimensional posteriors with nuisance parameters via integrated rotated Gaussian approximation (IRGA) Willem van den Boom Department of Statistics and Applied Probability National University
More informationInferences for Regression
Inferences for Regression An Example: Body Fat and Waist Size Looking at the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set: Remembering Regression In
More informationMarkov Chain Monte Carlo
Markov Chain Monte Carlo Jamie Monogan University of Georgia Spring 2013 For more information, including R programs, properties of Markov chains, and Metropolis-Hastings, please see: http://monogan.myweb.uga.edu/teaching/statcomp/mcmc.pdf
More informationThe linear model is the most fundamental of all serious statistical models encompassing:
Linear Regression Models: A Bayesian perspective Ingredients of a linear model include an n 1 response vector y = (y 1,..., y n ) T and an n p design matrix (e.g. including regressors) X = [x 1,..., x
More informationMCMC Methods: Gibbs and Metropolis
MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution
More informationMultiple Linear Regression
Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from
More informationST 740: Linear Models and Multivariate Normal Inference
ST 740: Linear Models and Multivariate Normal Inference Alyson Wilson Department of Statistics North Carolina State University November 4, 2013 A. Wilson (NCSU STAT) Linear Models November 4, 2013 1 /
More informationMCMC 2: Lecture 2 Coding and output. Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham
MCMC 2: Lecture 2 Coding and output Phil O Neill Theo Kypraios School of Mathematical Sciences University of Nottingham Contents 1. General (Markov) epidemic model 2. Non-Markov epidemic model 3. Debugging
More informationCount Predicted Variable & Contingency Tables
Count Predicted Variable & Contingency Tables Tim Frasier Copyright Tim Frasier This work is licensed under the Creative Commons Attribution 4.0 International license. Click here for more information.
More information