Bayesian inference for sample surveys. Roderick Little Module 2: Bayesian models for simple random samples
2 Superpopulation Modeling: Estimating parameters
Various principles: least squares, method of moments, maximum likelihood
Sketch main ideas of maximum likelihood, an important approach that underlies statistical inference for many common models:
- Linear and nonlinear regression
- Generalized linear models (logistic, Poisson regression)
- Repeated measures models (SAS PROC MIXED, NLMIXED)
- Survival analysis: proportional hazards models
3 Likelihood methods
Statistical model + data yields the likelihood
Two general approaches based on the likelihood:
- maximum likelihood inference (for large samples)
- Bayesian inference (better for small samples): log(likelihood) + log(prior) = log(posterior)
Methods do not require rectangular data sets and can be applied to incomplete data
4 Definition of Likelihood
Data Y; the statistical model yields a probability density f(Y | θ) for Y with unknown parameters θ
The likelihood function is then a function of θ:
L(θ | Y) = const × f(Y | θ)
The loglikelihood is often easier to work with:
ℓ(θ | Y) = log L(θ | Y) = const + log f(Y | θ)
Constants can depend on the data but not on the parameter θ
5 Example: Normal sample
Y = (y_1, ..., y_n), a univariate iid normal sample; θ = (μ, σ²)
f(Y | μ, σ²) = (2πσ²)^(-n/2) exp( -(1/(2σ²)) Σ_{i=1}^n (y_i - μ)² )
ℓ(μ, σ² | Y) = -(n/2) ln σ² - (1/(2σ²)) Σ_{i=1}^n (y_i - μ)²
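As a quick numerical check of the loglikelihood above, here is a minimal Python sketch (the sample is hypothetical) that evaluates ℓ up to its constant and confirms that, as a function of μ, it is maximized at the sample mean:

```python
import math

def normal_loglik(mu, sigma2, y):
    """Loglikelihood of an iid N(mu, sigma2) sample, up to an additive constant."""
    n = len(y)
    return -0.5 * n * math.log(sigma2) - sum((yi - mu) ** 2 for yi in y) / (2 * sigma2)

y = [4.1, 5.2, 3.9, 6.0, 4.8]   # hypothetical sample
ybar = sum(y) / len(y)
# As a function of mu, the loglikelihood is maximized at the sample mean:
print(normal_loglik(ybar, 1.0, y) > normal_loglik(ybar + 0.5, 1.0, y))   # True
```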
6 Example: Multinomial sample
Y = (y_1, ..., y_n), a univariate K-category multinomial sample
n_j = number of y_i equal to j (j = 1, ..., K); θ = (θ_1, ..., θ_{K-1}); θ_K = 1 - Σ_{j=1}^{K-1} θ_j
f(Y | θ_1, ..., θ_{K-1}) = [ n! / (n_1! ... n_K!) ] Π_{j=1}^{K-1} θ_j^{n_j} × θ_K^{n_K}
ℓ(θ_1, ..., θ_{K-1} | Y) = Σ_{j=1}^{K-1} n_j log θ_j + n_K log θ_K
7 Maximum Likelihood Estimate
The maximum likelihood (ML) estimate θ̂ of θ maximizes the likelihood, or equivalently the loglikelihood:
L(θ̂ | Y) ≥ L(θ | Y) for all θ
The ML estimate is the value of the parameter that makes the data most likely
The ML estimate is not necessarily unique, but is for many regular problems given enough data
8 Computing the ML estimate
In regular problems, the ML estimate can be found by solving the likelihood equation
S(θ | Y) = 0
where S is the score function, defined as the first derivative of the loglikelihood:
S(θ | Y) = ∂ log L(θ | Y) / ∂θ
For some models (e.g. multiple linear regression) the likelihood equation has an explicit solution; for others (e.g. logistic regression) numerical optimization methods are needed
9 Normal Examples
Univariate normal sample Y = (y_1, ..., y_n), θ = (μ, σ²):
μ̂ = ȳ = (1/n) Σ_{i=1}^n y_i,  σ̂² = (1/n) Σ_{i=1}^n (y_i - ȳ)²
(Note the lack of a correction for degrees of freedom)
Multivariate normal sample:
μ̂ = ȳ,  Σ̂ = (1/n) Σ_{i=1}^n (y_i - ȳ)(y_i - ȳ)^T
Normal linear regression (possibly weighted):
(y_i | x_i1, ..., x_ip) ~ N( β_0 + Σ_{j=1}^p β_j x_ij, σ²/u_i )
β̂ = (β̂_0, β̂_1, ..., β̂_p) = weighted least squares estimates
σ̂² = (weighted residual sum of squares)/n
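The univariate formulas above are easy to verify numerically. A short sketch with a hypothetical sample, using Python's statistics module, where pvariance divides by n (the ML estimate) and variance divides by n - 1 (the degrees-of-freedom corrected estimate):

```python
import statistics

y = [2.0, 4.0, 6.0, 8.0]                      # hypothetical sample
n = len(y)
mu_hat = statistics.fmean(y)                  # ML estimate of mu: the sample mean
sigma2_ml = statistics.pvariance(y, mu_hat)   # divides by n: the ML estimate
s2 = statistics.variance(y, mu_hat)           # divides by n - 1: df-corrected estimate
print(mu_hat, sigma2_ml, s2)                  # 5.0 5.0 6.666...
```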
10 Multinomial Example
Y = (y_1, ..., y_n); y_i ~ MNOM(θ_1, ..., θ_K)
n_j = number of y_i equal to j (j = 1, ..., K)
Likelihood equations:
∂ℓ/∂θ_j = n_j/θ_j - n_K/θ_K = 0,  j = 1, ..., K-1
Hence the ML estimate is θ̂_j = n_j / n, j = 1, ..., K
11 Logistic regression
Pr(y_i = 1 | x_i1, ..., x_ip) = π_i(β) = exp(f_i(β)) / (1 + exp(f_i(β))),  f_i(β) = β_0 + Σ_{j=1}^p β_j x_ij
ℓ(β) = Σ_{i=1}^n [ y_i log π_i(β) + (1 - y_i) log(1 - π_i(β)) ]
ML estimation requires iterative methods like the method of scoring
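A minimal sketch of the method of scoring for this model, with one covariate; for the logistic (canonical-link) likelihood, Fisher scoring coincides with Newton's method. The data and the zero starting values are hypothetical:

```python
import math

def fit_logistic(x, y, iters=25):
    """ML fit of Pr(y=1|x) = exp(b0 + b1*x)/(1 + exp(b0 + b1*x)) by Fisher scoring."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0   # score vector and information matrix
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # scoring update: b += H^{-1} g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# hypothetical, non-separated data: Pr(y = 1) increases with x
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0, 0, 1, 0, 1, 1]
b0_hat, b1_hat = fit_logistic(x, y)
print(round(b0_hat, 3), round(b1_hat, 3))
```

At convergence the score vector is (numerically) zero, which is the likelihood equation S(β | Y) = 0 of the previous slide.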
12 ML for mixed-effects models
y_i = (y_obs,i, y_mis,i): k-dimensional vector of repeated measures
(y_i | X_i, Z_i, b_i) ~ N_k( X_i β + Z_i b_i, Σ )
β are fixed effects; b_i are random effects: b_i ~ N_q(0, Γ)
Missing data mechanism: missing at random
ML requires iterative algorithms, e.g. Harville (1977), Laird and Ware (1982), SAS PROC MIXED
Very flexible mean and covariance structures
Normality is not a major assumption if N is large, and recent programs allow for non-normal outcomes
13 Properties of ML estimates
Under the assumed model, the ML estimate is:
- Consistent (not necessarily unbiased)
- Efficient for large samples; not necessarily the best for small samples
The ML estimate is transformation invariant:
if θ̂ is the ML estimate of θ, then g(θ̂) is the ML estimate of g(θ)
14 Large-sample ML Inference
Basic large-sample approximation: for regular problems,
(θ̂ - θ) ~ N(0, C)
where C is a covariance matrix estimated from the sample
The frequentist treats θ̂ as random and θ as fixed; the equation defines the sampling distribution of θ̂
The Bayesian treats θ as random and θ̂ as fixed; the equation defines the posterior distribution of θ
15 Forms of precision matrix
The precision of the ML estimate is measured by C^{-1}. Some forms for this are:
Observed information (recommended):
C^{-1} = I(θ̂ | Y),  I(θ | Y) = -∂² log L(θ | Y) / ∂θ²
Expected information (not as good, may be simpler):
C^{-1} = J(θ̂),  J(θ) = E[ I(θ | Y) ]
Some other approximation to the curvature of the loglikelihood in the neighborhood of the ML estimate
16 Interval estimation
A 95% (confidence, probability) interval for scalar θ is θ̂ ± 1.96 √C, where 1.96 is the 97.5th percentile of the normal distribution
Example: univariate normal sample, θ = (μ, σ²):
I = J = diag( n/σ², n/(2σ⁴) ),  C = diag( σ̂²/n, 2σ̂⁴/n )
Hence some 95% intervals are:
μ: ȳ ± 1.96 σ̂/√n;  σ²: σ̂² ± 1.96 σ̂² √(2/n)
17 Significance Tests
Tests are based on likelihood ratio (LR) or Wald (W) statistics
θ = (θ_(1), θ_(2)); θ_(1)0 = null value of θ_(1); θ_(2) = other parameters
θ̂ = (θ̂_(1), θ̂_(2)) = unrestricted ML estimate; θ̂_(2)0 = ML estimate of θ_(2) given θ_(1) = θ_(1)0
LR statistic: LR = 2[ ℓ(θ̂ | Y) - ℓ(θ_(1)0, θ̂_(2)0 | Y) ]
Wald statistic: W = (θ̂_(1) - θ_(1)0)^T C_(11)^{-1} (θ̂_(1) - θ_(1)0),  C_(11) = covariance matrix of θ̂_(1)
These yield P-values P = Pr( χ²_q > D ), where D = LR or Wald statistic, q = dimension of θ_(1), and χ²_q = chi-squared distribution with q degrees of freedom
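For a scalar null hypothesis (q = 1), the chi-squared P-value can be computed from the normal CDF alone, since χ²_1 = Z² for Z standard normal, so Pr(χ²_1 > d) = 2(1 - Φ(√d)). A small sketch with a hypothetical Wald statistic:

```python
import math
from statistics import NormalDist

def chi2_sf_1df(d):
    """P(chi-squared_1 > d), using the fact that chi-squared_1 = Z^2, Z ~ N(0, 1)."""
    return 2.0 * (1.0 - NormalDist().cdf(math.sqrt(d)))

# hypothetical scalar test: estimate, null value, and estimated variance C
theta_hat, theta0, C = 1.3, 1.0, 0.04
W = (theta_hat - theta0) ** 2 / C              # Wald statistic, q = 1
print(round(W, 2), round(chi2_sf_1df(W), 3))   # 2.25 0.134
```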
18 Bayesian Modeling
The Bayesian model adds a prior distribution for the parameters:
(Y, θ | Z) ~ π(θ | Z) f(Y | Z, θ),  π(θ | Z) = prior distribution
Inference about θ is based on the posterior distribution from Bayes' theorem:
p(θ | Z, Y_inc) ∝ π(θ | Z) × L(θ | Z, Y_inc),  L = likelihood
Inference about a finite population quantity Q(Y) is based on the posterior predictive distribution p(Q(Y) | Z, Y_inc) of Q given the sample values Y_inc:
p(Q(Y) | Z, Y_inc) = ∫ p(Q(Y) | Z, Y_inc, θ) p(θ | Z, Y_inc) dθ
(integrates out the nuisance parameters θ)
In the superpopulation modeling approach, parameters are considered fixed and estimated; in the Bayesian approach, parameters are random and assigned prior distributions, which leads to better small-sample inference
19 Bayes Inference
Inferences about Q are based on its posterior predictive distribution:
- the estimate is the posterior mean: q̂ = E(Q | Y_inc)
- the standard error is the posterior sd: s = √Var(Q | Y_inc)
- a 95% posterior probability (or credibility) interval plays the role of a confidence interval (but with a simpler interpretation)
In large samples, a 95% interval is q̂ ± 1.96 s
In small samples, one can use the highest posterior density (hpd) interval, or the 2.5th to 97.5th percentiles of the posterior distribution (often simulated using MCMC draws from the posterior distribution)
20 Inference about population quantities
Inferences about Q are conveniently obtained by first conditioning on θ and then averaging over the posterior of θ. In particular, the posterior mean is:
E(Q | Y_inc) = E[ E(Q | Y_inc, θ) | Y_inc ]
and the posterior variance is:
Var(Q | Y_inc) = E[ Var(Q | Y_inc, θ) | Y_inc ] + Var[ E(Q | Y_inc, θ) | Y_inc ]
The value of this technique will become clear in applications
Finite population corrections are automatically obtained as differences in the posterior variances of Q and θ
Inferences based on the full posterior distribution are useful in small samples (e.g. they provide t corrections)
21 Example: linear regression
The normal linear regression model:
(y_i | x_i1, ..., x_ip) ~ N( β_0 + Σ_{j=1}^p β_j x_ij, σ² )
with the non-informative Jeffreys prior:
p(β_0, ..., β_p, log σ²) = const
yields the posterior distribution of (β_0, ..., β_p) as multivariate t with
- mean given by the least squares estimates (β̂_0, ..., β̂_p)
- covariance matrix (X^T X)^{-1} s², where X is the design matrix
- degrees of freedom n - p - 1
The resulting posterior probability intervals are equivalent to standard t confidence intervals.
22 Simulating Draws from the Posterior Distribution
For problems with high-dimensional θ, it is often easier to draw values from the posterior distribution and base inferences on these draws
For example, if (θ_1^(d): d = 1, ..., D) is a set of draws from the posterior distribution for a scalar parameter θ_1, then
θ̄_1 = (1/D) Σ_d θ_1^(d) approximates the posterior mean
(1/(D-1)) Σ_d (θ_1^(d) - θ̄_1)² approximates the posterior variance
and the 2.5th to 97.5th percentiles of the draws approximate a 95% posterior credibility interval for θ_1
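A short simulation sketch of these approximations, drawing D values from a hypothetical N(2, 0.5²) posterior for θ_1 and summarizing them by the mean, sd, and percentile interval:

```python
import random
import statistics

random.seed(1)
# D draws from a hypothetical N(2, 0.5^2) posterior for a scalar theta_1
draws = sorted(random.gauss(2.0, 0.5) for _ in range(20000))

post_mean = statistics.fmean(draws)             # approximates the posterior mean
post_sd = statistics.stdev(draws)               # approximates the posterior sd
lo = draws[int(0.025 * len(draws))]             # 2.5th percentile of the draws
hi = draws[int(0.975 * len(draws))]             # 97.5th percentile of the draws
print(round(post_mean, 2), round(post_sd, 2))   # close to 2.0 and 0.5
```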
23 Example: Posterior Draws for Normal Linear Regression
(β̂, s²) = least squares estimates of the slopes and residual variance
σ²^(d) = (n - p - 1) s² / χ²_{n-p-1}^(d),  χ²_{n-p-1} = chi-squared deviate with n - p - 1 df
β^(d) = β̂ + σ^(d) A z,  z = (z_1, ..., z_{p+1})^T, z_i ~ N(0, 1)
A = upper triangular Cholesky factor of (X^T X)^{-1}: A A^T = (X^T X)^{-1}
Easily extends to weighted regression: see Example
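The algorithm above can be sketched for simple linear regression (p = 1) in pure Python; the data are hypothetical, the chi-squared deviate is formed as a sum of squared standard normals, and the 2×2 Cholesky factor of (X'X)^{-1} is written out by hand:

```python
import random
import statistics

random.seed(7)
# hypothetical data for y = b0 + b1 * x + error (p = 1 slope)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n, p = len(x), 1

# least squares estimates and residual variance s^2 on n - p - 1 df
xbar, ybar = statistics.fmean(x), statistics.fmean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
df = n - p - 1
s2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / df

# (X'X)^{-1} for the design [1, x], and its upper triangular factor A with A A^T = (X'X)^{-1}
sx, sxx = sum(x), sum(xi * xi for xi in x)
det = n * sxx - sx * sx
v00, v01, v11 = sxx / det, -sx / det, n / det
a11 = v11 ** 0.5
a01 = v01 / a11
a00 = (v00 - a01 * a01) ** 0.5

def draw():
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(df))   # chi-squared_{n-p-1} deviate
    sig2 = df * s2 / chi2                                    # draw of sigma^2
    sig = sig2 ** 0.5
    z0, z1 = random.gauss(0, 1), random.gauss(0, 1)
    return b0 + sig * (a00 * z0 + a01 * z1), b1 + sig * a11 * z1, sig2

draws = [draw() for _ in range(5000)]
slope_mean = statistics.fmean(d[1] for d in draws)
print(round(slope_mean, 2))   # close to the LS slope b1
```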
24 Consulting Example
In India, any person possessing a radio, transistor or television has to pay a license fee. In a densely populated area with mostly makeshift houses, practically no one was paying these fees. It was determined that for enforcement to be fiscally meaningful, the proportion of households possessing one or more of these devices must exceed a certain limit.
25 Consulting example (continued)
N = population size
Y_i = 1 if household i has a device, 0 otherwise
Q = Σ_{i=1}^N Y_i / N = proportion of households with a device
Question of interest: Pr(Q > 0.3)
If the probability of Q exceeding 0.3 is very high, then enforcement might be fiscally sensible
Conduct a small-scale survey to answer the question of interest
Note that the question only makes sense under the Bayes paradigm
26 Consulting example (continued)
srs of size n: Y_inc = (Y_1, ..., Y_n), Y_exc = (Y_{n+1}, ..., Y_N)
Model for observables: Y_i | θ ~ iid Bernoulli(θ); x = Σ_{i=1}^n Y_i
f(x | θ) = C(n, x) θ^x (1 - θ)^{n-x}
Prior distribution: π(θ) = 1, θ ∈ (0, 1)
Estimand: Q = Σ_{i=1}^N Y_i / N = [ x + Σ_{i=n+1}^N Y_i ] / N
27 Binomial Example
The posterior distribution is:
p(θ | x) = f(x | θ) π(θ) / ∫ f(x | θ) π(θ) dθ = θ^x (1 - θ)^{n-x} / ∫_0^1 θ^x (1 - θ)^{n-x} dθ
θ | x ~ Beta(x + 1, n - x + 1)
28 Infinite Population
For N → ∞,
Pr( Σ_{i=1}^N Y_i / N > 0.3 | x ) → Pr( θ > 0.3 | x )
Compute using the cumulative distribution function of a beta distribution, a standard function in most software such as SAS and R
What is the maximum proportion of households in the population with devices that can be stated with great certainty? Solve Pr(θ ≤ θ* | x) = 0.9 for θ* using the inverse CDF of the beta distribution
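Since the posterior is Beta(x + 1, n - x + 1), both quantities can be approximated by Monte Carlo with the Python standard library alone (the survey counts below are hypothetical; in practice one would use the beta CDF and inverse CDF directly, e.g. pbeta/qbeta in R):

```python
import random

random.seed(3)
n, x = 50, 20   # hypothetical survey: 20 of 50 sampled households own a device
# Under the uniform prior, theta | x ~ Beta(x + 1, n - x + 1)
draws = sorted(random.betavariate(x + 1, n - x + 1) for _ in range(40000))

pr_gt_03 = sum(t > 0.3 for t in draws) / len(draws)   # approximates Pr(theta > 0.3 | x)
q90 = draws[int(0.9 * len(draws))]                    # 90th posterior percentile of theta
print(round(pr_gt_03, 2), round(q90, 2))
```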
29 Point Estimates
A point estimate is often used as a single summary "best" value for the unknown Q
Some choices are the mean, mode or median of the posterior distribution of Q
For symmetrical distributions an intuitive choice is the center of symmetry
For asymmetrical distributions the choice is not clear; it depends upon the loss function
30 Interval Estimation
A better summary is an interval estimate. Two approaches:
- Fix the coverage rate 1 - α in advance and determine the highest posterior density region C to include the most likely values of Q totaling 1 - α posterior probability
- Fix the value Q_0 in advance, determine C as the collection of values of Q more likely than Q_0, and calculate the coverage 1 - α as the posterior probability of this C
31 Interval Estimates
C is such that
(1) p(Q | Y_inc) ≥ p(Q' | Y_inc) for all Q ∈ C, Q' ∉ C
(2) Pr(Q ∈ C | Y_inc) = 1 - α
"Most likely" is usually defined by highest posterior density
For symmetric unimodal posterior distributions, the HPD interval is (q_{α/2}, q_{1-α/2}), where q_p denotes the pth quantile of the posterior distribution of Q
In the binomial example, the beta density of θ is used to determine the interval estimate of Q
32 Normal simple random sample
Y_i ~ iid N(μ, σ²), i = 1, 2, ..., N; θ = (μ, σ²)
A simple random sample results in Y_inc = (y_1, ..., y_n)
Q = Ȳ = [ n ȳ + (N - n) Ȳ_exc ] / N = f ȳ + (1 - f) Ȳ_exc,  f = n/N
Derive the posterior distribution of Q given Y_inc
33 Normal Example
Posterior distribution of θ = (μ, σ²) with the noninformative prior π(μ, log σ²) ∝ const:
p(μ, σ² | Y_inc) ∝ π(μ, σ²) (σ²)^{-n/2} exp( -(1/(2σ²)) Σ_{i∈inc} (y_i - μ)² )
∝ (σ²)^{-n/2-1} exp( -[ Σ_{i∈inc} (y_i - ȳ)² + n(ȳ - μ)² ] / (2σ²) )
The above expressions imply that
(1) σ² | Y_inc ~ Σ_{i∈inc} (y_i - ȳ)² / χ²_{n-1}
(2) μ | Y_inc, σ² ~ N(ȳ, σ²/n)
34 Posterior Distribution of Q
Ȳ_exc | μ, σ² ~ N( μ, σ²/(N - n) )
Ȳ_exc | σ², Y_inc ~ N( ȳ, σ²/(N - n) + σ²/n ) = N( ȳ, σ² / ((1 - f) n) )
Q = f ȳ + (1 - f) Ȳ_exc, so
Q | σ², Y_inc ~ N( ȳ, (1 - f) σ²/n )
Integrating over σ²:
Ȳ_exc | Y_inc ~ t_{n-1}( ȳ, s² / ((1 - f) n) )
Q | Y_inc ~ t_{n-1}( ȳ, (1 - f) s²/n )
35 HPD Interval for Q
Note that the posterior t distribution of Q is symmetric and unimodal: values in the center of the distribution are more likely than those in the tails. Thus a (1 - α)100% HPD interval is:
ȳ ± t_{n-1, 1-α/2} √( (1 - f) s²/n )
Like the frequentist confidence interval, but recovers the t correction
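A sketch of this interval in Python with hypothetical data; since the standard library has no t inverse CDF, the t percentile is itself approximated by simulation (a normal deviate divided by the square root of a scaled chi-squared deviate):

```python
import random
import statistics

random.seed(5)
# hypothetical SRS of n = 10 from a population of size N = 100
y = [9.8, 10.4, 11.1, 9.5, 10.0, 10.9, 9.2, 10.6, 10.3, 9.7]
n, N = len(y), 100
f = n / N
ybar = statistics.fmean(y)
s2 = statistics.variance(y)   # divides by n - 1

def t_draw(df):
    """One draw from the t distribution with df degrees of freedom."""
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    return random.gauss(0, 1) / (chi2 / df) ** 0.5

# 97.5th percentile of t_{n-1} by simulation (about 2.262 for 9 df)
draws = sorted(t_draw(n - 1) for _ in range(50000))
t975 = draws[int(0.975 * len(draws))]

half = t975 * ((1 - f) * s2 / n) ** 0.5
print(round(ybar - half, 2), round(ybar + half, 2))   # 95% HPD interval for Q
```

The interval is wider than the normal-theory interval ȳ ± 1.96 √((1 - f)s²/n), which is the small-sample t correction the slide refers to.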
36 Some other Estimands
Suppose Q = the median or some other percentile; one is then better off inferring about all non-sampled values
As we will see later, simulating values of Y_exc adds enormous flexibility for drawing inferences about any finite population quantity
Modern Bayesian methods rely heavily on simulating values from the posterior distribution of the model parameters and the posterior predictive distribution of the nonsampled values
Computationally, if the population size N is too large, then choose an arbitrary value K large relative to n, the sample size
Example: a national sample of size n = 2000, US population size 306 million; for numerical approximation we can choose K = 2000/f for some small f, e.g. f = 0.01 or 0.02
37 Comparison of Two Populations
Population 1: size N_1, sample size n_1, Y_{1i} ~ ind N(μ_1, σ_1²), θ_1 = (μ_1, σ_1²)
Sample statistics: (ȳ_1, s_1²)
Posterior distributions: (n_1 - 1) s_1² / σ_1² ~ χ²_{n_1-1};  μ_1 | σ_1² ~ N(ȳ_1, σ_1²/n_1);  Y_{1i} | θ_1 ~ N(μ_1, σ_1²), i ∈ exc
Population 2: size N_2, sample size n_2, Y_{2i} ~ ind N(μ_2, σ_2²), θ_2 = (μ_2, σ_2²)
Sample statistics: (ȳ_2, s_2²)
Posterior distributions: (n_2 - 1) s_2² / σ_2² ~ χ²_{n_2-1};  μ_2 | σ_2² ~ N(ȳ_2, σ_2²/n_2);  Y_{2i} | θ_2 ~ N(μ_2, σ_2²), i ∈ exc
38 Examples: Estimands
Ȳ_1 - Ȳ_2 (finite sample version of the Behrens-Fisher problem)
Difference Pr(Y_1 > c) - Pr(Y_2 > c)
Difference in the population medians
Ratio of the means or medians
Ratio of variances
It is possible to analytically compute the posterior distribution of some of these quantities; it is a whole lot easier to simulate values of the nonsampled Y_1's in Population 1 and Y_2's in Population 2
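A sketch of the simulation approach for two hypothetical samples: each iteration draws (μ, σ²) for each population from its posterior on the previous slide, then summarizes the difference in means and the ratio of variances:

```python
import random
import statistics

random.seed(11)
y1 = [12.1, 13.4, 11.8, 12.9, 13.1, 12.5]   # hypothetical sample from population 1
y2 = [10.2, 11.0, 10.5, 11.4, 10.8]         # hypothetical sample from population 2

def posterior_draw(y):
    """One draw of (mu, sigma^2) under the usual noninformative prior."""
    n = len(y)
    ybar, s2 = statistics.fmean(y), statistics.variance(y)
    chi2 = sum(random.gauss(0, 1) ** 2 for _ in range(n - 1))
    sig2 = (n - 1) * s2 / chi2                    # sigma^2 | Y ~ (n-1) s^2 / chi^2_{n-1}
    mu = random.gauss(ybar, (sig2 / n) ** 0.5)    # mu | sigma^2, Y ~ N(ybar, sigma^2/n)
    return mu, sig2

diffs, var_ratios = [], []
for _ in range(10000):
    mu1, s2_1 = posterior_draw(y1)
    mu2, s2_2 = posterior_draw(y2)
    diffs.append(mu1 - mu2)            # draw of mu_1 - mu_2 (Behrens-Fisher)
    var_ratios.append(s2_1 / s2_2)     # draw of sigma_1^2 / sigma_2^2

pr_pos = sum(d > 0 for d in diffs) / len(diffs)
print(round(statistics.fmean(diffs), 2), round(pr_pos, 3))
```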
39 Bayesian Nonparametric Inference
Population: Y_1, Y_2, Y_3, ..., Y_N
All possible distinct values: d_1, d_2, ..., d_K
Model: Pr(Y_i = d_k) = θ_k
Prior: π(θ_1, θ_2, ..., θ_K) ∝ Π_k θ_k^{-1} if Σ_k θ_k = 1
Mean and variance:
E(Y_i | θ) = μ = Σ_k d_k θ_k
Var(Y_i | θ) = Σ_k d_k² θ_k - μ²
40 Bayesian Nonparametric Inference (continued)
SRS of size n, with n_k equal to the number of d_k in the sample
The objective is to draw inference about the population mean:
Q = f ȳ + (1 - f) Ȳ_exc
As before, we need the posterior distribution of θ and Ȳ_exc
41 Nonparametric Inference (continued)
The posterior distribution of θ is Dirichlet:
p(θ | Y_inc) ∝ Π_k θ_k^{n_k - 1} if Σ_k θ_k = 1 and 0 ≤ θ_k ≤ 1
Posterior mean, variance and covariance of θ:
E(θ_k | Y_inc) = n_k / n
Var(θ_k | Y_inc) = n_k (n - n_k) / ( n² (n + 1) )
Cov(θ_k, θ_l | Y_inc) = -n_k n_l / ( n² (n + 1) )
42 Inference for Q
E(μ | Y_inc) = Σ_k d_k n_k / n = ȳ
Var(μ | Y_inc) = s² (n - 1) / ( n (n + 1) ),  s² = (1/(n - 1)) Σ_{i∈inc} (y_i - ȳ)²
Hence the posterior mean and variance of Q are:
E(Q | Y_inc) = f ȳ + (1 - f) E(Ȳ_exc | Y_inc) = ȳ
Var(Q | Y_inc) = (1 - f) s² (n - 1) / ( n (n + 1) )
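These moment formulas can be checked numerically. The sketch below (with a hypothetical sample) computes the Dirichlet posterior moments of the θ_k and confirms that the implied posterior mean and variance of μ = Σ_k d_k θ_k reduce to ȳ and (n - 1)s²/(n(n + 1)):

```python
import statistics
from collections import Counter

y = [1.0, 1.0, 2.0, 2.0, 2.0, 4.0]   # hypothetical sample; the distinct values are the d_k
n = len(y)
counts = Counter(y)                   # n_k for each observed d_k

# Dirichlet posterior moments of the category probabilities theta_k
E = {d: nk / n for d, nk in counts.items()}
V = {d: nk * (n - nk) / (n ** 2 * (n + 1)) for d, nk in counts.items()}

# posterior mean and variance of mu = sum_k d_k theta_k
post_mean = sum(d * E[d] for d in counts)
post_var = (sum(d * d * V[d] for d in counts)
            - sum(d1 * d2 * counts[d1] * counts[d2] / (n ** 2 * (n + 1))
                  for d1 in counts for d2 in counts if d1 != d2))

s2 = statistics.variance(y)
print(post_mean, post_var)   # equal to ybar and (n - 1) s^2 / (n (n + 1))
```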
43 Ratio and Regression Estimates
Population: (y_i, x_i; i = 1, ..., N)
Sample: y_i observed for i ∈ inc; x_i observed for i = 1, ..., N. For now assume SRS
Objective: infer about the population mean Q = Σ_{i=1}^N y_i / N
The excluded Y's are missing values: y_1, ..., y_n are observed and y_{n+1}, ..., y_N are missing, while x_1, ..., x_n, x_{n+1}, ..., x_N are all observed
44 Model Specification
(y_i | x_i, β, σ²) ~ ind N( β x_i, σ² x_i^{2g} ),  i = 1, 2, ..., N,  g known
Prior distribution: π(β, log σ²) ∝ const
g = 1/2: classical ratio estimator; the posterior variance equals the randomization variance for large samples
g = 0: regression through the origin; the posterior variance is nearly the same as the randomization variance
g = 1: HT model; the posterior variance equals the randomization variance for large samples
Note that no asymptotic arguments have been used in deriving the Bayesian inferences, which make small-sample corrections and use t-distributions
45 Some Remarks
For large samples, the estimate and its variance under the nonparametric model assumptions are very nearly the same as those under the normal model assumptions
For large N, the population size, the finite population quantity is very nearly the same as the model parameter (Q ≈ θ)
For large samples,
[ Q - E(Q | Y_inc) ] / √Var(Q | Y_inc) ~ N(0, 1)
46 Remarks (Continued)
Bayesian interpretation: the summary of the excluded portion of the population has an approximate normal distribution conditional on the observed data; that is, Y_inc is fixed and Q is random
Frequentist interpretation: under repeated sampling, this is the distribution of estimates of Q; that is, Q is fixed and Y_inc is random
For large samples, the frequentist and Bayesian approaches give nearly the same numerical answers, but the interpretations differ
47 Remarks (Continued)
In much practical analysis the prior information is diffuse, and the likelihood dominates the prior information
Jeffreys (1961) developed noninformative priors based on the notion of very little prior information relative to the information provided by the data
Jeffreys derived the noninformative prior by requiring invariance under parameter transformation. In general,
π(θ) ∝ J(θ)^{1/2}
where J(θ) = -E[ ∂² log f(y | θ) / ∂θ² | θ ] is the expected information
48 Examples of noninformative priors
Normal: π(μ, log σ²) ∝ const
Binomial: π(θ) ∝ θ^{-1/2} (1 - θ)^{-1/2}
Poisson: π(λ) ∝ λ^{-1/2}
Normal regression with slopes β: π(β, log σ²) ∝ const
In simple cases these noninformative priors give numerically the same answers as standard frequentist procedures
49 Comments
The iid (normal) model for simple random sampling is based on the exchangeability ideas of de Finetti
Other off-the-shelf models are more appropriate for other sample designs; hence the design influences the choice of model, as we shall see
Even in this simple normal problem, Bayes is useful: t-inference is recovered for small samples by putting a prior on the unknown variance
Bayes is even more attractive for more complex problems, as discussed later.
50 Summary
Considered Bayesian predictive inference for population quantities
Focused here on the population mean, but posterior distributions of more complex finite population quantities Q can also be derived
The key is to compute the posterior distribution of Q conditional on the data and model
Summarize the posterior distribution using the posterior mean, variance, HPD interval, etc.
Modern Bayesian analysis uses simulation techniques to study the posterior distribution
Models need to incorporate complex design features like unequal selection, stratification and clustering
More informationRonald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California
Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University
More informationWU Weiterbildung. Linear Mixed Models
Linear Mixed Effects Models WU Weiterbildung SLIDE 1 Outline 1 Estimation: ML vs. REML 2 Special Models On Two Levels Mixed ANOVA Or Random ANOVA Random Intercept Model Random Coefficients Model Intercept-and-Slopes-as-Outcomes
More informationGeneralized Linear Models 1
Generalized Linear Models 1 STA 2101/442: Fall 2012 1 See last slide for copyright information. 1 / 24 Suggested Reading: Davison s Statistical models Exponential families of distributions Sec. 5.2 Chapter
More informationMultinomial Logistic Regression Models
Stat 544, Lecture 19 1 Multinomial Logistic Regression Models Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r>2 categories. (Note: The word
More informationUNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator
UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages
More information401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.
401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis
More informationStat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016
Stat 5421 Lecture Notes Proper Conjugate Priors for Exponential Families Charles J. Geyer March 28, 2016 1 Theory This section explains the theory of conjugate priors for exponential families of distributions,
More informationContents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1
Contents Preface to Second Edition Preface to First Edition Abbreviations xv xvii xix PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 1 The Role of Statistical Methods in Modern Industry and Services
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationLikelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square
Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square Yuxin Chen Electrical Engineering, Princeton University Coauthors Pragya Sur Stanford Statistics Emmanuel
More informationFirst Year Examination Department of Statistics, University of Florida
First Year Examination Department of Statistics, University of Florida August 19, 010, 8:00 am - 1:00 noon Instructions: 1. You have four hours to answer questions in this examination.. You must show your
More informationObjectives Simple linear regression. Statistical model for linear regression. Estimating the regression parameters
Objectives 10.1 Simple linear regression Statistical model for linear regression Estimating the regression parameters Confidence interval for regression parameters Significance test for the slope Confidence
More informationBasics of Modern Missing Data Analysis
Basics of Modern Missing Data Analysis Kyle M. Lang Center for Research Methods and Data Analysis University of Kansas March 8, 2013 Topics to be Covered An introduction to the missing data problem Missing
More information4.2 Estimation on the boundary of the parameter space
Chapter 4 Non-standard inference As we mentioned in Chapter the the log-likelihood ratio statistic is useful in the context of statistical testing because typically it is pivotal (does not depend on any
More informationMS&E 226: Small Data
MS&E 226: Small Data Lecture 15: Examples of hypothesis tests (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 32 The recipe 2 / 32 The hypothesis testing recipe In this lecture we repeatedly apply the
More informationConfidence Distribution
Confidence Distribution Xie and Singh (2013): Confidence distribution, the frequentist distribution estimator of a parameter: A Review Céline Cunen, 15/09/2014 Outline of Article Introduction The concept
More informationEstimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk
Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:
More informationMonte Carlo Studies. The response in a Monte Carlo study is a random variable.
Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating
More information2 Statistical Estimation: Basic Concepts
Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:
More informationStatistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach
Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score
More informationStat 542: Item Response Theory Modeling Using The Extended Rank Likelihood
Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal
More informationThe binomial model. Assume a uniform prior distribution on p(θ). Write the pdf for this distribution.
The binomial model Example. After suspicious performance in the weekly soccer match, 37 mathematical sciences students, staff, and faculty were tested for the use of performance enhancing analytics. Let
More informationStat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2
Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate
More informationMaximum-Likelihood Estimation: Basic Ideas
Sociology 740 John Fox Lecture Notes Maximum-Likelihood Estimation: Basic Ideas Copyright 2014 by John Fox Maximum-Likelihood Estimation: Basic Ideas 1 I The method of maximum likelihood provides estimators
More informationTheory of Maximum Likelihood Estimation. Konstantin Kashin
Gov 2001 Section 5: Theory of Maximum Likelihood Estimation Konstantin Kashin February 28, 2013 Outline Introduction Likelihood Examples of MLE Variance of MLE Asymptotic Properties What is Statistical
More informationStatistics: Learning models from data
DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial
More informationSTA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3
STA 303 H1S / 1002 HS Winter 2011 Test March 7, 2011 LAST NAME: FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 303 STA 1002 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator. Some formulae
More informationFREQUENTIST BEHAVIOR OF FORMAL BAYESIAN INFERENCE
FREQUENTIST BEHAVIOR OF FORMAL BAYESIAN INFERENCE Donald A. Pierce Oregon State Univ (Emeritus), RERF Hiroshima (Retired), Oregon Health Sciences Univ (Adjunct) Ruggero Bellio Univ of Udine For Perugia
More informationNormal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,
Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability
More informationQuiz 1. Name: Instructions: Closed book, notes, and no electronic devices.
Quiz 1. Name: Instructions: Closed book, notes, and no electronic devices. 1.(10) What is usually true about a parameter of a model? A. It is a known number B. It is determined by the data C. It is an
More informationSPRING 2007 EXAM C SOLUTIONS
SPRING 007 EXAM C SOLUTIONS Question #1 The data are already shifted (have had the policy limit and the deductible of 50 applied). The two 350 payments are censored. Thus the likelihood function is L =
More informationPsychology 282 Lecture #4 Outline Inferences in SLR
Psychology 282 Lecture #4 Outline Inferences in SLR Assumptions To this point we have not had to make any distributional assumptions. Principle of least squares requires no assumptions. Can use correlations
More informationThe Bayesian Choice. Christian P. Robert. From Decision-Theoretic Foundations to Computational Implementation. Second Edition.
Christian P. Robert The Bayesian Choice From Decision-Theoretic Foundations to Computational Implementation Second Edition With 23 Illustrations ^Springer" Contents Preface to the Second Edition Preface
More informationTesting Linear Restrictions: cont.
Testing Linear Restrictions: cont. The F-statistic is closely connected with the R of the regression. In fact, if we are testing q linear restriction, can write the F-stastic as F = (R u R r)=q ( R u)=(n
More informationSample size calculations for logistic and Poisson regression models
Biometrika (2), 88, 4, pp. 93 99 2 Biometrika Trust Printed in Great Britain Sample size calculations for logistic and Poisson regression models BY GWOWEN SHIEH Department of Management Science, National
More informationA Discussion of the Bayesian Approach
A Discussion of the Bayesian Approach Reference: Chapter 10 of Theoretical Statistics, Cox and Hinkley, 1974 and Sujit Ghosh s lecture notes David Madigan Statistics The subject of statistics concerns
More informationStatistical Distribution Assumptions of General Linear Models
Statistical Distribution Assumptions of General Linear Models Applied Multilevel Models for Cross Sectional Data Lecture 4 ICPSR Summer Workshop University of Colorado Boulder Lecture 4: Statistical Distributions
More informationSTA6938-Logistic Regression Model
Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of
More information1 Exercises for lecture 1
1 Exercises for lecture 1 Exercise 1 a) Show that if F is symmetric with respect to µ, and E( X )
More informationNoninformative Priors for the Ratio of the Scale Parameters in the Inverted Exponential Distributions
Communications for Statistical Applications and Methods 03, Vol. 0, No. 5, 387 394 DOI: http://dx.doi.org/0.535/csam.03.0.5.387 Noninformative Priors for the Ratio of the Scale Parameters in the Inverted
More informationGeneralized Models: Part 1
Generalized Models: Part 1 Topics: Introduction to generalized models Introduction to maximum likelihood estimation Models for binary outcomes Models for proportion outcomes Models for categorical outcomes
More informationSession 3 The proportional odds model and the Mann-Whitney test
Session 3 The proportional odds model and the Mann-Whitney test 3.1 A unified approach to inference 3.2 Analysis via dichotomisation 3.3 Proportional odds 3.4 Relationship with the Mann-Whitney test Session
More informationDefine characteristic function. State its properties. State and prove inversion theorem.
ASSIGNMENT - 1, MAY 013. Paper I PROBABILITY AND DISTRIBUTION THEORY (DMSTT 01) 1. (a) Give the Kolmogorov definition of probability. State and prove Borel cantelli lemma. Define : (i) distribution function
More information