arxiv: v4 [math.st] 3 Jun 2016

Electronic Journal of Statistics ISSN: 1935-7524 arxiv: math.pr/0000000

Bayesian Estimation Under Informative Sampling

Terrance D. Savitsky and Daniell Toth
U.S. Bureau of Labor Statistics, 2 Massachusetts Ave. N.E, Washington, D.C. 20212 USA
e-mail: Savitsky.Terrance@bls.gov e-mail: Toth.Daniell@bls.gov

arxiv:1507.07050v4 [math.st] 3 Jun 2016

Abstract: Bayesian analysis is increasingly popular for use in social science and other application areas where the data are observations from an informative sample. An informative sampling design leads to inclusion probabilities that are correlated with the response variable of interest. Model inference performed on the observed sample taken from the population will be biased for the population generative model under informative sampling since the balance of information in the sample data is different from that for the population. Typical approaches to account for an informative sampling design under Bayesian estimation are often difficult to implement because they require re-parameterization of the hypothesized generating model, or focus on design, rather than model-based, inference. We propose to construct a pseudo-posterior distribution that utilizes sampling weights based on the marginal inclusion probabilities to exponentiate the likelihood contribution of each sampled unit, which weights the information in the sample back to the population. Our approach provides a nearly automated estimation procedure applicable to any model specified by the data analyst for the population and retains the population model parameterization and posterior sampling geometry. We construct conditions on known marginal and pairwise inclusion probabilities that define a class of sampling designs where $L_1$ consistency of the pseudo posterior is guaranteed. We demonstrate our method on an application concerning the Bureau of Labor Statistics Job Openings and Labor Turnover Survey.

Keywords and phrases: Survey sampling, Gaussian process, Dirichlet process, Bayesian hierarchical models, Latent models, Markov Chain Monte Carlo.

1. Introduction

Bayesian formulations are increasingly popular for modeling hypothesized distributions with complicated dependence structures. Their popularity stems from the ease of capturing this dependence by employing models with random effects parameters with a hierarchical construction that regulates the borrowing of information for estimation. Latent parameters are often used in the model to permit flexibility in the estimation of the dependencies among the observations (Dunson 2010). In social science applications, utilization of latent parameters may be useful for making inference about intrinsic belief states of people from their observed actions (see, for example, Savitsky & Dalal 2013). Other application areas in which latent parameters may be employed include engineering and natural science, which use them to parameterize elements of an evolving process. Data used in these types of applications are often acquired through a complex sample design, resulting in probabilities of inclusion that are associated with the variable of interest. This association could result in an observed data set consisting of units that are not independent and identically distributed. A sampling design that produces a correlation between selection probabilities and observed values is referred to as informative. Failure to account for this dependence caused by the sampling design could bias estimation of parameters that index the joint distribution hypothesized to have generated the population (Holt et al. 1980).

1.1. Examples

We next outline some examples of survey instruments that employ informative sampling designs and associated inferential goals for models estimated on observed samples realized from these surveys.

Example 1: The Survey of Occupational Illnesses and Injuries (SOII) is administered to U.S. business establishments by the U.S.
Bureau of Labor Statistics (BLS), in partnership with individual states, in order to capture workplace induced injuries and illnesses. A stratified sampling design is used where strata are indexed by state-industry-size-injury rate. Strata containing establishments that historically express higher injury rates are assigned higher sample inclusion probabilities. The resulting sample will contain a larger proportion of establishments that express higher injury rates than the population, as a whole. States desire to perform regression modeling with variable selection to discover the root causes that predict illnesses and injuries among the population of establishments, estimated from the observed sample. The model-estimated coefficients from the sample will be biased absent correction for over-representation of establishments that tend to express relatively high injury rates.

Example 2: The Current Establishment Statistics (CES) is a BLS survey of U.S. business establishments that collects employment count data across states and industries under a stratified sampling design with strata indexed by the number of employees in each establishment. Strata containing relatively larger establishments are assigned higher inclusion probabilities than those which hold establishments with relatively fewer employees. The distribution of employment totals in the observed sample of establishments will be skewed towards relatively larger values as compared to the population of establishments. An important area of modeling inference is to understand industry-indexed differences in monthly employment trends and correlations among industries in the population. We would use a mixed effects model, parameterized with random effects indexed by industry and month. Estimation of the population distribution under our model from the observed sample will be biased absent some correction for the skewness in the sample towards larger-sized establishments.

Example 3: BLS collects establishment-indexed employment totals in both the Quarterly Census of Employment and Wages (QCEW) and the CES survey. CES survey participants also provide submissions to the QCEW, such that their reported monthly employment totals for an overlapping time period of interest should be equal between the two instruments, but they are not for approximately 10000 establishments, indicating one or more employment count submission errors for those respondents.
A response variable of interest, termed the error time series, was created by taking the absolute value of the difference in reported employment totals among the 10000 establishments for each month over a 12 month period. A response analysis survey (RAS) of approximately 2000 establishments was taken from this population with the goal to understand the process drivers for committing errors so that BLS may target resources to establishments that mitigate them. The modeling focus is to identify probabilistic clusters of establishments with similar error patterns over the 12 month period and to examine the process by which establishments in each cluster construct their data submissions to BLS. The RAS survey design stratified the population of 10000 establishments based on phenomena of interest expressed in portions of each time series; for example, a big jump in the reported difference at year-end may indicate establishments who count checks that include regular pay and bonuses for each employee, instead of counting employees. Higher inclusion probabilities were assigned to those strata expressing phenomena of relatively greater interest to BLS researchers. Modeling the number of and memberships in probabilistic clusters of error patterns expressed in the population from the RAS sample may be biased because the proportions of error patterns expressed in the sample are designed to be different from the population.

Example 4: The Current Expenditure (CE) survey is administered to U.S. households by BLS for the purpose of determining the amount of spending for a broad collection of goods and service categories and it serves as the main source used to construct the basket of goods later used to formulate the Consumer Price Index. The CE employs a multi-stage sampling design that draws clusters of core-based statistical areas (CBSAs), such as metropolitan and micropolitan areas, from which Census blocks and, ultimately, households are sampled.
Economists desire to model the propensity or probability of purchase for a variety of goods and services. The balance of sampled clusters may not be reflective of those in the population; for example, if particularly high income areas are included in the sample. So inference on purchase propensities for the population made from the observed sample will be biased absent correction for the informative sampling design.

Example 5: BLS administers the Job Openings and Labor Turnover survey (JOLTS) to business establishments with the focus to measure labor market dynamics by reporting the number of job openings, hires and separations, which is a leading indicator for employment trends. The sampling design assigns larger inclusion probabilities to establishments with relatively more employees because larger establishments drive the variance in the reported statistics. Our modeling goals are to understand differences in labor force dynamics based on employment ownership (e.g., private, public) and region as part of imputing missing values with respect to the population generating distribution. As with the CES sampling design, however, our sample will tend to overrepresent relatively larger-sized establishments, so that inference and imputation using the sample will be biased for the population. We develop a multivariate count data population generating model in Section 4, where we illustrate the resulting estimation bias from failure to account for the correlations between assigned inclusion probabilities and the response variables of interest for our sample.

The target audience for this article is data analysts who wish to perform some distributional inference using data obtained from an informative sample design on a population using a model they specify, $p(y_i \mid \lambda)$, $\lambda \in \Lambda$, for density, $p$. We discuss, in the next section, how the limited literature on this topic does not adequately provide a general method for making distributional inference on a population while adjusting for the unequal probabilities of selection. In this article, we propose an approach that replaces the likelihood with the pseudo likelihood (Chambers & Skinner 2003), $p(y_i \mid \delta_i = 1, \lambda)^{w_i}$, using sampling weight, $w_i \propto 1/\pi_i$. This re-weights the likelihood contribution for each observed unit with intent to re-balance the information in the observed sample to approximate the balance of information in the target finite population, correcting for the informativeness. We show that the proposed method for Bayesian estimation on complex sample data allows for asymptotically consistent inference on any population-generating model specified by the data analyst. Additionally, this method does not require information about the complex design, other than the probabilities of selection, or about the full population, other than the observed data. We believe this makes the method applicable to more situations. Indeed, it is often the case that the data analyst does not have access to the full design information or auxiliary variables on the population, $z_1, \ldots, z_N$, used to assign the probabilities of selection $\pi_1, \ldots, \pi_N$. However, it is common for the probabilities of selection for the units in the sample, $\pi_1, \ldots, \pi_n$, to be provided with the observed sample data.

1.2. Review of Methods to Account for Informative Sampling

One current approach is to account for the informativeness by parameterizing the sampling design into the model (Little 2004). Parameterizing even a simple informative design is often difficult to accomplish and may disrupt desired inference by requiring a change to the underlying population model parameterization.
The analyst in Example 3, above, desires to perform inference on an a priori unknown clustering of sampled units with their population model for data acquired under a stratified sampling design. Specifying random effects to be indexed by strata will likely conflict with the identification and composition of inferred clusters. Further, the data analyst may not have access to the sampling design, but only indirect information in form of sampling weights. Lastly, the analyst is sometimes required to impute the unobserved units in the finite population, which may be computationally infeasible. Another approach incorporates the sampling weights into inference about the population, as is our intent, but requires a particular form for the likelihood that does not allow the analyst to impose their own population model formulation of inferential interest. For example, Dong et al. (2014) specifies an empirical likelihood, while Kunihama et al. (2014) constructs a non-parametric mixture for the likelihood and Rao & Wu (2010) uses a sampling-weighted pseudo empirical likelihood. All of these approaches impose Dirichlet distribution priors for the mixture components with hyperparameters specified as a function of the first-order sampling weights. Si et al. (2015) regress the response variable on a Gaussian process function of the weights for sampling designs where sub-groups of sampled units have equal weights (e.g., a stratified sampling design). These approaches are designed for inference about simple mean and total statistics, rather than inference for parameters that characterize an analyst-specified population model that is the focus for our proposed method. One method that uses a plug-in estimator, as do we in our method, is to construct a joint likelihood of the population distribution and sample inclusion in a simple logistic regression model (Malec et al. 1999). This allows one to analytically marginalize over the parameters indexed by the non-sampled units.
This approach is limited in application to a class of simple population models that permit analytic integration and may not be applied to more general classes of Bayesian models for the population that we envision in development of our approach. Perhaps the most general Bayesian approach constructs models to co-estimate parameters for conditional expectations of inclusion probabilities jointly with the population-generating model parameters at each level of a hierarchical construction (Pfeffermann et al. 2006). This formulation is fully Bayesian such that it accounts for all sources of uncertainty in population generation and inclusion of units, but requires a custom implementation of an MCMC sampler for each specified population model, such as their simple two-level linear regression model. The implementations may increase the complexity of the specified model and reduce the quality of posterior mixing in the MCMC, so that they are suitable for relatively simple population probability models.

The method we propose is intended to allow Bayesian inference from any population model that may be specified by the data analyst under an informative sampling design, unlike the alternative methods. It provides asymptotically unbiased estimation using only the distribution for the observed sample units and normalized Hájek-like sampling weights. The plug-in type method accounts for the informative sampling design by raising the likelihood contribution of each sampled observation to the power of their associated sampling weight. The implementation of the plug-in procedure for Bayesian estimation multiplies the sampling weight into each full conditional log-posterior density. This can then be sampled in the typical sequential scan MCMC. Unlike these other methods that are prominent in the literature, this method:

1. does not impose a population model implicitly or explicitly, unlike the most recently-developed methods (Dong et al. 2014, Kunihama et al. 2014, Rao & Wu 2010, Si et al. 2015);
2. requires only the sampling weights and does not require parameterizing the sampling design, unlike Little (2004);
3. does not require a customized MCMC sampling procedure, unlike Pfeffermann et al. (2006), so can be done automatically;
4. does not require imputing the non-sampled units in the finite population.

Our data application and estimation model in the sequel are intended to be representative of common problems for Bayesian inference, and the application data are not readily estimated with these other methods that account for informative sampling. We formulate the pseudo-posterior density as a sampling weight-adjusted plug-in from which we conduct model inference about the population under a dependent, informative sampling design in Section 2. Conditions are constructed that guarantee a frequentist $L_1$ contraction of the pseudo posterior distribution on the true generating distribution in Section 3. We make an application of the pseudo posterior estimator to construct a regression model for count data using a dataset of monthly job hires and separations collected by the U.S. Bureau of Labor Statistics in Section 4. We reveal large differences for parameter estimates between incorporation versus ignoring the sampling weights.
This section also includes a simulation study that compares the pseudo posterior estimated on the observed sample to the posterior estimated on the entire finite population. The paper concludes with a discussion in Section 5. The proofs for the main result, along with two enabling results, are contained in an Appendix.

2. Method to account for Informative Sampling

We begin by constructing the pseudo likelihood and associated pseudo posterior density under any analyst-specified prior formulation on the model, $\lambda \in \Lambda$.

2.1. Pseudo Posterior

Suppose there exists a Lebesgue measurable population-generating density, $\pi(y \mid \lambda)$, indexed by parameters, $\lambda \in \Lambda$. Let $\delta_i \in \{0,1\}$ denote the sample inclusion indicator for units $i = 1,\ldots,N$ from the population under sampling without replacement. The density for the observed sample is denoted by $\pi(y_o \mid \lambda) = \pi(y \mid \delta_i = 1, \lambda)$, where "o" indicates "observed". The plug-in estimator for posterior density under the analyst-specified model for $\lambda \in \Lambda$ is

$$\hat{\pi}(\lambda \mid y_o, \tilde{w}) \propto \left[ \prod_{i=1}^{n} p(y_{o,i} \mid \lambda)^{\tilde{w}_i} \right] \pi(\lambda), \qquad (1)$$

where $\prod_{i=1}^{n} p(y_{o,i} \mid \lambda)^{\tilde{w}_i}$ denotes the pseudo likelihood for observed sample responses, $y_o$. The joint prior density on model space assigned by the analyst is denoted by $\pi(\lambda)$. This pseudo likelihood employs sampling weights, $\{\tilde{w}_i \propto 1/\pi_i\}$, constructed to be inversely proportional to unit inclusion probabilities. Each sampling weight assigns the relative importance of the likelihood contribution for each sample observation to approximate the likelihood for the population. We use $\hat{\pi}$ to denote the noisy approximation to posterior distribution, $\pi$, and we make note that the approximation is based on the data, $y_o$, and sampling weights, $\{\tilde{w}\}$, confined to those units included in the sample, $S$. The total estimated posterior variance is regulated by the sum of the sampling weights. We define unnormalized weights, $\{w_i = 1/\pi_i\}$, and subsequently normalize them,

$$\tilde{w}_i = \frac{w_i}{\sum_{i=1}^{n} w_i / n}, \quad i = 1,\ldots,n,$$

to sum to the sample size, $n$, the asymptotic units of information in the sample.
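As an illustration of Equation (1) and the weight normalization, the following is a minimal sketch for a case where the pseudo posterior has a closed form. It assumes a Poisson likelihood with a Gamma(a, b) prior, so that raising each likelihood term to the power $\tilde{w}_i$ yields a Gamma pseudo posterior with shape $a + \sum_i \tilde{w}_i y_i$ and rate $b + \sum_i \tilde{w}_i = b + n$. The model, the simulated population, the independent-inclusion design, and the function names are illustrative assumptions, not the count data model of Section 4.

```python
import numpy as np

def normalize_weights(pi):
    """Weights proportional to 1/pi, rescaled so they sum to the sample size n."""
    w = 1.0 / np.asarray(pi, dtype=float)
    return w * len(w) / w.sum()

def gamma_pseudo_posterior(y, pi, a=1.0, b=1.0):
    """Shape and rate of the pseudo posterior for a Poisson rate (assumed toy model)."""
    w_tilde = normalize_weights(pi)
    shape = a + np.sum(w_tilde * y)   # each log-likelihood term is multiplied by its weight
    rate = b + np.sum(w_tilde)        # equals b + n because the weights sum to n
    return shape, rate

rng = np.random.default_rng(0)
N, lam = 20000, 3.0
y_pop = rng.poisson(lam, size=N)

# Hypothetical informative design: inclusion probability increases with the response.
pi_pop = np.clip(0.02 * (1 + y_pop), 0.0, 1.0)
delta = rng.random(N) < pi_pop        # independent inclusion draws
y_obs, pi_obs = y_pop[delta], pi_pop[delta]

shape_w, rate_w = gamma_pseudo_posterior(y_obs, pi_obs)
shape_u, rate_u = gamma_pseudo_posterior(y_obs, np.ones_like(pi_obs))  # ignores the design

print("pseudo posterior mean (weighted):", shape_w / rate_w)
print("posterior mean (unweighted):     ", shape_u / rate_u)
```

Under this assumed design the unweighted posterior mean is pulled above the population rate because high-count units are over-represented, while the weighted version should sit close to it.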
Incorporation of the sampling weights to formulate the pseudo posterior estimator is expected to increase the estimated parameter posterior variances relative to the unweighted posterior estimated on a simple random (non-informative) sample because the weights encode the uncertainty with which samples represent the finite population under repeated sampling. This increase in estimated posterior variance may be partly or wholly offset to the extent that the informative sampling design is more efficient than simple random sampling; for example, a stratified sampling design that takes simple random samples within each stratum may produce samples that provide better coverage of the population. Although our method utilizes the weights as a plug-in, rather than imposing a prior, Pfeffermann & Sverchkov (2009) use Bayes rule to demonstrate one may replace the weights with their conditional expectation given the observed response to correct for informative sampling. Replacing the raw weights with their conditional expectation given the observed response may serve to reduce the total variation attributed to weighting and the resulting posterior uncertainty in the case where the actual sampled observations express information in different proportions than intended in the sampling design. Even though the conditional distribution of the weights given the response is generally different for the observed sample than for the population, nevertheless their conditional expectations are equal.

3. Pseudo Posterior Consistency

We formulate a pseudo posterior distribution in this section and specify conditions under which it contracts on the true generating distribution in $L_1$. Let $\nu \in \mathbb{Z}^{+}$ index a sequence of finite populations, $\{U_\nu\}_{\nu = 1,\ldots,N_\nu}$, each of size, $|U_\nu| = N_\nu$, such that $N_\nu < N_{\nu'}$, for $\nu < \nu'$, so that the finite population size grows as $\nu$ increases. Suppose that $X_{\nu,1},\ldots,X_{\nu,N_\nu}$ are independently distributed according to some unknown distribution $P$, with density, $p$, defined on the sample space, $(\mathcal{X}, \mathcal{A})$. If $\Pi$ is a prior distribution on the model space, $(\mathcal{P}, \mathcal{C})$, to which $P$ is known to belong, then the posterior distribution is given by

$$\Pi(B \mid X_1,\ldots,X_{N_\nu}) = \frac{\int_{P \in B} \prod_{i=1}^{N_\nu} \frac{p}{p_0}(X_i)\, d\Pi(P)}{\int_{P \in \mathcal{P}} \prod_{i=1}^{N_\nu} \frac{p}{p_0}(X_i)\, d\Pi(P)}, \qquad (2)$$

for any $B \in \mathcal{C}$, where we refer to $\{X_{\nu,i}\}_{i=1,\ldots,N_\nu}$ as $\{X_i\}_{i=1,\ldots,N}$ for readability when the context is clear. Ghosal & van der Vaart (2007) study the rate at which this posterior distribution converges to the assumed true (and fixed) generating distribution $P_0$.
They prove, under certain conditions on the model space, $\mathcal{P}$, and the prior distribution, $\Pi$, that in $P_0$ probability, the posterior distribution concentrates on an arbitrarily small neighborhood of $P_0$ as $N_\nu \uparrow \infty$. The observed data on which we focus is not the entire finite population, $X_1,\ldots,X_{N_\nu}$, but rather a sample, $X_1,\ldots,X_{n_\nu}$, with $n_\nu \leq N_\nu$, drawn under a sampling design distribution applied to the finite population under which each unit, $i \in (1,\ldots,N_\nu)$, is assigned a probability of inclusion in the sample. These unit inclusion probabilities are constructed to depend on the realized finite population values, $X_1,\ldots,X_{N_\nu}$, at each $\nu$.

3.1. Pseudo Posterior Distribution

A sampling design is defined by placing a known distribution on a vector of inclusion indicators, $\delta_\nu = (\delta_1,\ldots,\delta_{N_\nu})$, linked to the units comprising the population, $U_\nu$. The sampling distribution is subsequently used to take an observed random sample of size $n_\nu \leq N_\nu$. Our conditions needed for the main result employ known marginal unit inclusion probabilities, $\pi_i = \Pr\{\delta_i = 1\}$ for all $i \in U_\nu$, and the second-order pairwise probabilities, $\pi_{ij} = \Pr\{\delta_i = 1 \cap \delta_j = 1\}$ for $i, j \in U_\nu$, which are obtained from the joint distribution over $(\delta_1,\ldots,\delta_{N_\nu})$. The dependence among unit inclusions in the sample contrasts with the usual iid draws from $P$. We denote the sampling distribution by $P_\nu$. Under informative sampling, the marginal inclusion probabilities, $\pi_i = P\{\delta_i = 1\}$, $i \in (1,\ldots,N_\nu)$, are formulated to depend on the finite population data values, $\mathbf{X}_{N_\nu} = (X_1,\ldots,X_{N_\nu})$. Since the resulting balance of information would be different in the sample, the posterior distribution for $(X_1\delta_1,\ldots,X_{N_\nu}\delta_{N_\nu})$, that we employ for inference about $P_0$, is not equal to that of Equation 2. Our task is to perform inference about the population generating distribution, $P_0$, using the observed data taken under an informative sampling design.
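To make the marginal and pairwise inclusion probabilities concrete, here is a small sketch that obtains $\pi_i$ and $\pi_{ij}$ by enumerating every possible sample under a known design. The five-unit population, the design (each size-2 sample is drawn with probability proportional to the total of the values it contains), and the helper name `inclusion_probabilities` are hypothetical choices made only for illustration.

```python
from itertools import combinations
import numpy as np

def inclusion_probabilities(N, design):
    """Marginal (pi_i) and pairwise (pi_ij) inclusion probabilities of a design.

    `design` maps each possible sample (a tuple of unit indices) to its probability.
    """
    pi = np.zeros(N)
    pi2 = np.zeros((N, N))
    for sample, prob in design.items():
        for i in sample:
            pi[i] += prob                 # Pr{delta_i = 1}
            for j in sample:
                if i != j:
                    pi2[i, j] += prob     # Pr{delta_i = 1 and delta_j = 1}
    return pi, pi2

N, n = 5, 2
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])  # hypothetical finite population values

# Hypothetical informative design: sample probability depends on the values it contains.
samples = list(combinations(range(N), n))
raw = np.array([x[list(s)].sum() for s in samples])
design = dict(zip(samples, raw / raw.sum()))

pi, pi2 = inclusion_probabilities(N, design)
print("marginal pi_i:", pi)               # the unit with the largest x has the largest pi_i
print("sum of pi_i:  ", pi.sum())         # equals n for a fixed-size design
```

Because the design depends on the population values, units with larger values are more likely to be selected, which is the sense in which the design is informative.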
We account for informative sampling by "undoing" the sampling design with the weighted estimator,

$$p^{\pi}(X_i \delta_i) := p(X_i)^{\delta_i/\pi_i}, \quad i \in U_\nu, \qquad (3)$$

which weights each density contribution, $p(X_i)$, by the inverse of its marginal inclusion probability. This construction re-weights the likelihood contributions defined on those units randomly-selected for inclusion in the observed sample $(\{i \in U_\nu : \delta_i = 1\})$ to approximate the balance of information in $U_\nu$. This approximation for the population likelihood produces the associated pseudo posterior,

$$\Pi^{\pi}(B \mid X_1\delta_1,\ldots,X_{N_\nu}\delta_{N_\nu}) = \frac{\int_{P \in B} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}(X_i\delta_i)\, d\Pi(P)}{\int_{P \in \mathcal{P}} \prod_{i=1}^{N_\nu} \frac{p^{\pi}}{p_0^{\pi}}(X_i\delta_i)\, d\Pi(P)}, \qquad (4)$$

that we use to achieve our required conditions for the rate of contraction of the pseudo posterior distribution on $P_0$. We recall that both $P$ and $\delta_\nu$ are random variables defined on the space of measures and possible samples, respectively. Additional conditions are later formulated for the distribution over samples, $P_\nu$, drawn under the known sampling design, to achieve contraction of the pseudo posterior on $P_0$. We assume measurability for the sets on which we compute prior, posterior and pseudo posterior probabilities on the joint product space, $\mathcal{X} \times \mathcal{P}$. For brevity, we use the superscript, $\pi$, to denote the dependence on the known sampling probabilities, $\{\pi_i\}_{i=1,\ldots,N_\nu}$; for example,

$$\Pi^{\pi}(B \mid X_1\delta_1,\ldots,X_{N_\nu}\delta_{N_\nu}) := \Pi(B \mid (X_1\delta_1,\ldots,X_{N_\nu}\delta_{N_\nu}), \pi_1,\ldots,\pi_{N_\nu}).$$

Our main result is achieved in the limit as $\nu \uparrow \infty$, under the countable set of successively larger-sized populations, $\{U_\nu\}_{\nu \in \mathbb{Z}^{+}}$. We define the associated rate of convergence notation, $\mathcal{O}(b_\nu)$, to denote $\lim_{\nu \uparrow \infty} \mathcal{O}(b_\nu)/b_\nu = 0$.

3.2. Empirical process functionals

We employ the empirical distribution approximation for the joint distribution over population generation and the draw of an informative sample that produces our observed data to formulate our results. Our empirical distribution construction follows Breslow & Wellner (2007) and incorporates inverse inclusion probability weights, $\{1/\pi_i\}_{i=1,\ldots,N_\nu}$, to account for the informative sampling design,

$$\mathbb{P}^{\pi}_{N_\nu} = \frac{1}{N_\nu} \sum_{i=1}^{N_\nu} \frac{\delta_i}{\pi_i}\, \delta(X_i), \qquad (5)$$

where $\delta(X_i)$ denotes the Dirac delta function, with probability mass 1 on $X_i$, and we recall that $N_\nu = |U_\nu|$ denotes the size of the finite population. This construction contrasts with the usual empirical distribution, $\mathbb{P}_{N_\nu} = \frac{1}{N_\nu} \sum_{i=1}^{N_\nu} \delta(X_i)$, used to approximate $P \in \mathcal{P}$, the distribution hypothesized to generate the finite population, $U_\nu$.
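The weighted empirical distribution in Equation (5) can be sketched numerically by applying it to a function $f$, giving $\frac{1}{N_\nu}\sum_i \frac{\delta_i}{\pi_i} f(X_i)$, and comparing the result with the unweighted population average and with the naive average over sampled units. The simulated population, the size-based inclusion probabilities, and the use of independent inclusion draws are assumptions made for this illustration only.

```python
import numpy as np

def weighted_empirical_mean(f_values, delta, pi):
    """(1/N) * sum_i (delta_i / pi_i) * f(X_i), summed over all N population units."""
    f_values = np.asarray(f_values, dtype=float)
    return np.sum(delta / pi * f_values) / len(f_values)

rng = np.random.default_rng(1)
N = 50000
x = rng.gamma(shape=2.0, scale=1.0, size=N)   # hypothetical finite population
pi = np.clip(0.05 * x, 0.01, 1.0)             # hypothetical size-based inclusion probabilities
delta = (rng.random(N) < pi).astype(float)    # independent inclusion indicators

p_N = x.mean()                                 # usual empirical functional with f(x) = x
p_pi_N = weighted_empirical_mean(x, delta, pi) # sampling-weighted functional
naive = x[delta == 1].mean()                   # unweighted average of the sampled units

print("population average:       ", p_N)
print("weighted sample functional:", p_pi_N)
print("naive sample average:      ", naive)
```

Only units with $\delta_i = 1$ contribute to the weighted sum, yet dividing by $\pi_i$ brings the result close to the full-population average, whereas the naive sample average is inflated by the over-sampling of large units.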
We follow the notational convention of Ghosal et al. (2000) and define the associated expectation functionals with respect to these empirical distributions by $\mathbb{P}^{\pi}_{N_\nu} f = \frac{1}{N_\nu}\sum_{i=1}^{N_\nu}\frac{\delta_i}{\pi_i} f(X_i)$. Similarly, $\mathbb{P}_{N_\nu} f = \frac{1}{N_\nu}\sum_{i=1}^{N_\nu} f(X_i)$. Lastly, we use the associated centered empirical processes, $\mathbb{G}^{\pi}_{N_\nu} = \sqrt{N_\nu}\,(\mathbb{P}^{\pi}_{N_\nu} - P_0)$ and $\mathbb{G}_{N_\nu} = \sqrt{N_\nu}\,(\mathbb{P}_{N_\nu} - P_0)$.

The sampling-weighted, average pseudo Hellinger distance between distributions, $P_1, P_2 \in \mathcal{P}$, is

$$d^{\pi,2}_{N_\nu}(p_1,p_2) = \frac{1}{N_\nu}\sum_{i=1}^{N_\nu}\frac{\delta_i}{\pi_i}\,d^{2}(p_1(X_i),p_2(X_i)), \quad \text{where } d(p_1,p_2) = \left[\int\left(\sqrt{p_1}-\sqrt{p_2}\right)^{2}d\mu\right]^{1/2}$$

for dominating measure, $\mu$. We need this empirical average distance metric because the observed sample data drawn from the finite population under $P_\nu$ are no longer independent. The implication is that our results apply to finite populations generated as inid (independent, not identically distributed) from which informative samples are taken. The associated non-sampling Hellinger distance is specified with $d^{2}_{N_\nu}(p_1,p_2) = \frac{1}{N_\nu}\sum_{i=1}^{N_\nu} d^{2}(p_1(X_i),p_2(X_i))$.

3.3. Main result

We proceed to construct associated conditions and a theorem that contain our main result on the consistency of the pseudo posterior distribution under a class of informative sampling designs at the true generating distribution, $P_0$. Our approach extends the main in-probability convergence result of Ghosal & van der Vaart (2007) by adding new conditions that restrict the distribution of the informative sampling design. Suppose we have a sequence, $\xi_{N_\nu} \downarrow 0$ with $N_\nu\xi^{2}_{N_\nu} \uparrow \infty$ and $n_\nu\xi^{2}_{N_\nu} \uparrow \infty$ as $\nu \in \mathbb{Z}^{+} \uparrow \infty$, and any constant, $C > 0$,

(A1) (Local entropy condition - Size of model)
$$\sup_{\xi > \xi_{N_\nu}} \log N\left(\xi/36, \{P \in \mathcal{P}_{N_\nu} : d_{N_\nu}(P,P_0) < \xi\}, d_{N_\nu}\right) \le N_\nu\xi^{2}_{N_\nu},$$

(A2) (Size of space)
$$\Pi(\mathcal{P}\setminus\mathcal{P}_{N_\nu}) \le \exp\left(-N_\nu\xi^{2}_{N_\nu}\left(2(1+2C)\right)\right)$$

(A3) (Prior mass covering the truth)
$$\Pi\left(P : -P_0\log\frac{p}{p_0} \le \xi^{2}_{N_\nu} \,\cap\, P_0\left[\log\frac{p}{p_0}\right]^{2} \le \xi^{2}_{N_\nu}\right) \ge \exp\left(-N_\nu\xi^{2}_{N_\nu}C\right)$$

(A4) (Non-zero Inclusion Probabilities)
$$\sup_{\nu}\left[\frac{1}{\min_{i \in U_\nu}\pi_i}\right] \le \gamma, \quad \text{with } P_0\text{-probability } 1.$$

(A5) (Asymptotic Independence Condition)
$$\limsup_{\nu \uparrow \infty}\max_{i \ne j \in U_\nu}\left|\frac{\pi_{ij}}{\pi_i\pi_j} - 1\right| = \mathcal{O}(N_\nu^{-1}), \quad \text{with } P_0\text{-probability } 1,$$
such that for some constant, $C_3 > 0$,
$$N_\nu\sup_{\nu}\max_{i \ne j \in U_\nu}\left[\frac{\pi_{ij}}{\pi_i\pi_j} - 1\right] \le C_3, \quad \text{for } N_\nu \text{ sufficiently large.}$$

(A6) (Constant Sampling fraction) For some constant, $f \in (0,1)$, that we term the "sampling fraction",
$$\limsup_{\nu}\left|\frac{n_\nu}{N_\nu} - f\right| = \mathcal{O}(1), \quad \text{with } P_0\text{-probability } 1.$$

Condition (A1) denotes the logarithm of the covering number, defined as the minimum number of balls of radius $\xi/36$ needed to cover $\{P \in \mathcal{P}_{N_\nu} : d_{N_\nu}(P,P_0) < \xi\}$ under distance metric, $d_{N_\nu}$. This condition restricts the growth in the size of the model space, or as noted by Ghosal et al. (2000), the space, $\mathcal{P}_{N_\nu}$, must be "not too big" in order that the condition specifies an optimal convergence rate (Wong & Shen, 1995). This condition guarantees the existence of test statistics, $\phi_{n_\nu}(X_1\delta_1,\ldots,X_{N_\nu}\delta_{N_\nu}) \in (0,1)$, needed for enabling Lemma B.1, stated in the Appendix, that bounds the expectation of the pseudo posterior mass assigned on the set $\{P \in \mathcal{P}_{N_\nu} : d^{\pi}_{N_\nu}(P,P_0) \ge \xi_{N_\nu}\}$. Condition (A3) ensures the prior, $\Pi$, assigns mass to convex balls in the vicinity of $P_0$. Conditions (A1) and (A3), together, define the minimum value of $\xi_{N_\nu}$, where if these conditions are satisfied for some $\xi_{N_\nu}$, then they are also satisfied for any $\xi > \xi_{N_\nu}$. Condition (A2) allows, but restricts, the prior mass placed on the uncountable portion of the model space, such that we may direct our inference to an approximating sieve, $\mathcal{P}_{N_\nu}$. This sequence of spaces "trims" away a portion of the space that is not entropy bounded in condition (A1). In practice, trimming the space may usually be performed to ensure the entropy bound.
The next three new conditions impose restrictions on the sampling design and associated known distribution, $P_\nu$, used to draw the observed sample data that, together, define a class of allowable sampling designs on which the contraction result for the pseudo posterior is guaranteed. Condition (A4) requires the sampling design to assign a positive probability for inclusion of every unit in the population because the restriction bounds the sampling inclusion probabilities away from 0. Since the maximum inclusion probability is 1, the bound satisfies $\gamma \ge 1$. No portion of the population may be systematically excluded, which would prevent a sample of any size from containing information about the population from which the sample is taken. Condition (A5) restricts the result to sampling designs where the dependence among lowest-level sampled units attenuates to 0 as $\nu \uparrow \infty$; for example, a two-stage sampling design of clusters within strata would meet this condition if the number of population units nested within each cluster from which the sample is drawn increases in the limit of $\nu$. Such would be the case in a survey of households within each cluster if the cluster domains are geographically defined and would grow in area as $\nu$ increases. In this case of increasing cluster area, the dependence among the inclusion of any two households in a given cluster would decline as the number of households increases with the size of the area defined for that cluster. Condition (A6) ensures that the observed sample size, $n_\nu$, limits to $\infty$ along with the size of the partially-observed finite population, $N_\nu$.
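To make the three design conditions concrete, the sketch below evaluates them for simple random sampling without replacement. This design is an assumed example chosen because its first- and second-order inclusion probabilities have closed forms; it is not the JOLTS design, and the function name `srs_design_summary` is hypothetical.

```python
def srs_design_summary(N, f):
    """Design quantities for simple random sampling without replacement (assumed example)."""
    n = int(round(f * N))
    pi_i = n / N                                 # first-order inclusion probability
    pi_ij = n * (n - 1) / (N * (N - 1))          # joint inclusion probability for i != j
    gamma = 1.0 / pi_i                           # (A4): bound on 1 / min_i pi_i
    dependence = abs(pi_ij / (pi_i * pi_i) - 1)  # (A5): |pi_ij / (pi_i pi_j) - 1|
    return {
        "n": n,
        "gamma": gamma,
        "N_times_dependence": N * dependence,    # should stay bounded as N grows
        "fraction": n / N,                       # (A6): sampling fraction
    }

for N in (1_000, 100_000, 10_000_000):
    print(N, srs_design_summary(N, f=0.1))
```

For this design, $N\,|\pi_{ij}/(\pi_i\pi_j) - 1| = N(N-n)/(n(N-1))$, which approaches $(1-f)/f$ as $N$ grows with $n = fN$, so the dependence term is of order $N^{-1}$ and the sampling fraction is constant.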

Theorem 3.1. Suppose conditions (A1)-(A6) hold. Then for sets $\mathcal{P}_{N_\nu} \subset \mathcal{P}$, constants, $K > 0$, and $M$ sufficiently large,

$$E_{P_0,P_\nu}\,\Pi^{\pi}\left(P : d^{\pi}_{N_\nu}(P,P_0) \ge M\xi_{N_\nu} \mid X_1\delta_1,\ldots,X_{N_\nu}\delta_{N_\nu}\right) \le \frac{16\gamma^{2}\left[\gamma + C_3\right]}{\left(Kf + 1 - 2\gamma\right)^{2}N_\nu\xi^{2}_{N_\nu}} + 5\gamma\exp\left(-\frac{Kn_\nu\xi^{2}_{N_\nu}}{2\gamma}\right), \tag{6}$$

which tends to 0 as $(n_\nu, N_\nu) \uparrow \infty$.

We note that the rate of convergence is injured for a sampling distribution, $P_\nu$, that assigns relatively low inclusion probabilities to some units in the finite population such that $\gamma$ will be relatively larger. Samples drawn under a design that expresses a large variability in the sampling weights will express more dispersion in their information similarity to the underlying finite population. Similarly, the larger the dependence among the finite population unit inclusions induced by $P_\nu$, the higher will be $C_3$ and the slower will be the rate of contraction. The separability of the conditions on $\mathcal{P}$ and $\Pi(P)$, on the one hand, from those on the sampling design distribution, $P_\nu$, on the other hand, coupled with the sequential process of taking the observed sample from the finite population reveal that the pseudo posterior, defined on the partially-observed sample from a population, contracts on $P_0$ through converging to the posterior distribution defined on each fully-observed population. We demonstrate this property of the pseudo posterior in a simulation study conducted in Section 4.1. By contrast, if the posterior distribution, defined on each fully-observed finite population, fails to meet conditions (A1), (A2) and (A3) for the main result from Equation 6, such that it fails to contract on $P_0$, then the associated pseudo posterior cannot contract on $P_0$, even if the sampling design satisfies conditions (A4), (A5) and (A6).

The proof generally follows that of Ghosal et al. (2000) with substantial modification to account for informative sampling. The $L_1$ rate of contraction of the pseudo posterior distribution with respect to the joint distribution for population generation and the taking of informative samples is derived.
Our approach includes two unique enabling results. Please see Appendix sections A and B for details.

4. Application

We construct a model for count data and perform inference on survey responses collected by the Job Openings and Labor Turnover Survey (JOLTS), introduced in Example 5 of Section 1.1, which is administered by BLS on a monthly basis to a randomly-selected sample from a frame composed of non-agricultural U.S. private business and public establishments. JOLTS focuses on the demand side of U.S. labor force dynamics and measures job hires, separations (e.g. quits, layoffs and discharges) and openings. The JOLTS sampling design assigns inclusion probabilities (under sampling without replacement) to establishments to be proportional to the number of employees for each establishment, as obtained from the Quarterly Census of Employment and Wages (QCEW). This design is informative in that the number of employees for an establishment will generally be correlated with the number of hires, separations and openings. We perform our modeling analysis on a May, 2012 data set of n = 8595 responding establishments.

We begin by specifying a finite population regression probability model from which we formulate the sampling-weighted pseudo posterior joint distribution that we use to make inference on model parameters from the population generating distribution with only the observed sample of a finite population. We demonstrate that failing to incorporate sampling weights (e.g. by estimating the posterior distribution defined for the finite population on the observed sample) produces large differences in estimates of parameters. Our regression model defines a multivariate response as the number of job hires (Hires) for the first response variable and total separations (Seps) as the second response variable.
We construct a single multivariate model, as contrasted with the specification of two univariate models, because these variables of interest tend to be highly correlated such that we expect the regression parameters to express dependence; for example, these two variables are correlated at 60% in our May 2012 dataset. We formulate a model for count data that accommodates the high degree of over-dispersion expressed in our establishment-indexed multivariate responses due to the large employment size differences across the establishments. Were we working with domain-indexed (e.g., by state or county) responses, we might consider using a Gaussian approximation for the count data likelihood, but such is not appropriate for us due to the presence of

many small-sized establishments. The modeling of count data outcomes is very typical for the analysis of BLS survey data for establishments focused on unemployment. We specify the following count data model for the population,

\begin{align}
y_{id} &\stackrel{\mathrm{ind}}{\sim} \mathrm{Pois}\left(\exp(\psi_{id})\right) \tag{7}\\
\underset{N \times D}{\Psi} &\sim \underset{N \times P}{X}\,\underset{P \times D}{B} + \mathcal{N}_{N \times D}\left(I_N, \Lambda^{-1}\right) \tag{8}\\
\underset{P \times D}{B} &\sim 0 + \mathcal{N}_{P \times D}\left(M^{-1}, [\tau_B\Lambda]^{-1}\right) \tag{9}\\
\Lambda &\sim \mathcal{W}_D\left(D+1, I_D\right) \tag{10}\\
\tau_B &\sim \mathcal{G}(1,1) \tag{11}\\
M &\sim \mathcal{W}_P\left(P+1, I_P\right), \tag{12}
\end{align}

where $i = 1,\ldots,N$ indexes the establishments and $d = 1,\ldots,D$ indexes the dimensions of the multivariate response, $Y$ (each $y_i$ is $D \times 1$). The $N \times D$ log-mean, $\Psi = (\psi_1,\ldots,\psi_N)'$, may be viewed as a latent response whose columns index the number of job hires (Hires) and total separations (Seps) under our JOLTS application, so that $D = 2$. The number of predictors in the design matrix, $X$, is denoted by $P$, and $B$ is the unknown matrix of population coefficients that serves as the focus for our inference. Our model is formulated as a multivariate Poisson-lognormal model, under which the Gaussian prior of Equation 8 for the logarithm of the Poisson mean allows for over-dispersion of different degrees in each of the $D$ dimensions.

The priors in Equation 8 and Equation 9 are formulated in matrix variate (or, more generally, tensor product) Gaussian distributions using the notation of Dawid (1981); for example, the prior for the $P \times D$ matrix of coefficients, $B$, assigns the $P \times D$ mean 0 for a Gaussian distribution that employs a separable covariance structure where the $P \times P$, $M$, denotes the precision matrix for the columns of $B$, and the $D \times D$, $\tau_B\Lambda$, denotes the precision matrix for the rows. This prior formulation is the equivalent of assigning a $PD$ dimensional Gaussian prior to a vectorization of $B$ accomplished by stacking its columns with $PD \times PD$ precision matrix, $M \otimes \tau_B\Lambda$. See Hoff (2011) for more background. Precision matrices, $(M, \Lambda)$, each receive Wishart priors with hyperparameter values that impose uniform marginal prior distributions on the correlations (Barnard et al., 2000).
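The over-dispersion induced by the Gaussian prior on the log-mean can be seen by simulating from Equations 7 and 8. The sketch below is illustrative only: the dimensions, the coefficient matrix `B` and the precision `Lambda` are assumed values, not estimates from the JOLTS data, and the hyperpriors in Equations 9 to 12 are not sampled.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, D = 5_000, 3, 2                       # hypothetical sizes; the application has D = 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, P - 1))])
B = np.array([[1.0, 0.5],
              [0.3, 0.2],
              [-0.2, 0.1]])                 # illustrative P x D coefficients
Lambda = np.array([[4.0, -2.0],
                   [-2.0, 4.0]])            # illustrative D x D precision matrix

# Rows of the error matrix are independent N_D(0, Lambda^{-1}) draws.
cov = np.linalg.inv(Lambda)
E = rng.multivariate_normal(np.zeros(D), cov, size=N)
Psi = X @ B + E                             # Equation 8: latent log-mean
Y = rng.poisson(np.exp(Psi))                # Equation 7: observed counts

print("mean:", Y.mean(axis=0))
print("variance:", Y.var(axis=0))
```

For a Poisson response without the latent Gaussian layer the variance would match the mean; in this sketch the marginal variance of each column of `Y` exceeds its mean, which is the over-dispersion the model is designed to accommodate.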
We regress the multivariate latent response, $\Psi$, on predictors representing the logarithm of the overall establishment-indexed number of employees (Emp), obtained from the QCEW, the logarithm of the number of job openings (Opens), region (Northeast, South, West, Midwest (Midw)) and ownership type (Private, Federal Government, State Government (State), Local Government (Local)). We convert region and ownership type to binary indicators and leave out the Northeast region and Federal Government ownership to provide the baseline of a full-column rank predictor matrix. We summarize our regression model on the logarithm scale by:

$(\psi_{\mathrm{Hires}}, \psi_{\mathrm{Seps}}) \sim 1 + \mathrm{West} + \mathrm{Midw} + \mathrm{South} + \mathrm{State} + \mathrm{Local} + \mathrm{Private} + \log(\mathrm{Emp}) + \log(\mathrm{Opens}) + \mathrm{error}$,

where 1 denotes an intercept (Int).

Our population model is hypothesized to generate the finite population of the U.S. non-agricultural establishments, from which we have taken a sample of size n = 8595 for May, 2012 as our observations. For ease of reading, we will continue to use $Y$ and $X$ to next define the associated pseudo posterior, though each possesses $n < N$ rows representing the sampled observations in this context. The population model likelihood contribution for establishment, $i$, on dimension, $d$, is formed with the integration,

$$p(y_{id} \mid x_i,B,\Lambda) = \int_{\mathbb{R}} p(y_{id} \mid \psi_{id})\,p(\psi_{id} \mid x_i,B,\Lambda)\,d\psi_{id}. \tag{13}$$

Let the sampling weight be $w_i = 1/\pi_i$, with normalized weight $\tilde{w}_i = n\,w_i/\sum_{i=1}^{n} w_i$, such that the adjusted weights sum to $n$, the asymptotic amount of information contained in the sample under a sampling design that obeys condition (A5). The integrated likelihood induces the following pseudo likelihood,

$$p(y_{id} \mid x_i,B,\Lambda)^{\tilde{w}_i} = \left[\int_{\mathbb{R}} p(y_{id} \mid \psi_{id})\,p(\psi_{id} \mid x_i,B,\Lambda)\,d\psi_{id}\right]^{\tilde{w}_i}, \tag{14}$$

which is analytically intractable, so we perform the integration numerically in our MCMC using the prior for each $\psi_{id}$ exponentiated by the normalized sampling weight, $\tilde{w}_i$, which we use to construct its pseudo posterior distribution. Using Bayes rule we present the logarithm of the pseudo posteriors for the latent set of $D \times 1$ log-mean parameters, $\{\psi_i\}$, which are a posteriori independent over $i = 1,\ldots,n$, with,

\begin{align}
\log p^{\pi}(\psi_i \mid y_i,x_i,B,\Lambda) &\propto \sum_{d=1}^{D}\tilde{w}_i\log\left[\exp(\psi_{id})^{y_{id}} \times \exp\left(-\exp(\psi_{id})\right)\right] \tag{15a}\\
&\quad + \log\left[\mathcal{N}_D\left(\psi_i \mid B'x_i,\Lambda^{-1}\right)^{\tilde{w}_i}\right] \tag{15b}\\
&\propto \tilde{w}_i\sum_{d=1}^{D}\left[y_{id}\psi_{id} - \exp(\psi_{id})\right] - \frac{1}{2}\left(\psi_i - B'x_i\right)'\tilde{w}_i\Lambda\left(\psi_i - B'x_i\right), \tag{15c}
\end{align}

where we note in the second expression in Equation 15c that the sampling weights influence the prior precision for each $\psi_i$, such that a higher-weighted observation will exert relatively more influence on posterior inference because this observation is relatively more representative of the population. We take samples from the pseudo posterior distribution specified in Equation 15c in our MCMC using the elliptical slice sampler of Murray et al. (2010), where we draw $\psi_i \stackrel{\mathrm{ind}}{\sim} \mathcal{N}_D\left(B'x_i, [\tilde{w}_i\Lambda]^{-1}\right)$ and formulate a proposal as a convex combination (parameterized on an ellipse) of this draw from the prior and the value selected on the previous iteration of the MCMC. We evaluate each proposal using the weighted likelihood in the first expression of Equation 15c.

We next illustrate the construction of the pseudo posterior distribution for the $P \times D$ matrix of regression coefficients, $B$, which by D-separation is independent of the observations, $y_{id}$, given $\psi_{id}$,

\begin{align}
p^{\pi}(B \mid Y,X,\Psi,\Lambda,M,\tau_B) &\propto \left[\prod_{i=1}^{n}\mathcal{N}_D\left(\psi_i \mid B'x_i,\Lambda^{-1}\right)^{\tilde{w}_i}\right]\mathcal{N}_{P \times D}\left(B \mid M^{-1},[\tau_B\Lambda]^{-1}\right) \tag{16a}\\
\log p^{\pi}(B \mid Y,X,\Psi,\Lambda,M,\tau_B) &\propto \sum_{i=1}^{n}\left[\frac{\tilde{w}_i}{2}\log|\Lambda| - \frac{\tilde{w}_i}{2}\left(\psi_i - B'x_i\right)'\Lambda\left(\psi_i - B'x_i\right)\right] + \log\mathcal{N}_{P \times D}\left(B \mid M^{-1},[\tau_B\Lambda]^{-1}\right). \tag{16b}
\end{align}

In a Bayesian setting, the sum of the weights, $n = \sum_{i=1}^{n}\tilde{w}_i$, impacts the estimated posterior variance as we observe in Equation 16b.
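The sampling step described for $\psi_i$ can be sketched in a few lines. What follows is a generic implementation of one elliptical slice sampling update in the style of Murray et al. (2010), paired with the weighted Poisson log-likelihood from the first expression of Equation 15c and a prior with precision $\tilde{w}_i\Lambda$. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the prior mean standing in for $B'x_i$, and all numeric values are hypothetical.

```python
import numpy as np

def normalize_weights(pi):
    """Normalized weights w~_i = n * w_i / sum(w), with w_i = 1 / pi_i, so they sum to n."""
    w = 1.0 / np.asarray(pi, dtype=float)
    return w.shape[0] * w / w.sum()

def weighted_poisson_loglik(psi, y, w_tilde):
    """First expression of (15c): w~_i * sum_d [y_id * psi_id - exp(psi_id)]."""
    return w_tilde * np.sum(y * psi - np.exp(psi))

def elliptical_slice_step(psi, prior_mean, chol_cov, loglik, rng):
    """One elliptical slice update for a Gaussian prior N(prior_mean, chol_cov @ chol_cov.T)."""
    nu = prior_mean + chol_cov @ rng.standard_normal(psi.shape[0])  # auxiliary prior draw
    log_y = loglik(psi) + np.log(rng.uniform())                     # slice height
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        # Point on the ellipse through the current state and the prior draw.
        prop = prior_mean + (psi - prior_mean) * np.cos(theta) + (nu - prior_mean) * np.sin(theta)
        if loglik(prop) > log_y:
            return prop
        # Shrink the angle bracket towards the current state (theta = 0) and retry.
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)

rng = np.random.default_rng(2)
Lambda = np.array([[4.0, -2.0], [-2.0, 4.0]])   # illustrative precision matrix
w_tilde = 1.7                                   # illustrative normalized weight for unit i
prior_mean = np.array([1.0, 0.5])               # stands in for B'x_i
chol_cov = np.linalg.cholesky(np.linalg.inv(w_tilde * Lambda))
y_i = np.array([4, 2])                          # illustrative counts for unit i

psi = prior_mean.copy()
draws = []
for _ in range(2000):
    psi = elliptical_slice_step(
        psi, prior_mean, chol_cov,
        lambda p: weighted_poisson_loglik(p, y_i, w_tilde), rng)
    draws.append(psi)
draws = np.array(draws)
print(draws.mean(axis=0))
```

The weight enters in two places, mirroring Equation 15c: it scales the Poisson log-likelihood and it scales the prior precision used to generate the auxiliary draw.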
We see that weights scale the quadratic product of the Gaussian kernel in Equation 16b such that we may accomplish the same result using the matrix variate formation to define the pseudo likelihood, $\mathcal{N}_{n \times D}\left(\Psi \mid XB, W, \Lambda^{-1}\right)$, where $W = \mathrm{diag}(\tilde{w}_1,\ldots,\tilde{w}_n)$, the weights for the sampled observations, from which we compute the following conjugate conditional pseudo posterior distribution defined on the $n$ observations,

$$p^{\pi}(B \mid Y,X,\Psi,\Lambda,M,\tau_B) = h^{\pi}_B + \mathcal{N}_{P \times D}\left(\left(\phi^{\pi}_B\right)^{-1},\Lambda^{-1}\right), \tag{17}$$

where $\phi^{\pi}_B = X'WX + \tau_B M$ and $h^{\pi}_B = \left(\phi^{\pi}_B\right)^{-1}X'W\Psi$. Under employment of a simpler continuous response framework, the conditional posterior for $B$ retains the same form as Equation 17, except the latent response on the logarithm scale, $\Psi$, would be replaced by the observed data, $Y$. Intuitively, we note that using a sampling-weighted pseudo prior for the latent response, $\Psi$, for sampling coefficients, $B$, is analogous to using the sampling-weighted likelihood in the case of an observed, continuous response, $Y$.

Each plot panel in Figure 1 compares estimated posterior distributions for a coefficient in $B$ within 95% credible intervals, labeled by "predictor, dimension of the multivariate response", when applied to the May, 2012 JOLTS dataset between two estimation models:

1. The left-hand plot in each panel employs the sampling weights to estimate the pseudo posterior for $B$, induced by the pseudo posterior for the latent response in Equation 15c;
2. The right-hand plot estimates the coefficients using the posterior distribution defined on the finite population, which may be achieved by replacing $W$ by the identity matrix to equally weight establishments.

Equal weighting of establishments assumes that the sample represents the same balance of information as the population from which it was drawn, which is not the case under an informative sampling design. Comparing estimation results from the pseudo posterior and population posterior distributions provides one method to assess the sensitivity of estimated parameter distributions to the sampling design.
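The conjugate update in Equation 17 reduces to a weighted ridge-type computation, sketched below with NumPy. The function names, data and hyperparameter values are hypothetical and serve only to illustrate the algebra; the draw uses the factorization $B = h + AZC'$ with $AA' = \phi^{-1}$ and $CC' = \Lambda^{-1}$, which is one standard way to sample a matrix variate Gaussian.

```python
import numpy as np

def pseudo_posterior_B(X, Psi, w_tilde, M, tau_B):
    """Return (h, phi): the mean and P x P precision of the conditional pseudo posterior for B."""
    XtW = X.T * w_tilde                    # X'W with W = diag(w_tilde), without forming W
    phi = XtW @ X + tau_B * M              # phi = X'WX + tau_B * M
    h = np.linalg.solve(phi, XtW @ Psi)    # h = phi^{-1} X'W Psi
    return h, phi

def draw_B(h, phi, Lambda, rng):
    """One matrix variate Gaussian draw: B = h + A Z C', A A' = phi^{-1}, C C' = Lambda^{-1}."""
    A = np.linalg.cholesky(np.linalg.inv(phi))
    C = np.linalg.cholesky(np.linalg.inv(Lambda))
    Z = rng.standard_normal(h.shape)
    return h + A @ Z @ C.T

rng = np.random.default_rng(3)
n, P, D = 6, 2, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Psi = rng.normal(size=(n, D))
w = np.array([2.0, 1.0, 1.0, 1.0, 1.0, 1.0])   # illustrative weights (not normalized)
M, tau_B = np.eye(P), 0.5
Lambda = np.array([[4.0, -2.0], [-2.0, 4.0]])

h_w, phi_w = pseudo_posterior_B(X, Psi, w, M, tau_B)

# A weight of 2 on the first unit matches duplicating that row with unit weights.
X_rep, Psi_rep = np.vstack([X, X[:1]]), np.vstack([Psi, Psi[:1]])
h_rep, phi_rep = pseudo_posterior_B(X_rep, Psi_rep, np.ones(n + 1), M, tau_B)

B_draw = draw_B(h_w, phi_w, Lambda, rng)
print(np.allclose(h_w, h_rep), B_draw.shape)
```

Passing a vector of ones for `w_tilde` corresponds to replacing $W$ by the identity matrix, which gives the unweighted population posterior used for comparison in the text.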

[Figure 1: grid of plot panels, one per response-predictor pair (Hires and Seps, each by Emp, Int, Local, Midw, Opens, Private, South, State, West). Horizontal axis: weight vs. ignore (Response - Predictor); vertical axis: Distribution within 95% CI for Coefficient.]

Fig 1: Comparison of posterior densities for each coefficient in the (P = 9) × (D = 2) coefficient matrix, B, within 95% credible intervals, based on inclusion of sampling weights in a pseudo posterior (the left-hand plot in each panel) and exclusion of the sampling weights using the posterior distribution defined for the population (the right-hand plot). Each plot panel is labeled by "predictor, response" for the two included response variables, Hires and Seps (total separations).

We observe that the estimated results are quite different in both location and variation between estimations performed under the pseudo posterior and population posterior distributions, indicating a high degree of informativeness in the sampling design. The 95% credible intervals for the coefficients of the continuous predictors - the log of job openings (Opens) and employment (Emp) - don't even overlap on both the number of hires (Hires) and separations (Seps) responses. The coefficient for the State ownership predictor and the number of hires response is bounded away from 0 when estimated under the unweighted population posterior, but is centered on 0 under the sampling-weighted pseudo posterior.
The coefficient posterior variances estimated on the observed sample under the population posterior are understated because they don't reflect the uncertainty with which the information in the sample expresses that in the population, which is captured through the sampling weights.

4.1. Simulation Study

We implement a simulation study to compare the marginal pseudo posterior distributions to the unweighted population posterior distributions for the regression coefficients, where both are estimated on the observed sample drawn under an informative sampling design. For this study we use the N = 8595 observations from the JOLTS May, 2012 data as our population. We take 100 Monte Carlo samples of size n = (500, 1000, 1500, 2500) establishments using an informative single-stage sample design with unequal inclusion probabilities based on the proportional to size sample used for the real JOLTS survey. Characteristics of the sampling design used for this study, at each sample size, are presented in Table 1. This sampling design will induce distributions of the observed samples that will be different from those for the population. The designed correlation between the response and inclusion probabilities will produce observed samples with values skewed towards higher numbers of hires and separations than in the population.

imsart-ejs ver. 2014/10/16 file: EJS1153.tex date: June 7, 2016

      n     CUs   min(π)   max(π)   CV(π)   Cor(y_Hires, π)   Cor(y_Seps, π)
1    500     56    0.02     1.00     2.11        0.80              0.62
2   1000    196    0.04     1.00     1.60        0.69              0.50
3   1500    357    0.07     1.00     1.29        0.61              0.44
4   2500    722    0.14     1.00     0.91        0.51              0.36

Table 1: Characteristics of the single stage, fixed size pps sampling design used in the simulation study. n denotes the sample size. CUs denotes the number of certainty units with inclusion probabilities equal to 1. π denotes the inclusion probabilities proportional to the square root of JOLTS employment, CV(π) denotes the coefficient of variation of π, Cor(y_Hires, π) denotes the correlation of the number of hires and π, and Cor(y_Seps, π) denotes the correlation of the number of separations and π.

Figure 2 demonstrates this difference between the distributions for realized samples under the informative sampling design compared to those for the finite population. The left-most box plot in each of the two panels displays the population distribution for a response value. A single sample is drawn under a sequence of increasing sample sizes for illustration. The next set of box plots displays the resulting distributions for the response values in each sample, with size increasing from left to right. The left-hand plot panel displays the distributions for the Hires response, while the right-hand panel displays those for the Seps (separations) response variable.

Pseudo posterior and population posterior distributions are estimated on each Monte Carlo sample at each sample size in n. Figure 3 compares estimation of the posterior distribution from the fully-observed population (left-hand box plot) to estimation using the pseudo posterior from sample observations taken under the proportional-to-size sampling design. The third box plot in each panel shows the estimation of the posterior distribution estimated on the same sample ignoring the informative sampling design.
The last box plot in each panel displays the estimates of the posterior distribution from a simple random sample of the same size, where no correction for the sampling design is required, as a gold standard against which to measure the performance of the pseudo posterior distribution. We estimate the distributions on each of the 100 Monte Carlo draws for each sample size and concatenate the results such that they incorporate both the variation of population generation and repeated sampling from that population. The sample sizes, n, increase from left to right across the plot panels. The top set of plot panels displays the posterior distributions of the regression coefficient for the employment predictor (Emp) and the hires response (Hires), while the bottom set of panels displays the coefficient distributions for the employment predictor (Emp) and the total separations response (Seps).
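The design characteristics reported for the simulation (certainty units, the range of the inclusion probabilities and their coefficient of variation) follow from how fixed-size, proportional-to-size inclusion probabilities are assigned. The sketch below shows one common way to compute them, capping probabilities at 1 and re-allocating the remaining sample size. It is an assumed illustration: the function name `pps_inclusion_probs` and the simulated employment values are hypothetical, and the procedure is not taken from the JOLTS production system.

```python
import numpy as np

def pps_inclusion_probs(size, n):
    """Inclusion probabilities proportional to `size` for a fixed sample of n units.

    Units whose probability would reach 1 become certainty units (pi = 1), and the
    remaining sample size is re-allocated proportionally among the other units.
    """
    size = np.asarray(size, dtype=float)
    pi = np.zeros_like(size)
    certain = np.zeros(size.shape[0], dtype=bool)
    while True:
        remaining = n - certain.sum()
        pi[~certain] = remaining * size[~certain] / size[~certain].sum()
        new_certain = (~certain) & (pi >= 1.0)
        if not new_certain.any():
            break
        certain |= new_certain
    pi[certain] = 1.0
    return pi

rng = np.random.default_rng(4)
emp = rng.lognormal(mean=3.0, sigma=1.5, size=8595)   # hypothetical employment counts
pi = pps_inclusion_probs(np.sqrt(emp), n=500)         # size measure: square root of employment

print("certainty units:", int((pi == 1.0).sum()))
print("min, max:", pi.min(), pi.max())
print("CV:", pi.std() / pi.mean())
```

By construction the probabilities sum to the target sample size, so a fixed-size design drawn with them has expected size n.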

[Figure 2: two plot panels, Hires and Seps. Horizontal axis: Sample Size (pop, 100, 500, 1000, 2000); vertical axis: Distribution of Response Values.]

Fig 2: Distributions of response values for the population compared to informative samples. The left-most box plot in each of the two plot panels contains the distribution for the JOLTS sample that we use as our population in the simulation study. The next set of box plots shows the distribution of the response values for increasing sample sizes, from left to right, for each sample drawn under our single stage proportion-to-size design. The left-hand plot panel displays the Hires response variable and the right-hand panel displays the Seps (separations) response variable.

[Figure 3: plot panels by sample size (500, 1000, 1500, 2500) in columns and coefficient (Emp_Hires, Emp_Seps) in rows. Horizontal axis: Sample Size, with box plots labeled pop, weight, ignore, srs; vertical axis: Distribution within 95% CI for Coefficient.]

Fig 3: Comparison of posterior densities for 2 coefficients, Employment-Hires (top row of plot panels) and Employment-Separation (bottom row of plot panels) in B, within 95% credible intervals, between estimation on the population (left-hand plot in each panel) and estimations from informative samples taken from that population, which include sampling weights in a pseudo posterior (the second plot from the left in each panel) and exclusion of the sampling weights using the population posterior distribution (the third plot from the left), under a simulation study. The right-most plot presents the posterior density estimated from a simple random sample of the same size for comparison. The simulation study uses the May, 2012 JOLTS sample as the population and generates 500 informative samples for a range of sample sizes of (500, 1000, 1500, 2500), from left to right, under a sampling without replacement design with inclusion probabilities set proportionally to the square root of employment levels. A separate estimation is performed on each Monte Carlo sample and the draws from estimated distributions are concatenated over the samples.