STATISTICAL INFERENCE WITH DATA AUGMENTATION AND PARAMETER EXPANSION

arXiv:1512.00847v1 [math.ST] 2 Dec 2015

Yannis G. Yatracos
Faculty of Communication and Media Studies
Cyprus University of Technology

December 4, 2015

Summary

Statistical pragmatism embraces all efficient methods of statistical inference. Augmentation of the collected data is used herein to obtain representative population information from a large class of non-representative population units. Parameter expansion of a probability model is shown to reduce the upper bound on the sum of error probabilities for a test of simple hypotheses, and a measure, R, is proposed for the effect of activating additional component(s) in the sufficient statistic.

Some key words: Collected Data Augmentation; Parameter Expansion; Representative Information; Statistical Pragmatism

1 Introduction

In recent years, response rates in polls have decreased further, and a representative sample may not always be available. For example, with random digit dialing (RDD) the response rate fell from 36% in 1997 to 9% in 2012 (Kohut et al., 2012). According to Wang et al. (2015), it is convenient and cost-effective to collect a very large non-representative sample via online surveys and, with statistical adjustments, obtain accurate election forecasts on par with those based on traditional representative polls.

We study first the problem of obtaining representative information for the population from a large number of non-representative population units sharing a common attribute, A. This attribute could be, e.g., a Facebook account or following a celebrity in a social network; the latter occurred in practice with Social Voting Advice Applications (Katakis et al., 2014). Under assumptions that also occur in practice, units with attribute A are expected to each provide additional, accurate information for one of the remaining units in the population without attribute A. Due to the large number of units with attribute A, the so-augmented data-information from all strata can be used to obtain information equivalent to that from a representative sample (Kruskal and Mosteller, 1979).

The second problem studied is model parameter expansion (PX), which is shown to reduce the upper bound on the sum of error probabilities when testing two simple hypotheses with a test introduced by Kraft (1955). The proof confirms that parameter expansion activates a sufficient statistic with additional component(s) (Rubin, 1997), and the effect of the activation is clarified with the introduced measure, R, obtained with a one-parameter expansion. These results explain why the PX-EM algorithm (Liu et al., 1998) converges faster than the EM algorithm (Dempster et al., 1977) and its many variations.
Fisher (1922) introduced into statistical inference the use of a model that is not updated. Essentially, all models are wrong but some are very useful (attributed to George Box in Rubin, 2005). The need to improve estimation procedures led statisticians to relax the use of the one-and-only assumed model by adopting, for example, the Bayesian model

averaging approach (see, e.g., Hoeting et al., 1999). Expansion of a probability model and augmentation of the collected data improve, respectively, the model's fit to the data and the estimates of the model's parameters. Artificial data augmentation has been used in missing-value problems (see, e.g., Rubin, 1987), to improve the convergence of the EM algorithm (see, e.g., Meng and van Dyk, 1997), and to reduce the mean squared error of U-statistics, in particular of the unbiased estimates of variance and covariance (Yatracos, 2005). Parameter expansion (Meng and van Dyk, 1997; Liu et al., 1998) and the model-updated maximum likelihood estimate (MUMLE, Yatracos, 2015) are also examples of deviations from the one-model approach.

Kass (2011) introduced statistical pragmatism, a new philosophy that is inclusive of the Bayesian, frequentist, and all other efficient approaches to inference. Statistical pragmatism emphasizes the assumptions that connect statistical models with the observed data, and it is implemented herein with parameter expansion and augmentation of the collected data. The additional contribution of the latter is that: a) it introduces a new component into the statistical-inference set-up, namely that each unit in the population provides, in addition, data for other units; and b) it includes as a goal obtaining representative population information when representative units or a representative sample are not available.

2 Representative Information from Non-Representative Units with Augmentation of the Collected Data

Accessible units in the population with attribute A responding to a questionnaire are not representative, e.g., when their age is between 25 and 35, and the collected information is not comparable to that of a random sample. Assumptions are made for attribute A and the population that allow representative population information to be obtained from these non-representative units.
Assumption 1 (Common Attribute): Attribute A is common in the population: the number of respondents with attribute A is large, each respondent has in its immediate environment one associated unit without attribute A, and collectively the associated units come

from all strata.

Assumption 2 (Effective Information): A large percentage of the units with attribute A each have accurate, questionnaire-related information for their associated unit without attribute A.

Definition 2.1 Representative population information is information equivalent to that obtained from a representative sample from the population.

The Common Attribute and Effective Information assumptions guarantee that a large number of units responding to the questionnaire will each provide information for their associated unit, so information will be obtained from all strata. A representative sample, e.g., a random sample, from the so-obtained information will provide representative information for the population. The degree of accuracy of the information given by a respondent for a unit without attribute A may be clarified from the respondent's answers to the questionnaire. Given the large number of respondents, only the most accurate information for associated units without attribute A will be used. The additional information provided by some of the units may introduce bias, so bias-reduction methods will guarantee the best result.

The proposed method of data augmentation will be used in Voting Advice Applications before the general elections in Spain (December 20, 2015). Note that collected-data augmentation differs from the notion of data augmentation introduced in Tanner and Wong (1987) and Gelman (2004).

3 Parameter Expansion in Statistical Procedures

For the EM algorithm and its variations, the observed-data model f(x_obs | θ) and the augmented (also called complete) data model f(x_com | θ) have the same parameter θ. For the PX-EM algorithm (Liu et al., 1998), f(x_com | θ) contains another parameter with known value η_0 and is expanded to a larger model f_X(x_com | θ*, η), with θ* playing the role of θ.

The complete-data model is preserved when η = η_0, that is, f_X(x_com | θ*, η_0) = f(x_com | θ = θ*). In several examples in the same paper it is observed that the PX-EM algorithm is faster than the EM algorithm and its variations.

Testing simple hypotheses is an elementary but fundamental problem in statistical inference. For example, families of tests of simple hypotheses allow one to obtain a consistent estimate, with calculation of rates of convergence of this estimate in Hellinger distance (LeCam, 1986). Improved tests will increase the accuracy of the so-obtained estimate. It will be shown that a test of simple hypotheses is improved with parameter expansion. In the sequel, omitted domains of integration are determined by the integrand densities.

Definition 3.1 For densities f, g defined on R, the Hellinger distance H(f, g) is given by

H^2(f, g) = ∫ [f^{1/2}(x) − g^{1/2}(x)]^2 dx.

The affinity of f, g is ρ(f, g) = ∫ f^{1/2}(x) g^{1/2}(x) dx. For H(f, g) and ρ it holds that

H^2(f, g) = 2 (1 − ρ(f, g)) ≤ 2;   (1)

H^2(f, g) = 2 if and only if ρ(f, g) = 0, i.e., if f(x) g(x) = 0 almost surely.

For a sample X_1, ..., X_n with density either f_n or g_n, the larger H^2(f_n, g_n) is, the easier it is to determine either the true density of the data or, in parametric models, the true parameter. Consistent testing is guaranteed because lim_{n→+∞} H^2(f_n, g_n) = 2.

Assume that the sample X_1, ..., X_n has density f(x | θ) and that a different model parameter, η, with known value η_0 is already included in the model. It is shown that the upper bound on the sum of error probabilities of a consistent test introduced by Kraft (1955) for the hypotheses

H_0: θ = θ_0 and H_1: θ = θ_1,   (2)
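Identity (1) can be checked numerically. The following sketch (an illustration added here, not part of the original argument; the Gaussian choice and all function names are ours) computes the affinity ρ and H^2 for two unit-variance normal densities, for which ρ has the closed form exp(−(μ_1 − μ_0)^2 / 8):

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def affinity(f, g, lo=-15.0, hi=17.0, n=100_000):
    """Numerical affinity rho(f, g) = integral of sqrt(f g) via the midpoint rule."""
    h = (hi - lo) / n
    return sum(math.sqrt(f(lo + (i + 0.5) * h) * g(lo + (i + 0.5) * h)) * h
               for i in range(n))

mu0, mu1 = 0.0, 2.0
rho = affinity(lambda x: normal_pdf(x, mu0), lambda x: normal_pdf(x, mu1))
H2 = 2 * (1 - rho)                              # identity (1)
rho_closed = math.exp(-(mu1 - mu0) ** 2 / 8)    # closed form for unit-variance normals

print(rho, rho_closed, H2)   # rho ≈ exp(-0.5) ≈ 0.6065, H2 ≈ 0.7869
```

The numerical affinity matches the closed form, and H^2 stays below the bound 2, as (1) requires.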

is reduced with parameter expansion. Let t_{n,1} = t_1 be the sufficient statistic for θ, with density g(t_1 | θ). Kraft (1955) provided the consistent test

φ_n = I( g(t_1 | θ_1) / g(t_1 | θ_0) > 1 );   (3)

I denotes the indicator function. Note that the criticisms of the classical α-level test, for the preferential treatment of H_0 and the predetermined level of the test, do not hold for test (3). The error probabilities when using φ_n satisfy

E_{H_0} φ_n ≤ ∫_{g(t_1|θ_1) > g(t_1|θ_0)} [g(t_1 | θ_1) g(t_1 | θ_0)]^{1/2} dt_1,

E_{H_1} (1 − φ_n) ≤ ∫_{g(t_1|θ_0) ≥ g(t_1|θ_1)} [g(t_1 | θ_1) g(t_1 | θ_0)]^{1/2} dt_1.

Then, for the sum of the error probabilities,

E_{H_0} φ_n + E_{H_1} (1 − φ_n) ≤ ∫ [g(t_1 | θ_1) g(t_1 | θ_0)]^{1/2} dt_1.   (4)

Lower values of the affinity ∫ [g(t_1 | θ_1) g(t_1 | θ_0)]^{1/2} dt_1 indicate an increase in the separation of the densities g(t_1 | θ_0) and g(t_1 | θ_1), making it easier to distinguish between the two hypotheses.

Consider the expanded model f(x | θ, η) with the new sufficient statistic (t_1, t_{2,n} = t_2) having joint density

h(t_1, t_2 | θ, η) = g(t_1 | θ, η) g̃(t_2 | t_1, θ, η).   (5)

Note that g(t_1 | θ, η = η_0) = g(t_1 | θ) and that g̃ is the conditional density of t_2 given t_1. For the expanded model (5) and the hypothesis-testing problem

H_0: θ = θ_0, η = η_0 and H_1: θ = θ_1, η = η_0,   (6)

the statistic t_2 is activated, and a consistent test for these hypotheses, similar to φ_n, is

ψ_n = I( [g(t_1 | θ_1) g̃(t_2 | t_1, θ_1, η_0)] / [g(t_1 | θ_0) g̃(t_2 | t_1, θ_0, η_0)] > 1 ).
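The reduction in the affinity bound can be seen in a small numerical sketch (a hypothetical Gaussian illustration of ours, not an example from the paper). Take a single observation t_1 ~ N(θ, 1), so Kraft's test (3) rejects when t_1 exceeds the midpoint of θ_0 and θ_1, and suppose the expansion activates a component whose conditional law t_2 | t_1 ~ N(θ, 2) depends on θ; its conditional affinity is then strictly less than 1, so the expanded-model affinity shrinks:

```python
import math

def Phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

theta0, theta1 = 0.0, 2.0
delta = theta1 - theta0

# Test (3) for t1 ~ N(theta, 1) rejects when t1 > (theta0 + theta1) / 2.
mid = (theta0 + theta1) / 2
err_H0 = 1.0 - Phi(mid - theta0)   # E_{H0} phi_n
err_H1 = Phi(mid - theta1)         # E_{H1} (1 - phi_n)
err_sum = err_H0 + err_H1          # = 2 * Phi(-delta / 2) ≈ 0.3173

# Affinity bound (4): for unit-variance normals, rho1 = exp(-delta^2 / 8).
rho1 = math.exp(-delta ** 2 / 8)   # ≈ 0.6065

# Hypothetical activated component t2 | t1 ~ N(theta, 2): conditional affinity
# rho2 = exp(-delta^2 / 16) < 1, so the expanded-model affinity rho1 * rho2
# is strictly smaller than rho1; the difference is the measure R.
rho2 = math.exp(-delta ** 2 / 16)
R = rho1 - rho1 * rho2             # ≈ 0.134 > 0

print(err_sum, rho1, rho1 * rho2, R)
```

The sum of error probabilities (≈ 0.317) indeed sits below the original affinity bound (≈ 0.607), and the expanded-model bound (≈ 0.472) is strictly smaller, quantifying the gain from activating t_2.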

For the sum of the error probabilities of ψ_n it holds that

E'_{H_0} ψ_n + E'_{H_1} (1 − ψ_n) ≤ ∫ [g(t_1 | θ_1, η_0) g(t_1 | θ_0, η_0)]^{1/2} { ∫ [g̃(t_2 | t_1, θ_1, η_0) g̃(t_2 | t_1, θ_0, η_0)]^{1/2} dt_2 } dt_1.   (7)

When g̃ depends on θ and t_1 is fixed, g̃(t_2 | t_1, θ_1, η_0) and g̃(t_2 | t_1, θ_0, η_0) are not equal a.s. in t_2, and from the Cauchy-Schwarz inequality for integrals it follows that, for every t_1,

∫ [g̃(t_2 | t_1, θ_1, η_0) g̃(t_2 | t_1, θ_0, η_0)]^{1/2} dt_2 < 1.   (8)

Using Fubini's theorem on the right side of (7) together with (8), it follows that

∫ [g(t_1 | θ_1, η_0) g(t_1 | θ_0, η_0)]^{1/2} { ∫ [g̃(t_2 | t_1, θ_1, η_0) g̃(t_2 | t_1, θ_0, η_0)]^{1/2} dt_2 } dt_1 < ∫ [g(t_1 | θ_1) g(t_1 | θ_0)]^{1/2} dt_1.   (9)

From (4), (8) and (9) it follows that the upper bound on the sum of error probabilities in (7) for the expanded model is smaller than that for the original model.

Definition 3.2 The effect of activating the component t_2 in the sufficient statistic of the expanded model (5), with respect to the model in (2), is measured by the difference of the affinities,

R = ∫ [g(t_1 | θ_1) g(t_1 | θ_0)]^{1/2} dt_1 − ∫∫ [h(t_1, t_2 | θ_1, η_0) h(t_1, t_2 | θ_0, η_0)]^{1/2} dt_1 dt_2;   (10)

by (1), R is proportional to the difference of the squared Hellinger distances of the models in (2) and in (5).

Acknowledgments

Many thanks are due to my colleague Dr. Nicolas Tsapatsoulis and to Dr. Fernando Mendez for introducing me to the theory of Social Voting Advice Applications and to practical statistical problems remaining to be solved. Very many special thanks are due to the Department of Statistics and Applied Probability, National University of Singapore, for the warm hospitality during my summer visits, when most of the results were obtained. This research was partially supported by a grant from Cyprus University of Technology.

References

[1] Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood estimation from incomplete data via the EM algorithm (with Discussion). J. R. Statist. Soc. B 39, 1-38.

[2] Draper, D. (1995) Assessment and propagation of model uncertainty. J. R. Statist. Soc. B 57, 45-97.

[3] Fisher, R. A. (1922) On the mathematical foundations of theoretical statistics. Phil. Trans. R. Soc. A 222, 309-368.

[4] Gelman, A. (2004) Parameterization and Bayesian modeling. JASA 99, 537-546.

[5] Hoeting, J., Madigan, D., Raftery, A. E. and Volinsky, C. (1999) Bayesian model averaging: a tutorial. Statistical Science 14, 382-401.

[6] Kass, R. E. (2011) Statistical inference: the big picture. Statistical Science 26, 1, 1-9.

[7] Katakis, I., Tsapatsoulis, N., Mendez, F., Triga, V. and Djouvas, C. (2014) Social Voting Advice Applications: definitions, challenges, datasets and evaluation. IEEE Transactions on Cybernetics 44, 7, 1039-1052.

[8] Kraft, C. (1955) Some conditions for consistency and uniform consistency of statistical procedures. University of California Publications in Statistics 2, 125-142.

[9] Kruskal, W. and Mosteller, F. (1979) Representative sampling, II: scientific literature, excluding statistics. International Statistical Review 47, 2, 111-127.

[10] LeCam, L. M. (1986) Asymptotic Methods in Statistical Decision Theory. Springer, New York.

[11] Liu, C., Rubin, D. B. and Wu, Y. N. (1998) Parameter expansion to accelerate EM: the PX-EM algorithm. Biometrika 85, 755-770.

[12] Meng, X. L. and van Dyk, D. A. (1997) The EM algorithm an old folk song sung to a fast new tune (with Discussion). J. R. Statist. Soc. B 59, 511-567. [13] Rubin, D. B. (2005) Causal inference using potential outcomes: Design, Modeling, Decisions. JASA 100, 322-331. [14] Rubin, D. B. (1997) Discussion of the paper by Meng and van Dyk. J. R. Statist. Soc. B 59, 541-542. [15] Rubin, D. B. (1987) Multiple imputation for nonresponse in surveys. Wiley, New York. [16] Tanner, A. M. and Wong, W. H. (1987) The calculation of posterior distributions by data augmentation (with discussion) JASA 82, 528-550. [17] Wang, W., Rothschild, D., Goel, S., and Gelman, A. (2015) Forecasting elections with non-representative polls. International Journal of Forecasting 31, 3, 980-991. [18] Yatracos, Y. G. (2015) MLE s bias pathology, model updated MLE and Wallace s minimum message length method. IEEE Transactions on Information Theory 61, 3, 1426-1431. [19] Yatracos, Y. G. (2005) Artificially augmented samples, shrinkage and MSE reduction. JASA 100, 1168-1175. 10