Combination of evidence from multiple administrative data sources: quality assessment of the Austrian register-based Census 2011

Size: px

Start display at page:

Download "Combination of evidence from multiple administrative data sources: quality assessment of the Austrian register-based Census 2011"

Gloria Reeves
5 years ago
Views:

1 18 Statistica Neerlandica (2012) Vol. 66, nr. 1, pp doi: /j x Combination of evidence from multiple administrative data sources: quality assessment of the Austrian register-based Census 2011 Christopher Berka, Stefan Humer and Mathias Moser Department of Economics, WU Vienna, Augasse 2-6, 1090 Vienna, Austria Manuela Lenk*, Henrik Rechta and Eliane Schwerer Statistics Austria, Guglgasse 13, 1110 Vienna, Austria This article investigates the quality of register data in the context of a standardized quality framework. The special focus of this work lies on the assessment of census data and how to deal with uncertainty that arises from multiple sources (registers). To take the uncertainty associated with support and conflict between several registers into account, Dempster Shafer s theory of evidence is applied. This fuzzy approach allows us to investigate the quality of databases with multiple underlying sources for a single attribute and to provide both quality measures and plausibility intervals. Keywords and Phrases: Dempster Shafer theory, fuzzy logic. 1 Introduction The importance of register-based information for statistical analyses has increased strongly in recent years. It is obvious that the use of such already recorded information reduces the respondent s burden significantly, as opposed to survey data. Therefore utilizing these data sources becomes increasingly popular among NSIs (National Statistical Institutions). In addition, recent changes in legislation urge Statistics Austria to intensify the use of register data to minimize the costs of collecting data. Accordingly, Statistics Austria will rely on administrative data for the Austrian Census in In contrast to the Test-Census in 2006, the census itself will not be accompanied by supplementary questionnaires. This shift from a traditional census (using surveys) towards a register-based one also challenges the process of quality measurement. For surveys a widely agreed methodology exists in terms of quality assessment, such as sampling errors. In contrast to surveys, there is no such toolkit for the evaluation of register-based data. * manuela.lenk@statistik.gv.at Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

2 Evidence from multiple administrative data sources 19 As a consequence, various methods have been established to assess the quality of this kind of data. First and foremost, it is common to double-check register data by issuing questionnaires to measure and enhance the quality of administrative records. As outlined before, for Austria a Test-Census in 2006 was accompanied by such an evaluation survey which provided valuable information on the quality of Austrian registers (see Lenk, 2008). However, an exhaustive examination of registers through sample surveys is an intensive process and only feasible for countries which have a long-lasting experience in the usage of registers (see, e.g. Hokka and Nieminen, 2008). Since the use of administrative records for statistical purposes is a relatively new approach in Austria, a different methodology for the quality measurement is needed. In general, the quality assessment of register data has to fulfill several properties, for example, transparency, accuracy and feasibility. To achieve these goals we set up a general framework, which allows to evaluate the quality of registers with regard to all available information. This also includes metadata, which is an enhanced approach apposed to classical data evaluation methods. With respect to the data flow for the census, the assessment consists of three levels: raw data (registers), combined dataset (Census Database or CDB) and imputed dataset (Final Database). In this context we refer to databases as registers which are created by rulesets based on the raw data. Therefore, they are not administrative sources themselves but merely derived from administrative register information. At each of these levels changes in quality are monitored to give a quality measure for each attribute j in each register or database i. Quality information on raw data level is derived from three hyperdimensions: documentation (HD D ), preprocessing (HD P ) and external source (HD E ). HD D includes all quality-related aspects prior to seeing the data and condenses them to the degree of confidence we put in the data owner. These are for example plausibility checks, data collection methods or legal enforcements of data recording. This is realized through a questionnaire which is filled out in accordance with the register authority. For each question there is a maximum score that can be obtained. The second aspect on the raw data level focuses on the data preprocessing (HD P ) at Statistics Austria. It measures the formal correctness by checking for implausibilities, errors or missing items in the data. Subtracting all these erroneous entries from the total number of observations leads to the number of usable records (see Table 1). In a third step, we investigate the accuracy of the data by comparing it with an external source (HD E ). This is primarily done using existing surveys (i.e., the Austrian Microcensus), because we can link the data with a unique key on the unit level to compare the values and check for consistency. If an attribute is not found in the Microcensus we rely on expert opinions. The expert is a person at Statistics Austria, who is responsible for the administrative register and therefore has the most experience concering the quality of the data. The quality measures from these three hyperdimensions range between 0 and 1 (the closer to 1, the better is the quality) and are defined as follows:

3 20 Christopher Berka et al. Table 1. HD preprocessing Number of observations Records without unique ID Records with item non-response (but including unique IDs) Records with wrong values or values out of range = Usable records HD D obtained score : maximum score, HD P usable records : total number of records, HD E number of consistent values : total number of linked records. In a next step, these three hyperdimensions can be combined to form a quality indicator per attribute and register which comprises all available information. To achieve this we use weights for each hyperdimension, which resemble the relative importance of each hyperdimension for the quality assessment. The weighted average of the hyperdimensions gives the quality indicator q ij per register and attribute (see Equation 1). q ij = v D hdij D + v P hdij P + v E hdij E = v k hdij k. (1) k D,P,E Furthermore, we can use this quality measure to assess the quality of databases, such as the CDB. Most important, this process is independent of data processing to avoid endogeneity problems. Further details on the general assessment and process flow can be found in Berka et al. (2010). If we focus on the rating of the CDB (or Ψ), the census itself can be seen as a process which assigns several socio-economic attributes to every relevant person. In case of multiple sources for the same attribute this is realized through a predefined ruleset, which picks the most appropriate information from the underlying registers. To measure the quality of this process, it is necessary to take the support or conflict that arises when we compare multiple sources with the CDB into account. Accordingly, we compare the values that the ruleset chooses for each person in the database with the entries in the raw data registers. If the entries match, we consider the ruleset to be good and the quality indicator will be high. In the terminology of our framework this is equivalent to the degree of confidence we have in the CDB. The sources that are used for comparison are the single registers on the raw data level (see Figure 1, left-hand side). When we compare the CDB with the raw data registers we can basically distinguish three cases: (a) a single comparison register is available (see Figure 1, attribute C), (b) multiple registers to compare with (see Figure 1, attribute A) and (c) no raw data registers with a similar attribute are disposable (see Figure 1, attributes F and G). Case (a) is trivial to assess since the confidence we put in the CDB is simply q ij, where q ij is the quality indicator for the specific attribute j in register i. Case (c) is more complicated to deal with, since the attribute in the CDB is derived from other attributes. Therefore, we cannot use the quality indicators of the base

4 Evidence from multiple administrative data sources 21 Figure 1. Quality framework registers directly. To create a consistent and comparable quality measure it is necessary to take the amount of information flow from base attributes to the derived attribute into account. This is subject to further research. In this article, we focus on the assessment of case (b). Applying weighted averages is problematic since in this standard approach we ignore the uncertainty that is associated with conflict among registers. More specifically a register that disagrees with the CDB does not necessarily imply that the latter is wrong. However, it may express a degree of uncertainty on the value of the database. Subsequently we will apply a specific form of this fuzzy logic, namely the Dempster Shafer theory. In this article, we apply this theory on the Austrian register-based census. However, since the framework was composed for generic process flows this is only one specific application area. In the next section, we will give a brief overview on fuzzy logic and fuzzy sets in general. Furthermore we provide more detailed information on the Dempster Shafer theory and give calculation examples. Section 3 introduces Dempster Shafer theory in our quality framework and outlines why and how it is applied. Finally, section 4 derives preliminary quality measures for selected attributes of the census. Statistical properties and calculation rules can be found in the Appendix. 2 Dempster Shafer theory Uncertainty plays an essential role in the analysis of complex systems. Nonetheless, the definition of uncertainty in such research tasks often remains ambiguous

5 22 Christopher Berka et al. or unclear. Usually the researcher encounters uncertainty as a dual phenomenon (see Helton, 1997): 1. Stochastic uncertainty results from the fact that systems can behave in different ways, i.e., it is a property of the system itself. 2. Epistemic uncertainty occurs due to a lack of knowledge about the system and is thus a methodological problem when performing an analysis. Accordingly, epistemic uncertainty deals with the lack of knowledge about the distribution of a certain variable itself, for example, whether the register represents the true values. Hacking (1975) traces this very important distinction between the two types of uncertainty back to the beginnings of probability theory. Helton (1997) states that as long as the separation between stochastic and epistemic uncertainty is not maintained carefully, an evaluation of the systems behavior and characteristics on rational basis becomes difficult or even impossible. It is common sense that the stochastic part of uncertainty is best dealt within the so-called frequentist approach, the most important discipline of the traditional probability theory. In contrast, the epistemic uncertainty is not considered carefully enough by such a theory. To deal with these shortcomings of traditional probability theory, we apply a fuzzy approach. Statistical fuzzy logic aims to explain epistemic uncertainty and tries to implement models for it. It can be regarded as an extension of the classical probability theory and will therefore yield the same results when no uncertainty is present. Platon already mentioned that besides the dual approach of either TRUE or FALSE there has to be a way to express uncertainty. Current applications of the fuzzy logic are mainly based on the ideas of Zadeh (1965). He introduces fuzzy sets, in which an element can be included or excluded, but he also allows for partial inclusion in the set. The degree of inclusion is given by a so-called membership function as a value in the interval [0, 1]. These functions exist for each element and combined they yield the so-called fuzzy functions. These functions are generated either through statistics or opinions of experts. To derive these expert opinions, a special form of fuzzy logic can be applied. More specifically we use an evidence theory that was proposed by Dempster (1968) and extended by Shafer (1992). This so-called Dempster Shafer theory focuses on a field of probability theory that is closely related to fuzzy logic. It allows to combine different beliefs about the reality, that is, expert opinions. Eventually this results in a measure of evidence, which can be interpreted as a probability. This approach is specifically useful when an expert cannot make a definitive statement about the probability that a specific event will occur. What he or she has is a fuzzy belief about the probability that a certain event will arise. In this case the belief of an expert may differ from that of other experts. The Dempster Shafer theory aims to combine these different beliefs to come to an overall idea of the probability, taking the uncertainty among different beliefs into account. Consider the treatment of an ill

6 Evidence from multiple administrative data sources 23 patient in a hospital. Some doctors may have different beliefs about the true reason for the sickness. One doctor might consider a malfunction of the liver as the reason and has a degree of belief of 90%. It could be that another doctor thinks of some other reasons and therefore beliefs that the malfunction of the liver is not the main cause. An easy way to evaluate the overall belief would be a simple averaging of the different beliefs. However, this approach does not consider the uncertainty nor possible conflicts between expert opinions that are closely connected to the different beliefs. The Dempster Shafer theory tries to overcome these shortcomings by considering the role of uncertainty and conflicts within its framework. The theory consists of three fundamental functions: the basic probability assignment function (bpa), the Belief function (Bel) and the Plausibility function (Pl) (Sentz and Ferson, 2002). More formally the power set 2 X is the set of all subsets of X, including the empty set and X itself. The elements of the power set 2 X can be considered as hypothesis of the condition of a complex system. For the census 2 X can be seen as all possible constellations of certainty and uncertainty between registers or merely a subset of them (see Table 4 and the explanation in chapter 3). The Dempster Shafer theory of evidence assigns a specific degree of belief to each element of 2 X. Formally, this is represented by the bpa (see Klir and Wierman, 1998). The bpa defines a mapping of the power set to the interval between 0 and 1. bpa :2 X [0, 1]. The bpa of the empty set is 0 and the summation of the bpas of all elements of the power set 2 X equals 1. bpa( ) = 0, A 2 X bpa(a) = 1. The value of the bpa describes the proportion of evidence that encourages the hypothesis that a particular element of X belongs to a specific element A 2 X of the power set, but not a particular subset of A. It is important to note that the bpa(a) of the set A does not imply any statement about the value of the bpa for a subset of A. Based on the bpa function we derive an interval that contains the precise probability of the condition A of a system. Bel(A) P(A) Pl(A). The lower bound Bel for a certain set A is the sum of all bpas of subsets B of the set of interest A. Bel(A) = bpa(b). B B A

7 24 Christopher Berka et al. The upper bound of the interval is defined as the measure of Pl and is evaluated by summing up all bpas of the set B that intersect the set of interest A. Pl(A) = bpa(b). B B A The Bels as well as the Pls do not have to sum up to 1. They are non-additive since they are sums over an arbitrary number of subsets B out of A. Furthermore, due to the fact that the bpas sum up to 1, the Pl can be derived from the measure of Belief and vice versa. Pl(A) = 1 Bel( A). Furthermore, we can also compute bpas from, for example, given Bels. If Bel(A) equals Pl(A) the probability of the condition A of a process is explicitly determined. Accordingly, the Dempster Shafer theory of evidence yields the same results as traditional probability theory. In the presence of epistemic uncertainty the values of the two measures differ and form an interval of lower and upper bounds of probabilities. The actual value of probability is included in the interval composed of the Bel and Pl. Some examples for bpa function, the Bel function and the Pl can be found in the Appendix Table A1. An advantage of Dempster Shafer s theory of evidence is its capability of combining information from independent sources when epistemic uncertainty is present. Generally, the intention of data aggregation is to summarize and simplify information. Widely used aggregation methods are the evaluation of averages (either in arithmetic, geometric or harmonic form) or the selection of particular properties of the data (e.g., minimum, maximum or median of an empirical distribution). Combination rules can be seen as a derivation of such rather simple aggregation techniques. Their purpose is to aggregate evidence about the condition of a system obtained from multiple data origins. Examples for different sources of information depend strongly on the field of application. Their role can be taken by a group of experts (e.g., doctors), a number of sensors (airborne radar stations) or various administrative registers, which deliver information on certain attributes of statistical units of the population. The initial rule for the combination of evidence within the Dempster Shafer theory is the so-called Dempster rule (see Dempster, 1967). It can be regarded as a generalization of Bayes rule and conflates multiple Bel functions by aggregating their functions (see Equation 2). Dempster s rule is a strictly conjunctive procedure and accents agreement of multiple sources of information. bpa 1,2 (A) = (bpa 1 bpa 2 )(A) = 1 bpa 1 (B) bpa 2 (C). (2) 1 K B C = A Conflicting evidence is considered through the normalization factor 1 K, where K stands for the sum of bpas assorted with conflict, as can be seen in Equation 3. Set-theoretic, these are all products of bpas where the intersection equals.

8 Evidence from multiple administrative data sources 25 K = bpa 1 (B) bpa 2 (C). (3) B C = This property of Dempster s rule induced heavy criticism by Zadeh (1986) and Yager (1987). As a consequence, various combination rules were proposed in the literature, like, for example, Yager s rule or Dubois and Prade s disjunctive pooling rule. In the application on the quality measurement of combined administrative data sources, we disregard their conceivable arguments because Dempster s rule is associative (Joshi, Sahasarabudhe and Shankar, 1995). Consequently, the succession of the multiple sources has no impact on the results of our analysis. Examples for coinciding and conflicting evidence is provided in the Appendix (see Table A2). 3 Application For the register-based census we use the Dempster Shafer theory to combine quality indicators from different data sources. The quality framework aims to deliver a quality indicator for each attribute in the CDB, which contains information on the population in Austria (e.g., residency, sex, status of employment). It is filled from different administrative data sources (registers) based on a predefined ruleset. The quality indicators of the attributes in this CDB are derived based on the quality measures from the original registers. If there is only one base register available to compare with, the indicator also resembles the quality of the CDB. In the case of multiple attributes, several registers have information over the same attribute, for example, sex may be included in four registers [case (b), see section 1]. Since these sources can be regarded as different opinions (or beliefs) on a common subject (the attribute) it allows for the implementation of the Dempster Shafer theory. In a first step, we assign a certain mass of certainty (C) and uncertainty (U) to each attribute in each register, which is based on the quality measures of these attributes q ij. This yields 2 n (n being the number of registers with the same attribute) possible combinations of certainty and uncertainty. For the case of n = 2, that would be: CC (both certain), CU or UC (one register uncertain), UU (both registers uncertain). These different cases can be grouped into agreement, uncertainty, logical impossibility and conflicting evidence. This process depends on the values of the attribute in the different registers, for example, CC can be agreement (if both registers are sure the person is female ) but in another case it can be a logical impossibility (if one register is sure that the person is male and the other is sure that the person is female ). In some cases there are a lot of registers which contain information about the same attribute. If the number of registers becomes large, computational difficulties will arise because of the assignment of the constellations of certainty and uncertainty to the corresponding groups mentioned before. We solve them by creating a look-up table for each case, which contains possible combinations of certainty

9 26 Christopher Berka et al. Table 2. Symbol Bel ξ ω U n Used symbols Name Degree of belief Normalisation of uncertainty Logical impossibility Conflict Uncertainty (C) and uncertainty (U). In the second step, we simply take the combination (e.g. CCCC) from the look-up table, which corresponds to our actual case. Note that different registers may show differing values for the same observation. It is possible that register 1 is absolutely certain that an individual is male, while register 2 is sure that this person is female (case CC), which would be a logical impossibility. Therefore, it depends on the values within the different registers if CC can be regarded as agreement or logical impossibility. Accordingly, it is possible to calculate different beliefs, for example, the belief that register 1 shows the true value or that register 2 is correct. The combination rules are calculated based on the degree of agreement, uncertainty, logical impossibility and conflicting evidence. An example of the assignment to the different groups can be found in the Appendix. Bel = 1 ω U n. (4) 1 The general Equations 4 and 5 are applied to each observation for each attribute. The CDB marks the actual belief for a certain observation and therefore defines which belief is calculated. ξ = U n 1. (5) Accordingly, if an individual is male according to the CDB we check the reliability of this information using the comparison registers. For each observation, measures of Bel and Pl are constructed. These figures can be interpreted as a confidence interval for the accuracy of each observation k. The quality indicator for each observation is now computed as the mean of Bel and Pl: q Ψ,Ak = Bel(A k) + Pl(A k ). 2 The overall quality indicator for attribute A is computed as the average over the whole population. q Ψ,A = 1 n (Bel(A k ) + Pl(A k )). 2n k = 1 We will present not only the mean for each attribute (although it is the most important figure for our application as it represents the quality indicator for a multiple attribute within the CDB) but also other moments and distribution measures

10 Evidence from multiple administrative data sources 27 such as the standard deviation or quantiles. These indicators give information on the accuracy and deviations of our results and therefore deliver a more sophisticated picture of the quality assessment for an attribute in the CDB. 4 Results Suppose we derived the following quality indicators for the attribute sex in four registers (REG 1 REG 4 ) in the first step of the quality framework: see Table 3, where the columns represent different quality aspects. These quality measures are combined using weighted averages. In this case, we weighted each quality aspect equally. q i,sex is thus given by q i,sex = 1 3 HDD HDP HDE. There is no specific rational behind this weighting. One could apply sensitivity analyses to get an idea of the impact of the weights. However this will be of interest if more quality indicators for a greater number of attributes are available and is thus subjected to future research. For details on the calculations of these indicators, we refer to Berka et al. (2010). Since in this case we can use information on sex from four registers there exist 2 4 possible constellations of certainty (C) and uncertainty (U), as can be seen in Table 4. The bpas can then be derived by multiplying the q ij of the registers according to the certainty uncertainty setting, for example, for UUCC using the values from Table 3: bpa UUCC = ( ) ( ) = The decision whether we need to use q ij or its complementary probability is made by the CDB. Suppose the CDB regards a specific person as male. If this person is also male in REG 3 then the register has a certainty (C) ofq ij that this is correct. Another register REG 1 may believe the person is female, hence it is uncertain (U) with the value (1 q ij ) about the person being male. Accordingly, certainty and uncertainty are defined by the values of the register s attribute compared with the CDB. An interesting question is how frequent the different cases of supportive and conflicting evidence appear. In 99.7% we find congruent evidences in the register. (See bold values in Table A3.) Table 3. Quality indicators for sex in four selected registers Register HD D HD P HD E q i,sex REG REG REG REG

11 28 Christopher Berka et al. Following this short example we will present first results of the application of the Dempster Shafer theory on the quality framework for the Austrian census. We will again focus on the attribute sex for reasons of simplifications. Table 5 shows some distribution figures for the average q Ψ,sex of the upper bound (Pl) and lower bound (Bel), which can be interpreted as an aggregated quality indicator. The CDB contains observations on the attribute sex. The most important moment is the mean (μ) which gives an idea of the overall quality of the attribute sex within the CDB. However, the other measures show that even on the unit level the quality indicators are very high. For the 5% percentile the quality indicator is already very close to 1. This concentration is also supported by a very low standard deviation (σ). We get a few observations with extremely low quality measures while the majority has rather high quality measures. Accordingly, the mean is shifted to the left and falls below the 5% percentile. On the whole, our quality measures are extremely left-skewed. Consequently, we reach a high degree of confidence that the attribute sex has a very high quality within the CDB. Both measures, the mean as well as the deviation, provide important information on the quality. For another attribute we may find a high value for the mean but rather high deviations, which could indicate that one should take a further look at a certain subsample of the population. An extraction of the CDB is given by Table A4. The last three columns show the Bel as lower bound, the Pl as upper bound and the average q Ψ,sexk of both bounds, respectively. The rows 5 and 6 illustrate the impact of the application of the Dempster Shafer theory. In both cases, the majority of the registers as well as the CDB believe that the unit is female (2). However in row 5, one register believes that the unit is male (1). The problem here is that this register is the one with the highest rating according to the hyperdimensions (see Table 3). These Table 4. Possible combinations of certainty and uncertainty for four registers Constellation bpa Constellation bpa UUUU CUUU UUUC CUUC UUCU CUCU UUCC CUCC UCUU CCUU UCUC CCUC UCCU CCCU UCCC CCCC Note: bpa, basic probability assignment. Table 5. Results of the Dempster Shafer application for the attribute sex Measure of q Ψ,sex Value Measure of q Ψ,sex Value Observations Percentile μ Percentile σ Median Minimum Percentile Maximum 1 Percentile

12 Evidence from multiple administrative data sources 29 circumstances lead to a low degree of belief for the unit in row 5 compared with the one in row 6. This means that the differences in the degree of belief between units can become very high, only because of one differing opinion from a very trustfully register. 5 Conclusion This article investigated the application of the Dempster Shafer theory on the quality assessment of register data. With special focus on the Austrian Census in 2011 we applied this fuzzy logic approach to evaluate rulesets based on register information. In this context we calculated the beliefs that the ruleset s choice for an attribute is the true value based on the underlying data. This was done for the case of four registers which have different opinions over the true value of the attribute sex. The degree of belief we put in each register is derived from several hyperdimensions which try to condense all the available quality-related information in each register. Using these indicators we derive all possible combinations of certainty and uncertainty between the four base registers. In a next step we use combination rules from the Dempster Shafer framework to aggregate this (un-)certainty value. We found that the support for the values chosen in the CDB is relatively high for the attribute sex. From this it follows that the ruleset (on which the CDB is created) is rather good. Accordingly it assigns the most probable values for this attribute to each statistical unit, according to the base registers. Further research involves the investigation of case (c) (see section 1), where the CDB attribute is derived from various other variables. In this case, the raw register does not have an opinion on the attribute in the CDB itself. They only provide quality indicators for a part of the information that enters the according attribute in the CDB. References Berka, C., S. Humer, M. Lenk, M. Moser, H. Rechta and E. Schwerer (2010), A quality framework for statistics based on administrative data sources using the example of the Austrian Census Austrian Journal of Statistics 39, Dempster, A. P. (1967), Upper and lower probabilities induced by a multivariate mapping. Annals of Mathematical Statistics 38, Dempster, A. P. (1968), A generalization of Bayesian inference. Journal of the Royal Statistical Society, Series B (Methodological) 30, Hacking, I. (1975), The emergence of probability: a philosophical study of early ideas about probability, induction and statistical inference. Cambridge University Press, Cambridge. Helton, J. C. (1997), Uncertainty and sensitivity analysis in the presence of stochastic and subjective uncertainty, Journal of Statistical Compution and Simulation 57, Hokka, P. and M. Nieminen (2008), Measuring the quality of the Finnish population register with a survey. Special focus on non-response, in: European Conference on Quality in Official Statistics, Statistics Finland, Eurostat.

13 30 Christopher Berka et al. Joshi, A. V., S. C. Sahasarabudhe and K. Shankar (1995), Sensitivity of combination schemes under conflicting conditions and a new method, in: J. Wainer and A. Carvalho (eds), Advances in artificial intelligence: 12th Brazilian Symposium on artificial intelligence, Springer, Berlin. Klir, G. J. andm. J. Wierman (1998), Uncertainty-based information: elements of generalized information theory, Physica-Verlag, Heidelberg. Lenk, M. (2008), Methods of register-based census in Austria, Statistics Austria. Sentz, K. ands. Ferson (2002), Combination of evidence in Dempster Shafer theory, Sandia National Laboratories, Albuquerque. Shafer, G. (1992), Dempster Shafer theory, in: S. C. Shapiro (ed.), Encyclopedia of artificial intelligence, Wiley, New York. Yager, R. R. (1987), On the Dempster Shafer framework and new combination rules, Information Sciences 41, Zadeh, L. A. (1965), Fuzzy sets, Information and Control 8, Zadeh, L. A. (1986), A simple view of the Dempster Shafer theory of evidence and its implication for the rule of combination, AI Magazine 7, 85. Received: 26 May Revised: 10 October Appendix Dempster Shafer theory X = {true, false} 2 X = {, true, false, {true, false}} Basic probability assignment bpa 1 ( ) = 0 bpa 1 (true) = 0.9 = C bpa 1 (false) = 0 bpa 1 (true, false) = 0.1 = U Belief Bel 1 (A = ) = bpa 1 ( ) = 0 Bel 1 (A = true) = bpa 1 (true) + bpa 1 ( ) = 0.9 Bel 1 (A = false) = bpa 1 (false) + bpa 1 ( ) = 0 Bel 1 (A = true, false) = bpa 1 (true, false) + bpa 1 (true) + bpa 1 (false) + bpa 1 ( ) = 1 Plausibility Pl 1 (A = ) = bpa 1 ( ) = 0 Pl 1 (A = true) = bpa 1 (true) + bpa 1 (true, false) = 1

14 Evidence from multiple administrative data sources 31 Table A1. bpa, Bel and Pl Condition bpa Bel Pl Null True False True or false bpa, basic probability assingment; Bel, belief; pl, plausibility. Pl 1 (A = false) = bpa 1 (false) + bpa 1 (true, false) = 0.1 Pl 1 (A = true, false) = bpa 1 (true, false) + bpa 1 (true) + bpa 1 (false) + bpa 1 ( ) = 1 Dempster rule of combination Case I: Coinciding Evidence bpa 1,2 (true) = bpa 1 (t) bpa 2 (t) + bpa }{{} 1 (t) bpa 2 (t, f ) + bpa }{{} 1 (t, f ) bpa 2 (f ) }{{} CC CU UC bpa 1,2 (true, false) = bpa 1 (t, f ) bpa 2 (t, f ) }{{} UU Case II: Conflictive Evidence bpa 1 (t) bpa 2 (t) = = CC = logical impossibility bpa 1 (t, f ) bpa 2 (f ) = = UC bpa 1 (t) bpa 2 (t, f ) = = CU bpa 1 (t, f ) bpa 2 (t, f ) = = UU = uncertainty bel 1,2 (REG 1 = true) = bpa 1(t) bpa 2 (t, f ) 1 bpa 1 (t) bpa 2 (t) = CU 1 CC UC UU = 1 CC 1 CC bel 1,2 (REG 2 = true) = bpa 1(t, f ) bpa 2 (t) 1 bpa 1 (t) bpa 2 (t) = UC 1 CC CU UU = 1 CC 1 CC Application bel 1,2 (REG 1 and REG 2 = t, f ) = bpa 1(t, f ) bpa 2 (t, f ) 1 bpa 1 (t) bpa 2 (t) = UU 1 CC UC UU = 1 CC 1 CC REG 1 = REG 2 = REG 3

15 32 Christopher Berka et al. Table A2. Information from three Registers, quality indicator = 0.9 REG 1 REG 2 REG 3 bpa 1 C C C C C U C U C C U U U C C U C U U U C U U U Bel REG1, 2, 3 = REG 1 = REG 2 = REG 3 CCC + CCU + + UCU + UUC 1 U n {}}{ = 1 UUU ξ = U n {}}{ UUU 1 }{{} 1 {}}{ Bel REG1, 2 = 1 {}}{{}}{ (CCC + CUC + UCC) UUC UUU 1 (CCC + CUC + UCC) UUC Bel REG3 = 1 (CCC + CUC + UCC) UUU ξ = 1 (CCC + CUC + UCC) REG 1 = REG 2 = REG 3 ω U n CUU Bel REG1 = 1 (CCC + CCU + CUC) UUU ξ = 1 (CCC + CCU + CUC) Bel REG2, 3 = UCC + UCU + UUC 1 (CCC + CCU + CUC) REG 1 = REG 3 = REG 2 CUC + CUU + UUC Bel REG1, 3 = 1 (CCC + CCU + UCC) UUU ξ = 1 (CCC + CCU + UCC) Bel REG2 = UCU 1 (CCC + CCU + UCC)

16 Evidence from multiple administrative data sources 33 Results Table A3. Frequency table of constellations Constellation Frequency % Constellation Frequency % , , , ,406, ,577, Table A4. Extraction of the Census Database CDB R 1 R 2 R 3 R 4 CONST ω U n Bel Pl q Ψ,sexk

Reasoning with Uncertainty

Reasoning with Uncertainty Representing Uncertainty Manfred Huber 2005 1 Reasoning with Uncertainty The goal of reasoning is usually to: Determine the state of the world Determine what actions to take