Learning Conditional Probabilities from Incomplete Data: An Experimental Comparison

Marco Ramoni, Knowledge Media Institute
Paola Sebastiani, Statistics Department

Abstract

This paper compares three methods, the em algorithm, Gibbs sampling, and Bound and Collapse (bc), to estimate conditional probabilities from incomplete databases in a controlled experiment. Results show a substantial equivalence of the estimates provided by the three methods and a dramatic gain in efficiency using bc.

Reprinted from: Proceedings of Uncertainty 99: Seventh International Workshop on Artificial Intelligence and Statistics, Morgan Kaufmann, San Mateo, CA.

Address: Marco Ramoni, Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom MK7 6AA. Phone: +44 (1908), fax: +44 (1908), e-mail: m.ramoni@open.ac.uk, url:

1. INTRODUCTION

The estimation of conditional probabilities from a database of cases plays a central role in a variety of machine learning domains and approaches, from classification to clustering, neural networks, and the learning of Bayesian Belief Networks (bbns). This paper focuses on bbns, as they have been proved general enough to encompass a variety of models, including classification models (Friedman & Goldszmidt, 1996), clustering methods (Heckerman, 1996), and feed-forward neural networks (Buntine, 1994), so that the results presented here can easily be applied to other domains.

A Bayesian Belief Network (bbn) is defined by a set of variables $X = \{X_1, \ldots, X_I\}$ and a directed acyclic graph defining a model $M$ of conditional dependencies among the elements of $X$. A conditional dependency links a child variable $X_i$ to a set of parent variables $\Pi_i$, and it is defined by the conditional distributions of $X_i$ given the configurations of the parent variables. We consider discrete variables only, denote by $c_i$ the number of states of $X_i$, and abbreviate $X_i = x_{ik}$ as $x_{ik}$. The structure $M$ yields a factorization of the joint probability of a set of values $x_k = \{x_{1k}, \ldots, x_{Ik}\}$ of the variables in $X$ as $p(X = x_k) = \prod_{i=1}^{I} p(x_{ik} \mid \pi_{ij})$, where $\pi_{ij}$ denotes the one of the $q_i$ states of $\Pi_i$ appearing in $x_k$.

Suppose we are given a database of $n$ cases $D = \{x_1, \ldots, x_n\}$ and a graphical model $M$ specifying the dependencies among the variables $X$. Our task is to estimate from $D$ the conditional probabilities $\theta_{ijk} = p(x_{ik} \mid \pi_{ij}, \theta)$ defining the dependencies in the graph, where $\theta = (\theta_{ijk})$. When the database is complete, closed-form solutions allow efficient estimation of the conditional probabilities in time proportional to the size of the database. The Maximum Likelihood estimates (mle) of the $\theta_{ijk}$ are the values $\hat\theta_{ijk}$ that maximize the likelihood function $l(\theta) = \prod_{ijk} \theta_{ijk}^{\,n(x_{ik} \mid \pi_{ij})}$, where $n(x_{ik} \mid \pi_{ij})$ is the frequency of $(x_{ik}, \pi_{ij})$ in $D$. Thus $\hat\theta_{ijk} = n(x_{ik} \mid \pi_{ij}) / n(\pi_{ij})$, the relative frequency of the relevant cases, where $n(\pi_{ij}) = \sum_k n(x_{ik} \mid \pi_{ij})$ is the frequency of $\pi_{ij}$. The Bayesian approach generalizes the mle by introducing a flattening constant $\alpha_{ijk} > 0$ for each frequency, so that the estimate is computed as
$$\hat\theta_{ijk} = \frac{\alpha_{ijk} + n(x_{ik} \mid \pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) - c_i} \qquad (1)$$
where $\alpha_{ij} = \sum_k \alpha_{ijk}$. The quantity $(\alpha_{ijk} - 1)/(\alpha_{ij} - c_i)$ is interpreted as the prior probability believed by the learning system before seeing any data, and the learning process is regarded as the updating of this prior probability by means of Bayes' theorem. When the prior distribution of $(\theta_{ij1}, \ldots, \theta_{ijc_i})$ is Dirichlet with hyperparameters $(\alpha_{ij1}, \ldots, \alpha_{ijc_i})$, then (1) is the posterior mode, the Maximum a Posteriori (map) estimate, of $\theta_{ijk}$.
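
To make (1) concrete, here is a minimal sketch, not taken from the paper, that computes the map estimate of one conditional probability table from a complete database of counts; the function name, variable names and toy counts are hypothetical choices of this illustration.

```python
import numpy as np

def map_estimate(counts, alpha):
    """MAP estimate (1) of a conditional probability table.

    counts[j, k] = n(x_ik | pi_ij): frequency of child state k under parent
    configuration j in a *complete* database.
    alpha[j, k]  = Dirichlet hyperparameter alpha_ijk (> 0).
    """
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    c_i = counts.shape[1]                       # number of child states
    num = alpha + counts - 1.0
    den = alpha.sum(axis=1, keepdims=True) + counts.sum(axis=1, keepdims=True) - c_i
    return num / den

# Example: a binary child with two parent configurations.
counts = [[30, 10],
          [ 5, 15]]
alpha = [[2, 2],
         [2, 2]]
print(map_estimate(counts, alpha))   # each row sums to 1
```

With $\alpha_{ijk} = 1$ the same function returns the relative frequencies, i.e. the mle.
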
Unfortunately, the simplicity and efficiency of these closed-form solutions are lost when the database is incomplete, that is, when some entries are reported as unknown. In this case, the exact estimate is a mixture of the estimates given by (1) for each database generated by combining all possible values of the missing entries, and the computational cost of this operation grows exponentially in the number of missing data. Simple solutions to this problem are either to ignore the cases including missing entries or to ascribe the missing entries to an ad hoc dummy state. Both solutions introduce potentially dangerous bias in the estimate, as they ignore the fact that the missing entries may be relevant to the estimation of $\theta$.
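
The exponential blow-up can be seen with a small brute-force sketch, again not from the paper, with toy data chosen here: for a single discrete variable with $m$ missing entries, every completion of the database yields its own estimate (1), and the number of completions grows as $c^m$. The minimum and maximum over these estimates already hint at the interval that bc, described below, computes without enumeration.

```python
from itertools import product

def map_est(counts, alpha):
    """Equation (1) for a single distribution (no parents)."""
    n, a, c = sum(counts), sum(alpha), len(counts)
    return [(alpha[k] + counts[k] - 1) / (a + n - c) for k in range(c)]

# Toy: a binary variable observed 8 times, with 3 further entries missing.
observed = [1, 1, 2, 1, 2, 1, 1, 2]
n_missing = 3
states, alpha = [1, 2], [2, 2]

estimates = []
for completion in product(states, repeat=n_missing):   # 2**3 completed databases
    data = observed + list(completion)
    counts = [data.count(s) for s in states]
    estimates.append(map_est(counts, alpha)[0])         # estimate of p(X = 1)

print(len(estimates))                  # grows as len(states) ** n_missing
print(min(estimates), max(estimates))  # every completed-database estimate lies here
```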

Current methods rely on more sophisticated techniques, such as the em algorithm (Dempster, Laird, & Rubin, 1977) and Gibbs Sampling (gs) (Geman & Geman, 1984). Both em and gs provide reliable estimates of the parameters, and they are currently regarded as the most viable solutions to the problem of missing data. However, both of these iterative methods can be trapped in local minima, and detecting their convergence can be difficult. Furthermore, they rely on the assumption that data are missing at random (mar): within each parent configuration, the available data are a representative sample of the complete database, and the distribution of the missing data can therefore be inferred from the available entries (Little & Rubin, 1987). When this assumption fails, and the missing data mechanism is not ignorable (ni), the accuracy of these methods can dramatically decrease. Finally, the computational cost of these methods heavily depends on the absolute number of missing data, and this can prevent their scalability to large databases.

This situation motivated the recent development of a deterministic method, called Bound and Collapse (bc) (Ramoni & Sebastiani, 1998), to estimate conditional probability distributions from possibly incomplete databases. bc was proved to be extremely scalable, as the complexity of the algorithm does not depend on the number of missing data. This paper presents an experimental comparison of em, gs and bc on a real-world database. The aim of this comparison is to evaluate whether the significant computational advantage offered by bc reduces, in practice, the accuracy of the estimation.

2. METHODS

This paper compares the accuracy and the efficiency of em, gs, and bc. This section outlines the character of these three methods.

The em algorithm is an iterative method to compute mle and map estimates. It alternates an expectation step, in which unknown quantities depending on the missing entries are replaced by their expectations in the likelihood, with a maximization step, in which the likelihood completed in the expectation step is maximized with respect to the unknown parameters; the resulting estimates are used to replace the unknown quantities in the next expectation step, until the difference between successive estimates is smaller than a fixed threshold. The convergence rate of this process can be "painfully slow" (Thiesson, 1995), and several modifications have been proposed to increase its speed under certain circumstances.

gs is one of the most popular Markov Chain Monte Carlo methods for Bayesian inference. The algorithm treats each missing entry as a parameter to be estimated and iterates a stochastic process that provides a sample from the posterior distribution of $\theta$. This sample is used to compute empirical estimates of the posterior mean, the posterior mode, or any other function of the parameters. In practical applications, the algorithm is first iterated a number of times to reach stability (the burn-in), and then a final sample from the joint posterior distribution of the parameters is taken.
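As an illustration of the two steps just described, the following sketch, not taken from the paper, runs em on a single conditional distribution when only child entries are missing and the parent configuration is always observed; a full implementation would work on the whole network and handle arbitrary missing entries, and the names below are hypothetical.

```python
import numpy as np

def em_cpt(obs_counts, miss_counts, alpha, tol=1e-4, max_iter=500):
    """EM for one conditional probability table with missing child values only.

    obs_counts[j, k] = n(x_ik | pi_ij): fully observed cases.
    miss_counts[j]   = n(? | pi_ij): cases with the child entry missing.
    alpha[j, k]      = Dirichlet hyperparameters.
    Alternates an E-step (distribute missing cases by the current theta)
    and an M-step (MAP estimate (1) on the expected counts) until the
    change in theta falls below tol.
    """
    obs = np.asarray(obs_counts, float)
    miss = np.asarray(miss_counts, float)[:, None]
    alpha = np.asarray(alpha, float)
    c_i = obs.shape[1]
    theta = np.full_like(obs, 1.0 / c_i)            # uniform starting point
    for _ in range(max_iter):
        # E-step: expected counts = observed + missing cases spread by theta.
        completed = obs + miss * theta
        # M-step: MAP estimate (1) on the expected counts.
        new = (alpha + completed - 1.0) / (
            alpha.sum(axis=1, keepdims=True)
            + completed.sum(axis=1, keepdims=True) - c_i)
        if np.abs(new - theta).max() < tol:
            return new
        theta = new
    return theta

print(em_cpt([[30, 10], [5, 15]], [5, 4], [[2, 2], [2, 2]]))
```
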
Both em and gs provide reliable estimates of the parameters, and they are currently regarded as the most viable solutions to the problem of missing data. However, both of these iterative methods can be trapped in local minima, and detecting their convergence can be difficult. Furthermore, they rely on the assumption that the missing data mechanism is ignorable: within each observed parent configuration, the available data are a representative sample of the complete database, and the distribution of the missing data can therefore be inferred from the available entries (Little & Rubin, 1987). When this assumption fails, and the missing data mechanism is not ignorable (ni), the accuracy of these methods can dramatically decrease. Finally, the computational cost of these methods heavily depends on the absolute number of missing data, and this can prevent their scalability to large databases.

These limitations motivated the development of bc, a deterministic method to estimate conditional probabilities from an incomplete database. The method bounds the set of possible estimates consistent with the available information by computing the minimum and the maximum estimate that would be obtained from all possible completions of the database. This process returns probability intervals containing all possible estimates consistent with the available information. These bounds are then collapsed into a unique value via a convex combination of the extreme points, with weights depending on the assumed pattern of missing data. The intuition behind bc is that an incomplete database is still able to constrain the possible estimates within a set, and that the assumed pattern of missing data, encoded as probabilities of missing data, can be used to select a point estimate within the set of possible ones.

Let $X_i$ be a variable in $X$ with parent variables $\Pi_i$. Denote by $n(?\mid\pi_{ij})$ the frequency of cases in which only the entry on the child variable is missing, by $n(x_{ik}\mid ?)$ the frequency of cases in which only the parent configuration is unknown but can be completed as $\pi_{ij}$, and by $n(?\mid ?)$ the frequency of cases in which the entries on both $X_i$ and $\Pi_i$ are unknown and can be completed as $(x_{ik}, \pi_{ij})$.

Define the virtual frequencies
$$\overline{n}(x_{ik}\mid\pi_{ij}) = n(?\mid\pi_{ij}) + n(x_{ik}\mid ?) + n(?\mid ?),$$
$$\underline{n}(x_{ik}\mid\pi_{ij}) = n(?\mid\pi_{ij}) + \sum_{h \neq k}^{c_i} n(x_{ih}\mid ?) + n(?\mid ?).$$
Thus $\overline{n}(x_{ik}\mid\pi_{ij})$ is the maximum achievable frequency of $(x_{ik}, \pi_{ij})$ in the incomplete database, and it is used to compute the maximum of $p(x_{ik}\mid\pi_{ij})$. The virtual frequency $\underline{n}(x_{ik}\mid\pi_{ij})$ is used to compute the minimum estimate of $p(x_{ik}\mid\pi_{ij})$. It can easily be shown that the estimate of $p(x_{ik}\mid\pi_{ij})$ is bounded above by
$$p^{\bullet}(x_{ik}\mid\pi_{ij}, D) = \frac{\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) + \overline{n}(x_{ik}\mid\pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + \overline{n}(x_{ik}\mid\pi_{ij}) - c_i}$$
and below by
$$p^{\circ}(x_{ik}\mid\pi_{ij}, D) = \frac{\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + \underline{n}(x_{ik}\mid\pi_{ij}) - c_i}.$$

The minimum estimate of $\theta_{ijk}$ is achieved when the missing data mechanism is such that, with probability 1, all cases $(X_i = ?, \pi_{ij})$ and $(X_i = ?, \Pi_i = ?)$ are completed as $(x_{ih}, \pi_{ij})$ for $h \neq k$, all cases $(X_i = x_{ih}, \Pi_i = ?)$ with $h \neq k$ are completed as $(x_{ih}, \pi_{ij})$, and the cases $(X_i = x_{ik}, \Pi_i = ?)$ are completed as $(x_{ik}, \pi_{ij'})$ for some $j' \neq j$. Thus $\underline{n}(x_{ik}\mid\pi_{ij})$ is the total virtual frequency of cases with $\Pi_i = \pi_{ij}$ and $X_i = x_{ih}$, $h \neq k$, while the frequency of $(x_{ik}, \pi_{ij})$ is not augmented. The maximum estimate of $\theta_{ijk}$ is achieved when the missing data mechanism is such that, with probability 1, all cases $(X_i = ?, \pi_{ij})$, $(X_i = ?, \Pi_i = ?)$ and $(X_i = x_{ik}, \Pi_i = ?)$ are completed as $(x_{ik}, \pi_{ij})$, while, for $h \neq k$, the cases $(X_i = x_{ih}, \Pi_i = ?)$ are completed as $(x_{ih}, \pi_{ij'})$ for some $j' \neq j$. Thus $\overline{n}(x_{ik}\mid\pi_{ij})$ is the virtual frequency of $(x_{ik}, \pi_{ij})$, and hence of $\pi_{ij}$, while the frequencies of the other states with the same parent configuration are not augmented.

This probability interval contains all and only the possible estimates consistent with $D$; it is therefore sound, and it is the tightest estimable interval. The width of each interval is a function of the virtual frequencies, so that it accounts for the amount of information available in $D$ about the parameter to be estimated, and it represents an explicit measure of the quality of the probabilistic information conveyed by the database about that parameter.
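
The bound step can be sketched as follows for a single parent configuration; this is not code from the paper, and the function and argument names are hypothetical. Given the observed counts and the three kinds of incomplete cases, it returns the interval $[p^{\circ}, p^{\bullet}]$ for every state of the child variable.

```python
import numpy as np

def bound_step(n_obs, n_child_missing, n_parent_missing, n_both_missing, alpha):
    """Bound step of bc for one parent configuration pi_ij.

    n_obs[k]            = n(x_ik | pi_ij): fully observed cases.
    n_child_missing     = n(? | pi_ij): child missing, parent config observed.
    n_parent_missing[k] = n(x_ik | ?): child observed, parent config unknown
                          but completable as pi_ij.
    n_both_missing      = n(? | ?): both unknown, completable as (x_ik, pi_ij).
    Returns (p_low, p_high), the bounds for each child state.
    """
    n_obs = np.asarray(n_obs, float)
    n_par = np.asarray(n_parent_missing, float)
    alpha = np.asarray(alpha, float)
    c_i = len(n_obs)
    n_ij, a_ij = n_obs.sum(), alpha.sum()

    # Virtual frequencies: maximal and minimal completions of (x_ik, pi_ij).
    n_max = n_child_missing + n_par + n_both_missing                    # overline n
    n_min = n_child_missing + (n_par.sum() - n_par) + n_both_missing    # underline n

    p_high = (alpha + n_obs + n_max - 1.0) / (a_ij + n_ij + n_max - c_i)
    p_low = (alpha + n_obs - 1.0) / (a_ij + n_ij + n_min - c_i)
    return p_low, p_high

low, high = bound_step([30, 10], 5, [3, 2], 1, [2, 2])
print(low, high)   # every estimate consistent with D lies in [low, high]
```
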
The second step of bc collapses the intervals estimated in the bound step into point estimates, using a convex combination of the extreme estimates with weights depending on the assumed pattern of missing data. We assume that information about the missing data is encoded as a probability distribution describing, for each variable in the database, the probability of each completion: $p(x_{ik}\mid\pi_{ij}, X_i = ?) = \lambda_{ijk}$, with $k = 1, \ldots, c_i$ and $\sum_k \lambda_{ijk} = 1$. Note that this is only part of the information required about the distribution of the missing data. We have seen that incomplete cases can be of the forms $(x_{ik}, \Pi_i = ?)$, $(X_i = ?, \pi_{ij})$ or $(X_i = ?, \Pi_i = ?)$; a full description of the pattern of missing data would therefore require the probabilities $p(\pi_{ij}\mid\Pi_i = ?)$ and $p(x_{ik}\mid\pi_{ij}, \Pi_i = ?)$ as well as $\lambda_{ijk}$. We will show that the probabilities $\lambda_{ijk}$ alone can be used to obtain accurate estimates of $\theta_{ijk}$, provided we exclude, among the possible patterns of missing data, the extreme mechanisms that yield the lower bounds $p^{\circ}(x_{ik}\mid\pi_{ij}, D)$. In order to limit the amount of information required about the process underlying the missing data, we assume that if all incomplete cases $(x_{ik}, \Pi_i = ?)$ are completed as $(x_{ik}, \pi_{ij})$, then, for all $h \neq k$, the cases $(x_{ih}, \Pi_i = ?)$ cannot be completed as $(x_{ih}, \pi_{ij})$. This assumption allows us to derive new local lower bounds from the maximum probabilities, as follows.

Each maximum probability $p^{\bullet}(x_{ik}\mid\pi_{ij}, D)$ is obtained when all incomplete cases that could be completed as $(x_{ik}, \pi_{ij})$ are attributed to $(x_{ik}, \pi_{ij})$, while the observed frequencies of the other states of $X_i$ given $\pi_{ij}$ are not augmented. Thus, when $p(x_{ik}\mid\pi_{ij}) = p^{\bullet}(x_{ik}\mid\pi_{ij}, D)$, we have $p(x_{il}\mid\pi_{ij}) = p^{k}(x_{il}\mid\pi_{ij}, D)$ for $l \neq k$, where
$$p^{k}(x_{il}\mid\pi_{ij}, D) = \frac{\alpha_{ijl} + n(x_{il}\mid\pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + \overline{n}(x_{ik}\mid\pi_{ij}) - c_i}.$$
The maximum probabilities therefore induce $c_i$ probability distributions $\{p^{\bullet}(x_{ik}\mid\pi_{ij}, D), p^{k}(x_{ih}\mid\pi_{ij}, D), h \neq k\}$, for $k = 1, \ldots, c_i$. Denote by $\{p^{h}(x_{ik}\mid\pi_{ij}, D)\}$, $h \neq k$, the set of $c_i - 1$ local minima of $\hat\theta_{ijk}$, and let $p_{l}(x_{ik}\mid\pi_{ij}, D) = \min_h p^{h}(x_{ik}\mid\pi_{ij}, D)$. The distribution of the missing entries, in terms of $\lambda_{ijk}$, can now be used to identify a point estimate $\hat p(x_{ik}\mid\pi_{ij}, D, \lambda_{ijk})$ within the interval $[p_{l}(x_{ik}\mid\pi_{ij}, D), p^{\bullet}(x_{ik}\mid\pi_{ij}, D)]$ as
$$\hat p(x_{ik}\mid\pi_{ij}, D, \lambda_{ijk}) = \sum_{l \neq k} \lambda_{ijl}\, p^{l}(x_{ik}\mid\pi_{ij}, D) + \lambda_{ijk}\, p^{\bullet}(x_{ik}\mid\pi_{ij}, D). \qquad (2)$$
These estimates define a probability distribution, since $\sum_{k=1}^{c_i} \hat p(x_{ik}\mid\pi_{ij}, D, \lambda_{ijk}) = \sum_{k=1}^{c_i} \lambda_{ijk} = 1$.
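
The collapse step can then be sketched as follows, again with hypothetical code and names rather than the paper's implementation: it builds the $c_i$ extreme distributions induced by the maximum probabilities and returns the convex combination (2) for a given vector of completion probabilities.

```python
import numpy as np

def collapse_step(n_obs, n_child_missing, n_parent_missing, n_both_missing,
                  alpha, lam):
    """Collapse step of bc for one parent configuration.

    lam[k] = lambda_ijk = p(x_ik | pi_ij, X_i = ?), with sum(lam) = 1.
    Returns the point estimates (2), one per child state.
    """
    n_obs = np.asarray(n_obs, float)
    n_par = np.asarray(n_parent_missing, float)
    alpha = np.asarray(alpha, float)
    lam = np.asarray(lam, float)
    c_i = len(n_obs)
    n_ij, a_ij = n_obs.sum(), alpha.sum()
    n_max = n_child_missing + n_par + n_both_missing    # overline n(x_ik | pi_ij)

    # p_local[l, k]: estimate of p(x_ik | pi_ij) when state l takes its maximum.
    p_local = np.empty((c_i, c_i))
    for l in range(c_i):
        den = a_ij + n_ij + n_max[l] - c_i
        p_local[l] = (alpha + n_obs - 1.0) / den                      # p^l(x_ik)
        p_local[l, l] = (alpha[l] + n_obs[l] + n_max[l] - 1.0) / den  # p_bullet(x_il)

    # Equation (2): convex combination of the extreme estimates with weights lam.
    return lam @ p_local

est = collapse_step([30, 10], 5, [3, 2], 1, [2, 2], lam=[0.5, 0.5])
print(est, est.sum())   # a probability distribution (sums to 1)
```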

The intuition behind (2) is that the upper bound for $p(x_{ik}\mid\pi_{ij})$ is obtained when all incomplete cases are completed as $(x_{ik}, \pi_{ij})$. Thus, if $\lambda_{ijk} = p(x_{ik}\mid\pi_{ij}, X_i = ?) = 1$ for a particular $k$, then (2) returns the upper bound of the probability interval as the estimate of $p(x_{ik}\mid\pi_{ij})$, and $p^{k}(x_{ih}\mid\pi_{ij}, D)$ as the estimates of $p(x_{ih}\mid\pi_{ij})$, $h \neq k$. This corresponds to the assumption that data are systematically missing on $x_{ik}$, i.e. these entries are missing with probability 1. On the other hand, when no information on the mechanism generating the missing data is available, and therefore all patterns of missing data are equally likely, then $\lambda_{ijk} = 1/c_i$. As the number of missing entries decreases, $p^{\bullet}(x_{ik}\mid\pi_{ij}, D)$ and $p^{h}(x_{ik}\mid\pi_{ij}, D)$ approach $(\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) - 1)/(\alpha_{ij} + n(\pi_{ij}) - c_i)$, so that, when the database is complete, (2) returns the exact estimate $\hat\theta_{ijk}$. As the number of missing entries increases, $p^{h}(x_{ik}\mid\pi_{ij}, D) \to 0$ for all $h$ and $p^{\bullet}(x_{ik}\mid\pi_{ij}, D) \to 1$, so that the estimate (2) approaches the probability $\lambda_{ijk}$, and, coherently, nothing is learned from a database in which all entries on $(X_i, \pi_{ij})$ are missing. Finally, if $\overline{n}(x_{ik}\mid\pi_{ij}) = n(?\mid\pi_{ij})$, then (2) simplifies to
$$\frac{\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) + n(?\mid\pi_{ij})\,\lambda_{ijk} - 1}{\alpha_{ij} + n(\pi_{ij}) + n(?\mid\pi_{ij}) - c_i} \qquad (3)$$
which is the expected map estimate, given the assumed pattern of missing data.

When data are mar, the incomplete samples within parent configurations are representative of the complete but unknown ones, so that the probability $\lambda_{ijk}$ can be estimated from the available data as
$$\hat\lambda_{ijk} = \frac{\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) - c_i}. \qquad (4)$$
Then $\hat\lambda_{ijk}$ can be used to compute (2). Thus, when the unreported data are mar, bc estimates are corrections of the estimates computed from the observed data. In particular, if $\overline{n}(x_{ik}\mid\pi_{ij}) = n(?\mid\pi_{ij})$, then (3) reduces to (1), and to the standard mle when $\alpha_{ijk} = 1$ (Little & Rubin, 1987). The estimates computed by (2) with $\lambda_{ijk}$ replaced by $\hat\lambda_{ijk}$ are the expected estimates and, as the database grows, they approach the exact ones, as the gs and em estimates do. The advantage of bc is that, as a deterministic method, it does not pose any problem of convergence rate and detection, and it reduces the cost of estimating the conditional distribution of $X_i$ given $\pi_{ij}$ to the cost of one exact Bayesian updating and one convex combination for each state of $X_i$.

Since we consider only the extreme estimates induced by the maximum probabilities, the information needed in the collapse step is limited to the probabilities $\lambda_{ijk}$, and there is no need to specify probabilities of completions for cases in which the parent configuration is unknown. This limited amount of information about the process underlying the missing data implies that $\hat p(x_{ik}\mid\pi_{ij}, D, \lambda_{ijk}) \in [p_{l}(x_{ik}\mid\pi_{ij}, D), p^{\bullet}(x_{ik}\mid\pi_{ij}, D)]$. Thus $\hat p(x_{ik}\mid\pi_{ij}, D, \lambda_{ijk})$ cannot be equal to $p^{\circ}(x_{ik}\mid\pi_{ij}, D)$ unless $p_{l}(x_{ik}\mid\pi_{ij}, D) = p^{\circ}(x_{ik}\mid\pi_{ij}, D)$. However,
$$p_{l}(x_{ik}\mid\pi_{ij}, D) = \frac{\alpha_{ijk} + n(x_{ik}\mid\pi_{ij}) - 1}{\alpha_{ij} + n(\pi_{ij}) + \max_{h \neq k} \overline{n}(x_{ih}\mid\pi_{ij}) - c_i},$$
which shows that the difference between $p^{\circ}(x_{ik}\mid\pi_{ij}, D)$ and $p_{l}(x_{ik}\mid\pi_{ij}, D)$ depends only on those cases in which the state of the child variable is known and the parent configuration is not. Note that $p_{l}(x_{ik}\mid\pi_{ij}, D) = p^{\circ}(x_{ik}\mid\pi_{ij}, D)$ if there is no missing data mechanism consistent with $\underline{n}(x_{ik}\mid\pi_{ij})$, which is the assumption made above. The difference $p_{l}(x_{ik}\mid\pi_{ij}, D) - p^{\circ}(x_{ik}\mid\pi_{ij}, D)$ is positive; it is easy to show that it is maximized when $n(x_{ik}\mid ?) = m$ for all $k$, so that it approaches 0 at the same rate as $m$, and it becomes negligible as the database increases. Furthermore, if we happen to know that, for some missing data mechanism, $\hat\theta_{ijk} = p^{\bullet}(x_{ik}\mid\pi_{ij}, D)$ for some $k$, then, for all $h \neq k$ and all $j$, $p_{l}(x_{ih}\mid\pi_{ij}, D) < p(x_{ih}\mid\pi_{ij}) < p^{\bullet}(x_{ih}\mid\pi_{ij}, D)$, so that all the other estimates are bounded below by $p_{l}(x_{ih}\mid\pi_{ij}, D)$. Thus, any error would be limited to at most one conditional probability for each parent-child configuration.
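
When the data are mar and only child entries are missing, the claim that (3), with $\lambda_{ijk}$ estimated by (4), reduces to (1) can be checked numerically. The short sketch below, not from the paper, does this for one parent configuration with toy counts chosen here.

```python
# Numerical check: with lambda estimated by (4), the bc estimate (3) coincides
# with the complete-data estimate (1) when only child entries are missing,
# so that overline n(x_ik | pi_ij) = n(? | pi_ij).
import numpy as np

alpha = np.array([2.0, 2.0])          # Dirichlet hyperparameters alpha_ijk
n_obs = np.array([30.0, 10.0])        # n(x_ik | pi_ij), observed cases
n_miss = 5.0                          # n(? | pi_ij), cases with the child missing
c_i, a_ij, n_ij = len(n_obs), alpha.sum(), n_obs.sum()

map_est = (alpha + n_obs - 1.0) / (a_ij + n_ij - c_i)    # equation (1)
lam_hat = (alpha + n_obs - 1.0) / (a_ij + n_ij - c_i)    # equation (4)
bc_est = (alpha + n_obs + n_miss * lam_hat - 1.0) / (a_ij + n_ij + n_miss - c_i)  # (3)

print(map_est, bc_est)                # identical up to rounding
assert np.allclose(map_est, bc_est)
```
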
3. EXPERIMENTAL EVALUATION

The aim of these experiments is to evaluate the estimation accuracy of bc, the em algorithm and gs as the available information in the database decreases. We used a database reporting the values of six risk factors for coronary disease in 1841 employees of a Czechoslovakian car factory during a follow-up study (Whittaker, 1990, page 261). All variables are binary, and they represent Anamnesis (X1), Strenuous Mental Work (X2), Ratio of Beta and Alpha Lipoproteins (X3), Strenuous Physical Work (X4), Smoking (X5) and Systolic Blood Pressure (X6). The database is complete, and we extracted the most probable directed graphical model, depicted in Figure 1, using the K2 algorithm (Cooper & Herskovits, 1992). The 15 conditional dependencies specified by this bbn were then quantified from the complete database. Uniform prior distributions were assumed for the parameters.

Figure 1: The bbn used for the evaluation.

We then used the bbn in Figure 1 to run two different learning tests in order to evaluate the accuracy of bc relative to gs and em, using the implementation of accelerated em in games (Thiesson, 1995), the implementation of gs in bugs (Thomas, Spiegelhalter, & Gilks, 1992), and the implementation of bc in BKD. The convergence threshold for em was $10^{-4}$. Results of gs are based on a first burn-in of 5,000 iterations, sufficient to reach stability, and a successive sample of 5,000 cases.

The aim of the first test was to compare the estimation accuracy of the three methods when the missing data mechanism is ignorable.

For this purpose, four incomplete databases were created by incrementally deleting data from the complete database. A vector of 15 numbers between 0 and 1 was randomly generated, and its elements were taken as the probabilities of deleting the occurrences of each variable $X_i$, independently of its value, given the parent configuration, in 10%, 20%, 30% and 40% of the database. This process generated four databases with 9% (1004), 18% (2035), 28% (3041) and 37% (4092) missing entries, in which the observed entries within parent configurations are representative of the complete samples.

The rationale of the second test was to compare the robustness of these methods under a non-ignorable missing data mechanism. Four samples were generated from the complete database by incrementally deleting, respectively, 25%, 50%, 75% and 100% of the entries $(X5 = 2, X6 = 1)$ with probability 0.9 and of the entries $(X5 = 2, X6 = 2)$ with probability 0.1. This process generated four databases with 3% (278), 5% (532), 7% (790) and 9% (1030) missing entries.
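
A simplified sketch of how an ignorable deletion process of this kind can be reproduced follows; this is not the paper's procedure (which deletes per conditional dependency and in increasing fractions of the database), and the names and deletion rates are hypothetical. Entries are deleted independently of their values, so the observed entries remain representative.

```python
import numpy as np

rng = np.random.default_rng(0)

def delete_ignorable(data, del_prob):
    """Return a copy of `data` (cases x variables) in which entry (case, i)
    is replaced by None (missing) with probability del_prob[i],
    independently of its value."""
    out = np.asarray(data, dtype=object).copy()
    mask = rng.random(out.shape) < np.asarray(del_prob)
    out[mask] = None
    return out

# Toy usage: 1841 cases of six binary variables, 10% deletion per variable.
complete = rng.integers(1, 3, size=(1841, 6))
incomplete = delete_ignorable(complete, del_prob=[0.1] * 6)
print(sum(v is None for v in incomplete.ravel()) / incomplete.size)  # fraction missing
```
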
The estimation accuracy is measured by comparing the exact joint probability distribution of $(X1, \ldots, X6)$ with those learned from the incomplete data using gs, em and bc. Table 1 reports summary statistics of the results obtained on the two extreme databases generated in each test. The first five rows of Table 1 report the minimum, first quartile, median, third quartile and maximum of the absolute differences between the 64 exact joint probabilities and those obtained from the bbns learned with gs, em and bc; the row labelled X-Ent. reports the cross entropy between the joint probability distributions.

The predictive accuracy was evaluated by comparing the predictive probabilities of X6 obtained by the three methods with those computed by the bbn extracted from the complete database. Given the conditional independence assumptions embedded in the bbn, 43 possible evidence sets can be considered: 10 are given by observing one of the variables X1, ..., X5 alone, 32 by observing one of the pairs (X1, X3), (X1, X4), (X1, X5), (X2, X3), (X2, X4), (X2, X5), (X3, X4), (X3, X5), and one is the empty evidence set, for which the marginal probability of X6 is returned. Bounds on the predictive probabilities were also computed by propagating the bc upper bounds for each conditional dependency. Summary statistics of the absolute errors are given in the second half of Table 1.

Table 1: Summary statistics of estimation and prediction errors (minimum, first quartile, median, third quartile, maximum, and, for the estimation errors, cross entropy) for gs, em and bc on the ignorable databases with 9% and 37% missing entries and on the non-ignorable databases with 3% and 9% missing entries.
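
The accuracy measures above can be computed as in this sketch, which is not the authors' code and takes "cross entropy" to mean $-\sum_x p(x)\log q(x)$, a definition the paper does not spell out: it summarizes the absolute differences between the exact and the learned joint distributions and reports their cross entropy.

```python
import numpy as np

def error_summary(p_exact, p_learned):
    """Summary statistics in the style of Table 1: quantiles of the absolute
    differences between two joint distributions, plus the cross entropy
    -sum_x p_exact(x) * log p_learned(x)."""
    p_exact = np.asarray(p_exact, float)
    p_learned = np.asarray(p_learned, float)
    abs_err = np.abs(p_exact - p_learned)
    q = np.quantile(abs_err, [0.0, 0.25, 0.5, 0.75, 1.0])
    x_ent = -np.sum(p_exact * np.log(p_learned))
    return dict(zip(["min", "1st_qu", "median", "3rd_qu", "max"], q)), x_ent

# Toy usage with two distributions over the 64 joint states of six binary variables.
rng = np.random.default_rng(1)
p_true = rng.dirichlet(np.ones(64))
p_est = rng.dirichlet(np.ones(64))
print(error_summary(p_true, p_est))
```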

When 9% of the data are missing and the missing data mechanism is ignorable, the distributions of the errors incurred by the three methods differ only at the fourth decimal place, and a slightly higher maximum value for bc is balanced by a lower median. The difference becomes a little more evident when 37% of the data are missing. Nonetheless, the median difference remains small and, most of all, these differences do not affect the predictive power of the bbn extracted by bc, which, if anything, scores a slightly better median performance, by a margin larger than any difference among the estimates. This overall equivalence among the three methods is confirmed when the missing data are ni: the error distributions are almost identical, and the impact of a wrong assumption about the pattern of missing data affects all methods equally. However, bc provides bounds on the predicted values which reflect their reliability and can be taken into account during the reasoning process. It must also be remarked that bc does not rely per se on the ignorability assumption, and that the available information about the missing data could have been exploited by bc to achieve a better performance.

The main difference among the three methods highlighted by the experiments is the execution time and, most of all, the shape of its growth curve, which shows that the execution time of bc is independent of the number of missing data (Ramoni & Sebastiani, 1998). Table 2 shows the execution time for all eight databases generated during the experiments. The first column reports the absolute number of missing entries deleted from the complete database. Results are in hours:minutes:seconds format.

Table 2: Execution time for each method on the eight incomplete databases (hours:minutes:seconds).

4. CONCLUSIONS

The first character of bc is its ability to represent explicitly and separately the information available in the database and the assumed pattern of missing data, so that bc does not need to rely on the ignorability assumption, although ignorability can easily be represented, as can any other pattern. Furthermore, bc reduces the cost of estimating each conditional distribution of $X_i$ to the cost of one exact Bayesian updating and one convex combination for each state of $X_i$ in each parent configuration. This deterministic process does not pose any problem of convergence rate and detection, and its computational complexity is independent of the number of missing data. The experimental comparison with em and gs showed a substantial equivalence of the estimates provided by the three methods and a dramatic gain in efficiency using bc.

References

Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159-225.

Cooper, G., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309-347.

Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Friedman, N., & Goldszmidt, M. (1996). Building classifiers using Bayesian networks. In Proceedings of the National Conference on Artificial Intelligence, pp. 1277-1284, San Mateo, CA. Morgan Kaufmann.

Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Heckerman, D. (1996). Bayesian networks for knowledge discovery. In Advances in Knowledge Discovery and Data Mining, pp. 153-180. MIT Press, Cambridge, MA.

Little, R., & Rubin, D. (1987). Statistical Analysis with Missing Data. Wiley, New York.

Ramoni, M., & Sebastiani, P. (1998). Parameter estimation in Bayesian networks from incomplete databases. Intelligent Data Analysis, 2(1).

Thiesson, B. (1995). Accelerated quantification of Bayesian networks with incomplete data. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pp. 306-311. AAAI Press.

Thomas, A., Spiegelhalter, D., & Gilks, W. (1992). BUGS: A program to perform Bayesian inference using Gibbs sampling. In Bayesian Statistics 4, pp. 837-842. Clarendon Press, Oxford.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics. Wiley, New York.
