Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources

Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, and Paolo Papotti

Università degli Studi Roma Tre, Via della Vasca Navale, 79 – Rome, Italy

Abstract. Several techniques have been developed to extract and integrate data from web sources. However, web data are inherently imprecise and uncertain. This paper addresses the issue of characterizing the uncertainty of data extracted from a number of inaccurate sources. We develop a probabilistic model to compute a probability distribution for the extracted values, and the accuracy of the sources. Our model considers the presence of sources that copy their contents from other sources, and manages the misleading consensus produced by copiers. We extend the models previously proposed in the literature by working on several attributes at a time to better leverage all the available evidence. We also report the results of several experiments on both synthetic and real-life data to show the effectiveness of the proposed approach.

1 Introduction

As the Web offers increasing amounts of data, important applications can be developed by integrating data provided by a large number of data sources. However, web data are inherently imprecise, and different sources can provide conflicting information. Resolving conflicts and determining which values are likely to be true is a crucial issue in providing trustworthy reconciled data. Several proposals have been developed to discover the true value among those provided by a large number of conflicting data sources. Some solutions extend a basic vote-counting strategy by recognizing that values provided by accurate sources, i.e. sources with low error rates, should be weighted more than those provided by others [10,12]. Recently, principled solutions have been introduced that also consider the presence of sources that copy from other sources [6]. As observed in [1], this is a critical issue, since copiers can cause misleading consensus on false values. However, these solutions still suffer some limitations, as they are based on a model that considers only simple atomic data. Sources are seen as providers that supply data about a collection of objects, i.e. instances of a real-world entity, such as a collection of stock quotes. Nevertheless, it is assumed that objects are described by just one attribute, e.g. the price of a stock quote. On the contrary, data sources usually provide complex data, i.e. collections of tuples with many attributes. For example, sources that publish stock quotes always deliver values for price, volume, max and min values, and many other attributes.

B. Pernici (Ed.): CAiSE 2010, LNCS 6051, pp. 83–97, © Springer-Verlag Berlin Heidelberg 2010

Existing solutions, focused on a single attribute only, turn out to be rather restrictive, as different attributes, by their very nature, may exhibit drastically different properties and evidence of dependence. This statement is validated by the observation that state-of-the-art algorithms, when executed on real datasets, lead to different conclusions if applied on different attributes published by the same web sources.

Fig. 1. Four sources (Source 1–Source 4) reporting stock quote values (volume, min, max) for the symbols AAL, GOOG, and YHOO, together with a fifth table of true values; errors are in bold.

This behavior can be caused by two main reasons: lack of evidence (copiers are missed) or misleading evidence (false copiers are detected). In Figure 1 we make use of an example to illustrate the issues: four distinct sources report financial data for the same three stocks. For each stock symbol, three attributes are reported: the volume and the min and max values of the stock. The fifth table shows the true values for the considered scenario; such information is not available in general, but in this example we consider it as given to facilitate the discussion.

Consider now the first attribute, the stock volume. It is easy to notice that Source 1 and Source 2 report the same false value for the volume of GOOG (errors are in bold). Following the intuition from [6], according to which copiers can be detected because sources share false values, they should be considered copiers. Conversely, observe that Source 3 and Source 4 report only true values for the volume, and therefore there is no significant evidence of dependence. The scenario radically changes if we look at the other attributes. Source 3 and Source 4 report the same incorrect values for the max attribute, and they also make a common error on the min attribute. Source 4 also independently reports an incorrect value for the min value of AAL. In this scenario our approach concludes that Source 3 and Source 4 are certainly dependent, while the dependency between Source 1 and Source 2 would be very low. Using previous approaches and looking only at the volume attribute, Source 1 and Source 2 would have been reported as copiers because they share the same formatting rule for such data (i.e., false copiers detected), while Source 3 and Source 4 would have been considered independent sources (i.e., real copiers missed).

In this paper, we extend previous proposals to deal with sources providing complex data, without introducing any remarkable computation effort. We formally describe our algorithms and give a detailed comparison with previous proposals. Finally, we show experimentally how the evidence accumulated from several attributes can significantly improve the performance of the existing approaches.

Paper Outline. The paper is organized as follows: Section 2 illustrates related work. Section 3 describes our probabilistic model to characterize the uncertainty of data in a scenario without copier sources. Section 4 illustrates our model for analyzing the copying relationships on several attributes. Section 5 presents the results of the experimental activity. Section 6 concludes the paper.

2 Related Work

Many projects have been active in the study of imprecise databases and have achieved a solid understanding of how to represent uncertain data (see [5] for a survey on the topic). The development of effective data integration solutions making use of probabilistic approaches has also been addressed by several projects in recent years. In [8] the redundancy between sources is exploited to gain knowledge, but with a different goal: given a set of text documents, they assess the quality of the extraction process. Other works propose probabilistic techniques to integrate data from overlapping sources [9]. On the contrary, so far there has been little focus on how to populate such databases with sound probabilistic data. Even if this problem is strongly application-specific, there is a lack of solutions also in the popular fields of data extraction and integration. Cafarella et al. have described a system to populate a probabilistic database with data extracted from the Web [4], but they do not consider the problems of combining different probability distributions and evaluating the reliability of the sources.

TruthFinder [12] was the first project to address the issue of discovering true values in the presence of multiple sources providing conflicting information. It is based on an iterative algorithm that exploits the mutual dependency between source accuracy and consensus among sources. Other approaches, such as [11] and [10], presented fixpoint algorithms to estimate the true values of data reported by a set of sources, together with the accuracy of the sources. These approaches do not consider source dependencies, and they all deal with simple data. Some of the intuitions behind TruthFinder were formalized by Dong et al. [6] in a probabilistic Bayesian framework, which also takes into account the effects related to the presence of copiers among the sources. Our probabilistic model is based on such a Bayesian framework and extends it to the case of sources that provide complex data. A further development by the same authors also considers the variation of truth values over time [7]. This latter investigation applies to time-evolving data and can lead to the identification of outdated sources.

3 Probabilistic Models for Uncertain Web Data

In our setting, a source that provides the values of a set of properties for a collection of objects is modeled as a witness that reports an observation. For example, on the Web there are several sources that report the values of price, volume, and dividend for the NASDAQ stock quotes. We say that these sources are witnesses of all the cited properties for the NASDAQ stock quotes. Different witnesses can report inconsistent observations; that is, they can provide inconsistent values for one or more properties of the same object. We aim at computing: (i) the probability that the observed properties of an object assume certain values, given a set of observations that refer to that object from a collection of witnesses; (ii) the accuracy of a witness with respect to each observed property, that is, the probability that the witness provides the correct values of each observed property for a set of objects. With respect to the running example, we aim at computing the probability distributions for the volume, min, and max values of each observed stock quote, given the observations of the four witnesses illustrated in Figure 1. Also, for each witness, we aim at computing its accuracy in providing a correct value for the volume, min, and max properties.

We illustrate two models of increasing complexity. In the first model we assume that each witness provides its observations independently of all the other witnesses (independent witnesses assumption). Then, in Section 4, based on the framework developed in [6], we remove this assumption and also consider the presence of witnesses that provide values by copying from other witnesses. The first model is developed considering only one property at a time, since we assume that a witness can exhibit different accuracies for different properties. More properties at a time are taken into account in the second model, which also considers the copying dependencies. As we discussed in the example of Figure 1, considering more properties in this step can greatly affect the results of the other steps, and our experimental results confirm this intuition, as we report in Section 5.

For each property, we use a discrete random variable $X$ to model the possible values it assumes for the observed object. $P(X = x)$ denotes the prior probability distribution of $X$ over the $x_1, \dots, x_{n+1}$ possible values, of which one is correct, denoted $x_t$, and the other $n$ are incorrect. Also, let $\dot{x}$ denote the event $(X = x = x_t)$, i.e. the event that $X$ assumes the value $x$ and $x$ is the correct value $x_t$. The individual observation of a witness is denoted $o$; also, $v(o)$ is used to indicate the reported value. The accuracy of a witness $w$, denoted $A$, corresponds to the conditional probability that the witness reports $x_t$, given $\dot{x}$; that is: $A = P(o \mid \dot{x})$, with $v(o) = x_t$.

In the following we make several assumptions. First, we assume that the values provided by a witness for an object are independent of the values provided for the other objects (independent values assumption). Also, we assume that the value provided by a witness for a property of an object is independent of the values provided for the other properties of the same object, and that the accuracy of a witness for a property is independent of the property values (independent properties assumption). Finally, for the sake of simplicity, we consider a uniform distribution, so that $P(X = x) = \frac{1}{n+1}$ for every $x$, and we assume that the cardinality of the set of values provided by the witnesses is a good estimate of $n$ (uniform distribution assumption).
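As a worked instance of these definitions (the numbers are purely illustrative, not taken from the paper):

```latex
% Illustrative numbers only: suppose n = 3 incorrect values are in play.
% The uniform distribution assumption gives the prior
\[
  P(X = x) \;=\; \frac{1}{n+1} \;=\; \frac{1}{4}
  \qquad \text{for each } x \in \{x_1, x_2, x_3, x_4\},
\]
% and a witness with accuracy A = 0.7 reports the correct value x_t with
% probability 0.7, leaving probability 1 - A = 0.3 to be spread over the
% n = 3 incorrect values, i.e. 0.1 for each incorrect value.
```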

Given an object, the larger the number of witnesses that agree on the same value, the higher the probability that the value is correct. However, the agreement of the witnesses' observations contributes to increasing the probability that a value is correct in a measure that depends also on the accuracy of the involved witnesses. The accuracy of a witness is evaluated by comparing its observations with the observations of other witnesses for a set of objects. A witness that frequently agrees with other witnesses is likely to be accurate. Based on these ideas of mutual dependence between the analysis of the consensus among witnesses and the analysis of the witnesses' accuracy, we have developed an algorithm [3] that computes the probability distributions for the properties of every observed object and the accuracies of the witnesses. Our algorithm takes as input the observations of some witnesses on multiple properties of a set of objects, and is composed of two main steps:

1. Consensus Analysis: based on the agreement of the witnesses among their observations on individual objects and on the current accuracy of the witnesses, compute the probability distribution for the properties of every object (Section 3.1);
2. Accuracy Analysis: based on the current probability distributions of the observed object properties, evaluate the accuracy of the witnesses (Section 3.2).

The iterations are repeated until the accuracies of the witnesses do not significantly change anymore.

3.1 Probability Distribution of the Values

The following development refers to the computation of the probability distribution for the values of one property of an object, given the observations of several witnesses and the accuracies of the witnesses with respect to that property. The same process can be applied for every object and property observed by the witnesses. Given a set of witnesses $w_1, \dots, w_k$, with accuracies $A_1, \dots, A_k$, that report a set of observations $o_1, \dots, o_k$, our goal is to calculate $P(\dot{x} \mid \bigcap_{i=1}^{k} o_i)$; i.e. we aim at computing the probability distribution of the values an object may assume, given the values reported by $k$ witnesses. First, we can express the desired probability using Bayes' Theorem:

\[
P\Big(\dot{x} \,\Big|\, \bigcap_{i=1}^{k} o_i\Big) = \frac{P\big(\bigcap_{i=1}^{k} o_i \mid \dot{x}\big)\, P(\dot{x})}{P\big(\bigcap_{i=1}^{k} o_i\big)} \tag{1}
\]

The events $\dot{x}_j$, with $j = 1 \dots n+1$, form a partition of the event space. Thus, according to the Law of Total Probability:

\[
P\Big(\bigcap_{i=1}^{k} o_i\Big) = \sum_{j=1}^{n+1} P\Big(\bigcap_{i=1}^{k} o_i \,\Big|\, \dot{x}_j\Big)\, P(\dot{x}_j) \tag{2}
\]

Assuming that the observations of all the witnesses are independent,¹ for each event $\dot{x}$ we can write:

\[
P\Big(\bigcap_{i=1}^{k} o_i \,\Big|\, \dot{x}\Big) = \prod_{i=1}^{k} P(o_i \mid \dot{x}) \tag{3}
\]

Therefore:

\[
P\Big(\dot{x} \,\Big|\, \bigcap_{i=1}^{k} o_i\Big) = \frac{P(\dot{x}) \prod_{i=1}^{k} P(o_i \mid \dot{x})}{\sum_{j=1}^{n+1} P(\dot{x}_j) \prod_{i=1}^{k} P(o_i \mid \dot{x}_j)} \tag{4}
\]

$P(\dot{x})$ is the prior probability that $X$ assumes the value $x$; then $P(\dot{x}) = P(\dot{x}_j) = \frac{1}{n+1}$. $P(o_i \mid \dot{x})$ represents the probability distribution that the $i$-th witness reports a value $v(o_i)$. Observe that if $v(o_i) = x_t$, i.e. the witness reports the correct value, the term coincides with the accuracy $A_i$ of the witness. Otherwise, i.e. if $v(o_i) \neq x_t$, $P(o_i \mid \dot{x})$ corresponds to the probability that the witness reports an incorrect value. In this case, we assume that $v(o_i)$ has been selected randomly from the $n$ incorrect values of $X$. Since $P(o_i \mid \dot{x})$ is a probability distribution, $\sum_{v(o_i) \neq x_t} P(o_i \mid \dot{x}) = 1 - A_i$. Assuming that every incorrect value is selected according to the uniform prior probability distribution, we can conclude:

\[
P(o_i \mid \dot{x}) =
\begin{cases}
A_i, & v(o_i) = x_t \\[2pt]
\dfrac{1 - A_i}{n}, & v(o_i) \neq x_t
\end{cases} \tag{5}
\]

Combining (4) and (5), we obtain the final expression to compute $P(\dot{x} \mid \bigcap_{i=1}^{k} o_i)$.

3.2 Witnesses Accuracy

We now illustrate the evaluation of the accuracy of the witnesses with respect to one property, given their observations for that property on a set of objects, and the probability distributions associated with the values of each object, computed as discussed in the previous section. Our approach is based on the intuition that the accuracy of a witness can be evaluated by considering how its observations for a number of objects agree with those of other witnesses. Indeed, assuming that a number of sources independently report observations about the same property (e.g. the trade value) of a shared set of objects (e.g. the NASDAQ stock quotes), these observations are unlikely to agree by chance. Therefore, the higher the probabilities of the values reported by a witness, the higher the accuracy of the witness.

¹ This assumption is a simplification of the domain that we will remove later by extending our model to deal with witnesses that may copy.
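To make the consensus step concrete, the following is a minimal sketch (not the authors' implementation; function names and the toy data are illustrative) of equations (4) and (5) for a single property of a single object:

```python
def value_distribution(reported_values, accuracies, n):
    """reported_values[i] is v(o_i); accuracies[i] is A_i; n is the
    assumed number of incorrect values (uniform distribution assumption)."""
    candidates = set(reported_values)
    scores = {}
    for x in candidates:
        # Eq. (5): P(o_i | x true) = A_i if witness i reports x, (1-A_i)/n otherwise
        p = 1.0
        for v, a in zip(reported_values, accuracies):
            p *= a if v == x else (1.0 - a) / n
        scores[x] = p
    # Values never reported by any witness also appear in the normalization
    # of Eq. (4); for each of them the product is prod_i (1 - A_i)/n, and
    # there are (n + 1 - |candidates|) of them. The prior 1/(n+1) cancels.
    p_unseen = 1.0
    for a in accuracies:
        p_unseen *= (1.0 - a) / n
    total = sum(scores.values()) + max(0, n + 1 - len(candidates)) * p_unseen
    return {x: p / total for x, p in scores.items()}

# Example: three witnesses, two report "1.1m", one reports "1.5m".
print(value_distribution(["1.1m", "1.1m", "1.5m"], [0.9, 0.8, 0.7], n=10))
```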

We previously defined the accuracy $A_i$ of a witness $w_i$ as the probability that $w_i$ reports the correct value. Now, given the set of $m$ objects for which the witness $w_i$ reports its observations $o_i^j$, $j = 1 \dots m$, and the corresponding probability distributions $P_j(\dot{x} \mid \bigcap_{q=1}^{k} o_q^j)$, computed from the observations of $k$ witnesses with equation (4), we estimate the accuracy of $w_i$ as the average of the probabilities associated with the values reported by $w_i$:

\[
A_i = \frac{1}{m} \sum_{j=1}^{m} P_j\Big(X = v(o_i^j) \,\Big|\, \bigcap_{q=1}^{k} o_q^j\Big) \tag{6}
\]

where $v(o_i^j)$ is the value of the observation reported by $w_i$ for the object $j$. Our algorithm [3] initializes the accuracies of the witnesses to a constant value; then it starts the iteration that computes the probability distribution for the value of every object by using equations (4) and (5), and the accuracies of the witnesses by equation (6).

4 Witnesses Dependencies over Many Properties

We now introduce an extension of the approach developed in [6] for the analysis of dependence among witnesses, which removes the independent witnesses assumption. The kind of dependence that we study is due to the presence of copiers: they create artificial consensus which might lead to misleading conclusions. As we consider witnesses that provide several properties for each object, we model the provided values by means of tuples. We assume that a copier either copies a whole tuple from another witness or it does not copy any property at all (no-record-linkage assumption). In other words, we assume that a copier is not able to compose one of its tuples by taking values of distinct properties from different sources. Otherwise, note that a record-linkage step would be needed to perform this operation, and it would be more appropriate to consider it an integration task rather than a copying operation. As in [6], we assume that the dependency between a pair of witnesses is independent of the dependency between any other pair of witnesses; a copier may provide a copied tuple with a-priori probability $0 \le c \le 1$, and it may provide some tuples independently of the other witnesses with a-priori probability $1 - c$ (independent copying assumption). Under these assumptions, the evidence of copying can greatly improve by considering several properties, since it is much less likely that multiple values provided by two witnesses for the same object coincide by chance.

4.1 Modeling Witnesses Dependence

In order to deal with the presence of copiers, in the following we exploit the approach presented in [6] to modify the equations obtained under the independency assumption.
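As a baseline for the copier-aware modifications that follow, the whole independent-witnesses iteration of Section 3 might be sketched as follows (again illustrative, not the authors' code; it reuses the value_distribution function sketched above):

```python
def reconcile(observations, n, init_accuracy=0.8, eps=1e-4, max_iter=100):
    """observations[w][obj] is the value witness w reports for object obj.
    Alternates consensus analysis (Eqs. 4-5) and accuracy analysis (Eq. 6)
    until the witness accuracies stop changing significantly."""
    witnesses = list(observations)
    objects = {obj for w in witnesses for obj in observations[w]}
    acc = {w: init_accuracy for w in witnesses}  # constant initialization
    for _ in range(max_iter):
        # Step 1: consensus analysis, one distribution per object.
        dist = {}
        for obj in objects:
            ws = [w for w in witnesses if obj in observations[w]]
            dist[obj] = value_distribution(
                [observations[w][obj] for w in ws], [acc[w] for w in ws], n)
        # Step 2: accuracy analysis, Eq. (6): average probability of the
        # values each witness reported.
        new_acc = {}
        for w in witnesses:
            probs = [dist[obj][observations[w][obj]] for obj in observations[w]]
            new_acc[w] = sum(probs) / len(probs)
        if max(abs(new_acc[w] - acc[w]) for w in witnesses) < eps:
            return dist, new_acc
        acc = new_acc
    return dist, acc
```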

Let $W_o(x)$ be the set of witnesses providing the value $x$ on an object, and let $W_o$ be the set of witnesses providing any value on the same object. Equation (3) can be rewritten as follows:

\[
P\Big(\bigcap_{i=1}^{k} o_i \,\Big|\, \dot{x}\Big)
= \prod_{w \in W_o(x)} A_w \prod_{w \in W_o - W_o(x)} \frac{1 - A_w}{n}
= \prod_{w \in W_o(x)} \frac{n\,A_w}{1 - A_w} \prod_{w \in W_o} \frac{1 - A_w}{n} \tag{7}
\]

Among all the possible values $x_1, \dots, x_{n+1}$, assuming as before a uniform a-priori probability $\frac{1}{n+1}$ for each value, equation (2) can be rewritten as follows:

\[
P\Big(\bigcap_{i=1}^{k} o_i\Big)
= \sum_{j=1}^{n+1} P\Big(\bigcap_{i=1}^{k} o_i \,\Big|\, \dot{x}_j\Big)\, P(\dot{x}_j)
= \frac{1}{n+1} \sum_{j=1}^{n+1} \prod_{w \in W_o(x_j)} \frac{n\,A_w}{1 - A_w} \prod_{w \in W_o} \frac{1 - A_w}{n} \tag{8}
\]

The probability that a particular value is true given the observations (equation (4)), denoted $P(x)$, can be rewritten as follows:

\[
P(x) = P\Big(\dot{x} \,\Big|\, \bigcap_{i=1}^{k} o_i\Big)
= \frac{P\big(\bigcap_{i=1}^{k} o_i \mid \dot{x}\big)\,\frac{1}{n+1}}{P\big(\bigcap_{i=1}^{k} o_i\big)}
= \frac{\prod_{w \in W_o(x)} \frac{n\,A_w}{1 - A_w}}{\sum_{j=1}^{n+1} \prod_{w \in W_o(x_j)} \frac{n\,A_w}{1 - A_w}}
\]

The denominator is a normalization factor: it is independent of $W_o(x)$, and it will be denoted $\omega$ to simplify the notation. To take the witnesses' dependencies into account, it is convenient to rewrite $P(x) = \frac{e^{C(x)}}{\omega}$, where $C(x)$ is the confidence of $x$, which is basically the probability expressed on a logarithmic scale:

\[
C(x) = \ln P(x) + \ln \omega = \sum_{w \in W_o(x)} \ln \frac{n\,A_w}{1 - A_w}
\]

If we define the accuracy score of a witness $w$ as:

\[
A'_w = \ln \frac{n\,A_w}{1 - A_w}
\]

it follows that we can express the confidence of a value $x$ as the sum of the accuracy scores of the witnesses that provide that value, for independent witnesses:

\[
C(x) = \sum_{w \in W_o(x)} A'_w
\]

We can now drop the independent witnesses assumption; to take into account the presence of copiers, the confidence is computed as the weighted sum of the accuracy scores:

\[
C(x) = \sum_{w \in W_o(x)} A'_w \, I_w
\]

where the weight $I_w$ is a number between 0 and 1 that we call the probability of independent opinion of the witness $w$. It essentially represents which portion of the opinion of $w$ is expressed independently of the other witnesses. For a perfect copier $I_w$ equals 0, whereas for a perfectly independent witness $I_w$ equals 1. $I_w$ can be expressed as the probability that a value provided by $w$ is not copied from any other witness:

\[
I_w = \prod_{w' \neq w} \big(1 - c \cdot P(w \rightarrow w')\big)
\]

where $P(w \rightarrow w')$ is the probability that $w$ is a copier of $w'$, and $c$ is the a-priori probability that a copier actually copies the provided value. Next, we discuss how to compute a reasonable value of $P(w \rightarrow w')$ for a pair of witnesses.

4.2 Dealing with Many Properties

In [6] a technique is illustrated to compute the probability $P(w_1 \rightarrow w_2)$ that $w_1$ is a copier of $w_2$, and the probability $P(w_1 \perp w_2)$ that $w_1$ is independent of $w_2$, starting from the observations of the values provided by the two witnesses for one given property. Intuitively, the dependence between two witnesses $w_1$ and $w_2$ can be detected by analyzing for which objects they provide the same values, and the overall consensus on those values. Indeed, whenever two witnesses provide the same value for an object and the provided value is false, this is evidence that the two witnesses are copying each other. Much less evidence arises when the two share a common true value for that object: those values could be shared just because both witnesses are accurate, as well as independent.

We consider three probabilities, $P(w_1 \perp w_2)$, $P(w_1 \rightarrow w_2)$, $P(w_2 \rightarrow w_1)$, corresponding to a partition of the space of events of the dependencies between two witnesses $w_1$ and $w_2$: either they are dependent or they are independent; if they are dependent, either $w_1$ copies from $w_2$ or $w_2$ copies from $w_1$.

\[
P(w_1 \perp w_2 \mid \Phi) = \frac{P(\Phi \mid w_1 \perp w_2)\,P(w_1 \perp w_2)}{P(\Phi \mid w_1 \perp w_2)\,P(w_1 \perp w_2) + P(\Phi \mid w_1 \rightarrow w_2)\,P(w_1 \rightarrow w_2) + P(\Phi \mid w_2 \rightarrow w_1)\,P(w_2 \rightarrow w_1)}
\]

Here $\Phi$ corresponds to $\bigcap_{i=1}^{k} o_i$, i.e. the observations of the values provided by the $k$ witnesses; namely, $o_i$ corresponds to the observation of the tuples provided by the witness $w_i$ on the object. The a-priori knowledge of the witnesses' dependencies can be modeled by considering a parameter $0 < \alpha < 1$, and then setting the a-priori probability $P(w_1 \perp w_2)$ to $\alpha$; $P(w_1 \rightarrow w_2)$ and $P(w_2 \rightarrow w_1)$ are both set to $\frac{1 - \alpha}{2}$.²

² A similar discussion for $P(w_1 \rightarrow w_2 \mid \Phi)$ and $P(w_2 \rightarrow w_1 \mid \Phi)$ is omitted for space reasons.
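A compact sketch of this machinery (illustrative names, not the authors' code): the accuracy score $A'_w$, the independence weight $I_w$, the copier-aware confidence $C(x)$, and the Bayesian update for the pairwise dependency with priors $\alpha$ and $\frac{1-\alpha}{2}$:

```python
import math

def accuracy_score(a, n):
    # A'_w = ln(n * A_w / (1 - A_w))
    return math.log(n * a / (1.0 - a))

def independence_weight(w, copy_prob, c):
    """copy_prob[(w1, w2)] holds P(w1 -> w2); c is the a-priori
    probability that a copier actually copies a value."""
    iw = 1.0
    for (w1, w2), p in copy_prob.items():
        if w1 == w:
            iw *= 1.0 - c * p
    return iw

def confidence(witnesses_reporting_x, acc, copy_prob, n, c):
    # C(x) = sum over W_o(x) of A'_w * I_w
    return sum(accuracy_score(acc[w], n) * independence_weight(w, copy_prob, c)
               for w in witnesses_reporting_x)

def dependency_posteriors(l_indep, l_12, l_21, alpha):
    """Bayesian update: l_indep = P(Phi | w1 indep. of w2),
    l_12 = P(Phi | w1 -> w2), l_21 = P(Phi | w2 -> w1).
    Returns the three posterior probabilities, in the same order."""
    prior_dep = (1.0 - alpha) / 2.0
    den = l_indep * alpha + l_12 * prior_dep + l_21 * prior_dep
    return l_indep * alpha / den, l_12 * prior_dep / den, l_21 * prior_dep / den
```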

The probabilities $P(\Phi \mid w_1 \perp w_2)$, $P(\Phi \mid w_1 \rightarrow w_2)$, $P(\Phi \mid w_2 \rightarrow w_1)$ can be computed with the help of the independent values assumption: the values independently provided by a witness on different objects are independent of each other. For the sake of simplicity, here we detail how to compute, given the assumptions above and considering our generative model of witnesses, $P(\Phi \mid w_1 \perp w_2)$, i.e. the probability that two independent witnesses $w_1$ and $w_2$ provide a certain observation $\Phi$, in the case of two properties, denoted $A$ and $B$, for which they respectively exhibit error rates³ $\epsilon_1^A$, $\epsilon_1^B$, $\epsilon_2^A$, $\epsilon_2^B$. A similar development is possible in the case of witnesses providing more than two properties.

Given the set of objects $O$ for which both $w_1$ and $w_2$ provide values for properties $A$ and $B$, it is convenient to partition $O$ into the subsets $O_{tt} \cup O_{tf} \cup O_{ft} \cup O_{ff} \cup O_d = O$. For objects in $O_{tt} \cup O_{tf} \cup O_{ft} \cup O_{ff}$, $w_1$ and $w_2$ provide the same values for properties $A$ and $B$, whereas for objects in $O_d$ the two witnesses provide different values for at least one property. In the case of objects in $O_{tt}$, the witnesses agree on the true values for both properties; for objects in $O_{tf}$ they agree on the true value of $A$ and on the same false value of $B$; similarly, for $O_{ft}$ they agree on the same false value of $A$ and on the true value of $B$; finally, in the case of $O_{ff}$ they agree on the same false values for both properties.

We first consider the case of both witnesses independently providing the same values for $A$ and $B$, with these values being either both true or both false. According to the independent properties assumption, $w_i$ provides the pair of true values for $A$ and $B$ with probability $(1 - \epsilon_i^A)(1 - \epsilon_i^B)$, and a particular pair of false values with probability $\frac{\epsilon_i^A \epsilon_i^B}{n_A n_B}$, with $n_A$ (respectively $n_B$) being the number of possible false values for the property $A$ (resp. $B$). Given that the witnesses are independent, and that there are $n_A \cdot n_B$ possible pairs of false values on which the two witnesses may agree, we can write:

\[
P(o \in O_{tt} \mid w_1 \perp w_2) = (1 - \epsilon_1^A)(1 - \epsilon_2^A)(1 - \epsilon_1^B)(1 - \epsilon_2^B) = P_{tt}
\]
\[
P(o \in O_{ff} \mid w_1 \perp w_2) = \frac{\epsilon_1^A \epsilon_2^A \epsilon_1^B \epsilon_2^B}{n_A n_B} = P_{ff}
\]

A witness $w_i$ independently provides a true value for $A$ and a particular false value for $B$ with probability $(1 - \epsilon_i^A)\frac{\epsilon_i^B}{n_B}$; similarly for $P(o \in O_{ft} \mid w_1 \perp w_2)$:

\[
P(o \in O_{tf} \mid w_1 \perp w_2) = (1 - \epsilon_1^A)(1 - \epsilon_2^A)\frac{\epsilon_1^B \epsilon_2^B}{n_B} = P_{tf}
\]
\[
P(o \in O_{ft} \mid w_1 \perp w_2) = (1 - \epsilon_1^B)(1 - \epsilon_2^B)\frac{\epsilon_1^A \epsilon_2^A}{n_A} = P_{ft}
\]

All the remaining cases are in $O_d$:

\[
P(o \in O_d \mid w_1 \perp w_2) = 1 - P_{tt} - P_{tf} - P_{ft} - P_{ff} = P_d
\]

³ The error rate $\epsilon$ of a witness with respect to a property is the complement to 1 of its accuracy $A$ with respect to the same property: $\epsilon = 1 - A$.
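These five per-object probabilities translate directly into code; the sketch below (illustrative names, following the formulas just derived) computes them for two independent witnesses:

```python
def independent_category_probs(e1A, e1B, e2A, e2B, nA, nB):
    """e_i^X is the error rate of witness i on property X; nA and nB are
    the numbers of possible false values for properties A and B."""
    p_tt = (1 - e1A) * (1 - e2A) * (1 - e1B) * (1 - e2B)
    p_ff = (e1A * e2A * e1B * e2B) / (nA * nB)
    p_tf = (1 - e1A) * (1 - e2A) * (e1B * e2B) / nB
    p_ft = (1 - e1B) * (1 - e2B) * (e1A * e2A) / nA
    p_d = 1 - p_tt - p_tf - p_ft - p_ff  # everything else disagrees
    return {'tt': p_tt, 'tf': p_tf, 'ft': p_ft, 'ff': p_ff, 'd': p_d}
```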

The independent values assumption allows us to obtain $P(\Phi \mid w_1 \perp w_2)$ by multiplying the probabilities and appropriately considering the cardinalities of the corresponding subsets of $O$:

\[
P(\Phi \mid w_1 \perp w_2) = P_{tt}^{|O_{tt}|} \cdot P_{tf}^{|O_{tf}|} \cdot P_{ft}^{|O_{ft}|} \cdot P_{ff}^{|O_{ff}|} \cdot P_d^{|O_d|}
\]

Now we detail how to compute $P(\Phi \mid w_1 \rightarrow w_2)$; we omit $P(\Phi \mid w_2 \rightarrow w_1)$ since it can be obtained similarly. Recall that, according to our model of copier witnesses, a copier provides a tuple independently with a-priori probability $1 - c$. In this case, we can reuse the probabilities $P_{tt}$, $P_{ff}$, $P_{tf}$, $P_{ft}$, $P_d$ obtained above for independent witnesses, with weight $1 - c$. However, with a-priori probability $c$, a copier witness $w_1$ provides a tuple copied from the witness $w_2$, and hence generated according to the same probability distribution function of $w_2$. For instance, $w_2$ would generate a pair of true values with probability $(1 - \epsilon_2^A)(1 - \epsilon_2^B)$. Concluding:

\[
P(o \in O_{tt} \mid w_1 \rightarrow w_2) = (1 - \epsilon_2^A)(1 - \epsilon_2^B)\,c + P_{tt}\,(1 - c)
\]
\[
P(o \in O_{ff} \mid w_1 \rightarrow w_2) = \epsilon_2^A \epsilon_2^B\,c + P_{ff}\,(1 - c)
\]
\[
P(o \in O_{tf} \mid w_1 \rightarrow w_2) = (1 - \epsilon_2^A)\,\epsilon_2^B\,c + P_{tf}\,(1 - c)
\]
\[
P(o \in O_{ft} \mid w_1 \rightarrow w_2) = (1 - \epsilon_2^B)\,\epsilon_2^A\,c + P_{ft}\,(1 - c)
\]

For the remaining cases, we have to consider that since the witnesses are providing different values for the same object, it cannot be the case that one is copying the other:

\[
P(o \in O_d \mid w_1 \rightarrow w_2) = (1 - P_{tt} - P_{tf} - P_{ft} - P_{ff})(1 - c)
\]

Again, the independent values assumption allows us to obtain $P(\Phi \mid w_1 \rightarrow w_2)$ by multiplying these probabilities raised to the cardinality of the corresponding subset of $O$.

5 Experiments

We now describe the settings and the data we used for the experimental evaluation of the proposed approach. We conducted two sets of experiments. The first set of experiments was done with generated synthetic data, while the second set was performed with real-world data extracted from the Web. For the following experiments we set $\alpha = 0.2$ and a fixed value of $c$.

5.1 Synthetic Scenarios

The goal of the experiments with synthetic data was to analyze how the algorithms perform with sources of different quality. We conducted two sets of experiments, EX1 and EX2, to study the performance of the approach with different configurations, as summarized in Figure 2. In the two sets there are three possible types of sources: authorities, which provide true values for every object and every attribute; independents, which make mistakes according to the source accuracy A; and copiers, which copy according to a copying rate r from the independents, and make mistakes according to the source accuracy A when they report values independently.

Fig. 2. Configurations for the synthetic scenarios (number of authorities, independents, and copiers, and source accuracy A, for EX1 and EX2).

The experiments aim at studying the influence of the varying source accuracies and of the presence of an authority source (notice that the authority is not given: the goal of the experiments is to detect it). In all the experiments we generated sources with N = 100 objects, each described by a tuple with 5 attributes with values for all the objects; the copiers copy from an independent source with a frequency r = 0.8. In all the scenarios each copier copies from three independents, with the following probabilities: 0.3, 0.3, 0.4. In order to evaluate the influence of complex data, for each of these configurations we varied the number of attributes given as input to the algorithm, with three combinations: 1, 3, and 5 attributes. We remark that our implementation coincides with the current state of the art when only one attribute is considered [6]. To highlight the effectiveness of our algorithm, we also compared our solution with a naive approach, in which the probability distribution is computed with a simple voting strategy, ignoring the accuracy of the sources. To evaluate the performance of the algorithms we report the Precision, i.e. the fraction of objects for which we select the true values, considering as candidate true values the ones with the highest probability.

Results. The results of our experiments on the synthetic scenarios are illustrated in Figure 3. For each set of experiments we randomly generated the datasets and applied the algorithms 100 times. We report a graph with the average Precision for the naive execution and for the three different executions of our approach: MultiAtt(1) with only one attribute given as input, MultiAtt(3) with three attributes, and MultiAtt(5) with five. From the two sets it is apparent that the executions with multiple attributes always outperform the naive execution and the one considering only one attribute. In the first set, EX1, MultiAtt(3) and MultiAtt(5) present some benefits compared to previous solutions, but are not able to obtain excellent precision in the presence of high error rates. This is not surprising: even if MultiAtt(3) and MultiAtt(5) are able to identify perfectly which sources are copiers, there are 8 independent sources reporting true values with a very low frequency, and for most of the objects the evidence needed to compute the true values is missing. The scenario radically changes in EX2, where an authority exists and MultiAtt(5) is able to return all the correct values even in the worst case, while MultiAtt(3) and MultiAtt(1) start significantly mixing dependencies at 0.8 and 0.5 error rates, respectively.
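Under the configuration stated above (100 objects, 5 attributes, r = 0.8, copiers drawing whole tuples from three independents with probabilities 0.3/0.3/0.4), the synthetic-source generator might be sketched as follows; all helper names are illustrative, not taken from the paper:

```python
import random

def make_truth(n_objects=100, n_attrs=5):
    # the hidden true tuple for every object
    return {o: [random.random() for _ in range(n_attrs)]
            for o in range(n_objects)}

def independent_source(truth, accuracy):
    # each attribute value is correct with probability A, otherwise perturbed
    return {o: [v if random.random() < accuracy else random.random()
                for v in t]
            for o, t in truth.items()}

def copier_source(truth, accuracy, parents, weights=(0.3, 0.3, 0.4), r=0.8):
    src = independent_source(truth, accuracy)
    for o in truth:
        if random.random() < r:
            # copy the whole tuple from one parent (no-record-linkage)
            parent = random.choices(parents, weights=weights, k=1)[0]
            src[o] = list(parent[o])
    return src
```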

Fig. 3. Synthetic experiments: MultiAtt(5) outperforms alternative configurations in all scenarios.

It is worth remarking that our algorithm does not introduce regressions with respect to previous solutions. In fact, we have been able to run all the synthetic examples in [6], obtaining the same results with all the configurations of MultiAtt. This can be explained by observing that in those examples the number of copiers is smaller than the number of independent sources, and MultiAtt(1) suffices for computing correctly all the dependencies. In the following, we show that real data are significantly affected by the presence of copiers, but there are cases where considering only one attribute does not suffice to find the correct dependencies between sources.

5.2 Real-World Web Data

We used collections of data extracted from web sites about NASDAQ stock quotes.⁴ All the extraction rules were checked manually, and the pages were downloaded on November 19th.⁵ The settings for the real-world experiments are reported in Figure 4, which shows the list of attributes we studied. Among hundreds of available stock quotes we chose the subset that maximizes the inconsistency between sources. It is worth observing that in this domain an authority exists: the official NASDAQ website. We ran our algorithm over the available data and we evaluated the results considering the data published by that source as the truth. The experiments were executed on a FreeBSD machine with an Intel Core Duo 2.16GHz CPU and 2GB memory.

To test the effectiveness of our approach we executed the algorithm considering one attribute at a time, considering all the 10 possible configurations of three attributes, and, finally, considering five attributes at the same time. Figure 5.a reports the average of the precisions obtained over the five attributes by these configurations. The worst average precision (0.39) is obtained considering only one attribute at a time: this is due to the lack of clear majorities in the setting and the consequent difficulty in the discovery of the dependencies.

⁴ We relied on Flint, a system for the automatic extraction of web data [2].
⁵ Since financial data change during the trading sessions, we downloaded the pages while the markets were closed.

Fig. 4. Settings for the real-world experiments: for each attribute (last price, open price, week high, week low, volume), the number of sites, the percentage of null values, and the numbers of symbols and objects.

We obtained interesting results considering the configurations of three attributes. In fact, it turned out that some configurations perform significantly better than others. This is not surprising, since the quality of the data exposed by an attribute can be more or less useful in the computation of the dependencies: for example, an attribute does not provide information to identify copiers if either all the sources provide the correct values or all the sources provide different values. However, it is encouraging to notice that considering all the five attributes we obtained a good precision. This shows that even if there exist attributes that do not contribute positively or that provide misleading information, their impact can be absorbed if they are considered together with the good ones.

Fig. 5. Real-world experiments results.

Figure 5.b reports the average precision scores for the three configurations compared with their execution times (the average, in the cases with one and three attributes). It can be observed that the execution times increase linearly with the number of attributes involved in the computation, with a maximum of 16 minutes for the configuration with five attributes.

6 Conclusions and Future Work

We developed an extension of existing solutions for reconciling conflicting data from inaccurate sources. Our extension takes into account complex data, i.e. tuples, instead of atomic values. Our work shows that the dependence analysis is the point at which the state-of-the-art models can be extended to analyze several properties at a time. Experiments showed that our extension can greatly affect the overall results. We are currently studying further developments of the model. First, we are investigating an extension of the model beyond the uniform distribution assumption. Second, we are studying more complex forms of dependencies, such as those implied by integration processes that include a record-linkage step.

References

1. Berti-Equille, L., Sarma, A.D., Dong, X., Marian, A., Srivastava, D.: Sailing the information ocean with awareness of currents: Discovery and application of source dependence. In: CIDR (2009)
2. Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: Flint: Google-basing the web. In: EDBT (2008)
3. Blanco, L., Crescenzi, V., Merialdo, P., Papotti, P.: A probabilistic model to characterize the uncertainty of web data integration: What sources have the good data? Technical report TR146, DIA - Roma Tre (June 2009)
4. Cafarella, M.J., Etzioni, O., Suciu, D.: Structured queries over web text. IEEE Data Eng. Bull. 29(4) (2006)
5. Dalvi, N.N., Suciu, D.: Management of probabilistic data: foundations and challenges. In: PODS (2007)
6. Dong, X.L., Berti-Equille, L., Srivastava, D.: Integrating conflicting data: The role of source dependence. PVLDB 2(1) (2009)
7. Dong, X.L., Berti-Equille, L., Srivastava, D.: Truth discovery and copying detection in a dynamic world. PVLDB 2(1) (2009)
8. Downey, D., Etzioni, O., Soderland, S.: A probabilistic model of redundancy in information extraction. In: IJCAI (2005)
9. Florescu, D., Koller, D., Levy, A.Y.: Using probabilistic information in data integration. In: VLDB (1997)
10. Galland, A., Abiteboul, S., Marian, A., Senellart, P.: Corroborating information from disagreeing views. In: Proc. WSDM, New York, USA (2010)
11. Wu, M., Marian, A.: Corroborating answers from multiple web sources. In: WebDB (2007)
12. Yin, X., Han, J., Yu, P.S.: Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 20(6) (2008)


More information

Advances in promotional modelling and analytics

Advances in promotional modelling and analytics Advances in promotional modelling and analytics High School of Economics St. Petersburg 25 May 2016 Nikolaos Kourentzes n.kourentzes@lancaster.ac.uk O u t l i n e 1. What is forecasting? 2. Forecasting,

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

Distributed Clustering and Local Regression for Knowledge Discovery in Multiple Spatial Databases

Distributed Clustering and Local Regression for Knowledge Discovery in Multiple Spatial Databases Distributed Clustering and Local Regression for Knowledge Discovery in Multiple Spatial Databases Aleksandar Lazarevic, Dragoljub Pokrajac, Zoran Obradovic School of Electrical Engineering and Computer

More information

Quantum Computation via Sparse Distributed Representation

Quantum Computation via Sparse Distributed Representation 1 Quantum Computation via Sparse Distributed Representation Gerard J. Rinkus* ABSTRACT Quantum superposition states that any physical system simultaneously exists in all of its possible states, the number

More information

Notes on Markov Networks

Notes on Markov Networks Notes on Markov Networks Lili Mou moull12@sei.pku.edu.cn December, 2014 This note covers basic topics in Markov networks. We mainly talk about the formal definition, Gibbs sampling for inference, and maximum

More information

Bayesian Learning Features of Bayesian learning methods:

Bayesian Learning Features of Bayesian learning methods: Bayesian Learning Features of Bayesian learning methods: Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct. This provides a more

More information

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems Daniel Meyer-Delius 1, Christian Plagemann 1, Georg von Wichert 2, Wendelin Feiten 2, Gisbert Lawitzky 2, and

More information

Short Note: Naive Bayes Classifiers and Permanence of Ratios

Short Note: Naive Bayes Classifiers and Permanence of Ratios Short Note: Naive Bayes Classifiers and Permanence of Ratios Julián M. Ortiz (jmo1@ualberta.ca) Department of Civil & Environmental Engineering University of Alberta Abstract The assumption of permanence

More information

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan Bayesian Learning CSL603 - Fall 2017 Narayanan C Krishnan ckn@iitrpr.ac.in Outline Bayes Theorem MAP Learners Bayes optimal classifier Naïve Bayes classifier Example text classification Bayesian networks

More information

An Experimental Evaluation of Passage-Based Process Discovery

An Experimental Evaluation of Passage-Based Process Discovery An Experimental Evaluation of Passage-Based Process Discovery H.M.W. Verbeek and W.M.P. van der Aalst Technische Universiteit Eindhoven Department of Mathematics and Computer Science P.O. Box 513, 5600

More information

Measurements and Data Analysis

Measurements and Data Analysis Measurements and Data Analysis 1 Introduction The central point in experimental physical science is the measurement of physical quantities. Experience has shown that all measurements, no matter how carefully

More information

Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers

Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers Giuliano Armano, Francesca Fanni and Alessandro Giuliani Dept. of Electrical and Electronic Engineering, University

More information

Classification of Ordinal Data Using Neural Networks

Classification of Ordinal Data Using Neural Networks Classification of Ordinal Data Using Neural Networks Joaquim Pinto da Costa and Jaime S. Cardoso 2 Faculdade Ciências Universidade Porto, Porto, Portugal jpcosta@fc.up.pt 2 Faculdade Engenharia Universidade

More information

Cell-Probe Proofs and Nondeterministic Cell-Probe Complexity

Cell-Probe Proofs and Nondeterministic Cell-Probe Complexity Cell-obe oofs and Nondeterministic Cell-obe Complexity Yitong Yin Department of Computer Science, Yale University yitong.yin@yale.edu. Abstract. We study the nondeterministic cell-probe complexity of static

More information

Iterative Laplacian Score for Feature Selection

Iterative Laplacian Score for Feature Selection Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,

More information

Types of spatial data. The Nature of Geographic Data. Types of spatial data. Spatial Autocorrelation. Continuous spatial data: geostatistics

Types of spatial data. The Nature of Geographic Data. Types of spatial data. Spatial Autocorrelation. Continuous spatial data: geostatistics The Nature of Geographic Data Types of spatial data Continuous spatial data: geostatistics Samples may be taken at intervals, but the spatial process is continuous e.g. soil quality Discrete data Irregular:

More information

Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features

Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features Eun Yong Kang Department

More information

Fast and Accurate Causal Inference from Time Series Data

Fast and Accurate Causal Inference from Time Series Data Fast and Accurate Causal Inference from Time Series Data Yuxiao Huang and Samantha Kleinberg Stevens Institute of Technology Hoboken, NJ {yuxiao.huang, samantha.kleinberg}@stevens.edu Abstract Causal inference

More information

Deceptive Advertising with Rational Buyers

Deceptive Advertising with Rational Buyers Deceptive Advertising with Rational Buyers September 6, 016 ONLINE APPENDIX In this Appendix we present in full additional results and extensions which are only mentioned in the paper. In the exposition

More information

Robust methodology for TTS enhancement evaluation

Robust methodology for TTS enhancement evaluation Robust methodology for TTS enhancement evaluation Daniel Tihelka, Martin Grůber, and Zdeněk Hanzlíček University of West Bohemia, Faculty of Applied Sciences, Dept. of Cybernetics Univerzitni 8, 36 4 Plzeň,

More information

Chapter Three. Hypothesis Testing

Chapter Three. Hypothesis Testing 3.1 Introduction The final phase of analyzing data is to make a decision concerning a set of choices or options. Should I invest in stocks or bonds? Should a new product be marketed? Are my products being

More information

Encoding formulas with partially constrained weights in a possibilistic-like many-sorted propositional logic

Encoding formulas with partially constrained weights in a possibilistic-like many-sorted propositional logic Encoding formulas with partially constrained weights in a possibilistic-like many-sorted propositional logic Salem Benferhat CRIL-CNRS, Université d Artois rue Jean Souvraz 62307 Lens Cedex France benferhat@criluniv-artoisfr

More information

Probabilistic Field Mapping for Product Search

Probabilistic Field Mapping for Product Search Probabilistic Field Mapping for Product Search Aman Berhane Ghirmatsion and Krisztian Balog University of Stavanger, Stavanger, Norway ab.ghirmatsion@stud.uis.no, krisztian.balog@uis.no, Abstract. This

More information

Algebra Exam. Solutions and Grading Guide

Algebra Exam. Solutions and Grading Guide Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full

More information

Lecture 4. 1 Learning Non-Linear Classifiers. 2 The Kernel Trick. CS-621 Theory Gems September 27, 2012

Lecture 4. 1 Learning Non-Linear Classifiers. 2 The Kernel Trick. CS-621 Theory Gems September 27, 2012 CS-62 Theory Gems September 27, 22 Lecture 4 Lecturer: Aleksander Mądry Scribes: Alhussein Fawzi Learning Non-Linear Classifiers In the previous lectures, we have focused on finding linear classifiers,

More information

Improved Algorithms for Module Extraction and Atomic Decomposition

Improved Algorithms for Module Extraction and Atomic Decomposition Improved Algorithms for Module Extraction and Atomic Decomposition Dmitry Tsarkov tsarkov@cs.man.ac.uk School of Computer Science The University of Manchester Manchester, UK Abstract. In recent years modules

More information

MATHEMATICS Paper 980/11 Paper 11 General comments It is pleasing to record improvement in some of the areas mentioned in last year s report. For example, although there were still some candidates who

More information

Recoverabilty Conditions for Rankings Under Partial Information

Recoverabilty Conditions for Rankings Under Partial Information Recoverabilty Conditions for Rankings Under Partial Information Srikanth Jagabathula Devavrat Shah Abstract We consider the problem of exact recovery of a function, defined on the space of permutations

More information

LECSS Physics 11 Introduction to Physics and Math Methods 1 Revised 8 September 2013 Don Bloomfield

LECSS Physics 11 Introduction to Physics and Math Methods 1 Revised 8 September 2013 Don Bloomfield LECSS Physics 11 Introduction to Physics and Math Methods 1 Physics 11 Introduction to Physics and Math Methods In this introduction, you will get a more in-depth overview of what Physics is, as well as

More information

arxiv: v3 [cs.lg] 25 Aug 2017

arxiv: v3 [cs.lg] 25 Aug 2017 Achieving Budget-optimality with Adaptive Schemes in Crowdsourcing Ashish Khetan and Sewoong Oh arxiv:602.0348v3 [cs.lg] 25 Aug 207 Abstract Crowdsourcing platforms provide marketplaces where task requesters

More information

Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions

Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions Matthew J. Streeter Computer Science Department and Center for the Neural Basis of Cognition Carnegie Mellon University

More information

10. Time series regression and forecasting

10. Time series regression and forecasting 10. Time series regression and forecasting Key feature of this section: Analysis of data on a single entity observed at multiple points in time (time series data) Typical research questions: What is the

More information