Probabilistic Similarity Query on Dimension Incomplete Data


2009 Ninth IEEE International Conference on Data Mining

Probabilistic Similarity Query on Dimension Incomplete Data

Wei Cheng, School of Software, Tsinghua University
Xiaoming Jin, School of Software, Tsinghua University
Jian-Tao Sun, Microsoft Research Asia

Abstract

Retrieving similar data has drawn many research efforts in the literature due to its importance in data mining, databases, and information retrieval. The problem is challenging when the data is incomplete. In previous research, data incompleteness refers to the fact that data values for some dimensions are unknown. However, in many practical applications (e.g., data collected by a sensor network in a harsh environment), not only data values but even data dimension information may be missing, which makes most similarity query algorithms infeasible. In this work, we propose the novel problem of similarity query on dimension incomplete data and adopt a probabilistic framework to model it. Users can give a distance threshold and a probability threshold to specify their retrieval requirements: the distance threshold specifies the allowed distance between query and data objects, and the probability threshold requires that the retrieval results satisfy the distance condition at least with the given probability. Instead of enumerating all possible cases to recover the missing dimensions, we propose an efficient approach that speeds up the retrieval process by leveraging the inherent relations between the query and dimension incomplete data objects. During the query process, we estimate lower/upper bounds of the probability that the query is satisfied by a given data object, and utilize these bounds to filter irrelevant data objects efficiently. Furthermore, a probability triangle inequality is proposed to further speed up query processing. Experiments on real data sets verify that the proposed similarity query method is effective and efficient on dimension incomplete data.

1 Introduction

Multidimensional data, such as time series and feature vectors extracted from images, are widely used in various applications. Similarity query on multidimensional data (i.e., retrieving similar data objects from a multidimensional database given a data object as an input query) has attracted much research interest, as it plays an important role in many data mining, database, and information retrieval tasks. This problem is challenging when the data is incomplete [1, 4, 23], which may be caused by various reasons. For example, in sensor network applications, the collected data may become incomplete when sensors do not work properly or when errors occur during the data transfer process. In the literature, the data incompleteness problem has been well researched (e.g., see [17, 5]). In these works, data incompleteness usually refers to missing values: data values for some dimensions are unknown (or uncertain), but it is known, for each dimension, whether the corresponding data value is missing or not. However, in practice, it is quite common that we do not know which dimensions (or positions) have data loss [9]. In other words, the dimensionality of the collected data is lower than its actual dimensionality, and we lose the correspondence between data dimensions and their associated values. This is regarded as dimension incompleteness in this work.
Take a sensor network application as an example. The database usually contains time series data objects, each of which is represented by a sequence of values x_1, x_2, ..., x_n. The dimension information (e.g., the time stamp) associated with data elements is implicitly inferred from the order of data arrival. This schema of data collection and storage is very common in resource-constrained applications, since explicitly maintaining dimension information causes additional costs. Therefore, missing even a single data element destroys the dimension information of the entire data object. Besides, in some applications where dimension information is explicitly maintained, the dimension indicator itself may also be lost, which causes the data to become dimension incomplete. More generally, for data sets containing time series of various lengths, it can be assumed that the data were originally generated with the same dimensionality and then some dimensions were lost. So dimension incomplete data is quite common in practical applications.
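To make the loss of correspondence concrete, the following small Python sketch (our own illustration, not part of the original paper; the variable names are hypothetical) enumerates the candidate position assignments for an observed series once dimension information is gone.

import itertools
import math

# An observed series of n = 3 values whose true dimensionality is m = 5:
# we no longer know which 3 of the 5 original positions they occupied.
observed = (2.0, 9.0, 40.0)
m = 5

# Every size-3 subset of {0, ..., 4} is a possible position assignment,
# so there are C(5, 3) = 10 candidate alignments to consider.
alignments = list(itertools.combinations(range(m), len(observed)))
print(len(alignments), math.comb(m, len(observed)))  # 10 10

# The count grows combinatorially with m and n, e.g. C(30, 27) = 4060
# candidate alignments per data object, which is why enumerating all
# possible recovery cases quickly becomes infeasible.
print(math.comb(30, 27))  # 4060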

The incompleteness of data dimension brings challenges to the similarity query task, as dimension information is essential for existing techniques used to handle uncertain data [6, 10]. Given a query object and a dimension incomplete data object, the similarity measurement between them is fundamental for the similarity query task but becomes impractical because their dimensions do not match. For instance, the widely used L_p-norm distance, L_p(x, y) = (Σ_{i=1}^{n} |x_i − y_i|^p)^{1/p}, cannot be calculated in this case because it does not allow shifting of data dimensions. Other similarity computation methods like DTW [13] and LCSS [3] are mainly proposed to better measure the similarity between two data objects without considering dimension incompleteness. They cannot be used to deal with dimension incomplete data, as their underlying matching strategies are not designed to capture the characteristics of data in the dimension incomplete scenario. One straightforward solution is to consider all possible dimension missing cases and calculate the similarity accordingly, but this procedure may become extremely time consuming. Assume the original dimensionality of the data objects and the query is m, but only n dimensions of a data object are observed (n < m); then there are C_m^n possible dimension combinations to be examined when the similarity is computed. Thus better solutions are needed for this problem.

In this paper, we model the problem of similarity query on dimension incomplete data with a probabilistic framework. Based on our framework, users can give a distance threshold to specify the allowed distance between the query and dimension incomplete data objects, and a probability confidence threshold to specify the requirement that the retrieval results should satisfy the distance condition at least with the given probability. Our query approach is based on the fact that the relationship between the query and dimension incomplete data objects can be inferred. Such relationship information provides helpful guidance for performing the similarity query task. An efficient method is proposed to find lower and upper bounds of the probability that a data object satisfies the query. These bounds can be used to (1) eliminate data objects that are judged as dismissals, and (2) keep qualified ones, in O(n(m − n)^2) time. Furthermore, based on the proposed probability triangle inequality, an approach with time complexity O(m) is introduced to further speed up the similarity query process. Our proposed method is proved to be theoretically correct. Experiments on two real data sets indicate that our method is promising for similarity query on dimension incomplete data.

2 Related work

The problem of missing data values has been well researched in the literature [5, 8, 18, 22]. In this research, the dimensions with missing data are known and their corresponding values are estimated [15]. There are also works that deal with data uncertainty (e.g., [20, 6, 7, 4, 21, 11, 19, 2]), which is related to but different from the dimension incompleteness problem. These works consider the uncertainty of data values and estimate a probability density function (pdf) to model the uncertainty of the values of data elements. For example, it is addressed in [7] that the recorded value is likely to be different from the actual value. In such cases, queries may also be uncertain. These works differ from ours in that they only consider the uncertainty of data values, while we consider the uncertainty of dimension as well.
In [9], missing data elements in symbolic sequences are addressed. In this work, a more general and more challenging problem is studied: we consider real-valued multidimensional data and address the probabilistic query task, both of which differ essentially from [9]. The problem of similarity query on dimension incomplete data cannot be solved by algorithms like Dynamic Time Warping (DTW) [13, 12] or the Longest Common Subsequence distance (LCSS) [3]. DTW and LCSS are designed for a goal different from ours: they match two multidimensional data objects focusing on what they have in common, not on measuring the similarity between objects with missing dimension information.

3 Problem description and analysis

D = {X_1, X_2, ..., X_N} is a database containing multidimensional data. A data object X from D is a real-valued vector (x_1, x_2, ..., x_M), where x_m (1 ≤ m ≤ M) is the data value for the m-th dimension of X. |X| = M denotes the dimensionality of X. D is said to be incomplete if its data objects are allowed to have missing values or dimensions; otherwise, D is complete. In this work, a data object is regarded as dimension incomplete if (a) at least one of its data elements is missing, and (b) the dimensions of the missing data elements cannot be determined. For example, given a complete data object X, if k of its data elements are missing, the resulting dimension incomplete data is of the form X_obs = (x_{n_1}, x_{n_2}, ..., x_{n_{M'}}), where n_j < n_{j+1} and M' = |X| − k. A conventional range query in a multidimensional database is defined as follows: given a database D containing N data objects of M dimensions, an M-dimensional query Q, a query threshold r, and a distance function δ, retrieve all the data objects in D whose distance from Q is less than r:

RangeQuery_δ(D, Q, r) = {X ∈ D | δ(Q, X) < r}   (1)

Apparently, it is not practical to measure the exact similarity between a dimension incomplete data object and the query, because the associated dimensions cannot be well aligned.

Thus the similarity score is uncertain and depends on both the dimension alignment and the values estimated for the missing dimensions. In this work, we address the probabilistic similarity query problem on a dimension incomplete database: to retrieve data objects from the database with high probability of satisfying the input query. This problem can be formulated as follows:

Definition 1 (Probabilistic Similarity Query on Dimension Incomplete Data (PSQ-DID)). Given a database D containing dimension incomplete multidimensional data objects X_obs whose underlying complete versions are denoted by X, a query Q that is complete, a distance threshold r, a confidence threshold c, an imputation method ϕ indicating the distribution of the missing data values, and a distance function δ,

PSQ-DID_{δ,ϕ}(D, Q, r, c) = {X_obs ∈ D | P[δ(Q, X) < r] > c}   (2)

P[δ(Q, X) < r], termed confidence in this paper, indicates the probability that the underlying complete data object X satisfies the requirement of the query. Its calculation depends on both the imputation strategy ϕ and the distance function δ. Consider a dimension incomplete data object X_obs. Without knowing the corresponding complete data object X, we can construct a possible complete version of X_obs by

1. assigning a dimension combination {n_1, ..., n_{|X_mis|}} indicating on which dimensions the data elements are lost;
2. imputing the assigned dimensions according to the imputation strategy ϕ.

In this work, we use X_rv and X_mis to represent the recovery version (i.e., the constructed complete version) and the imputed part, respectively.

Example 1: Given X_obs = (2, 9, 40), assume the complete form of X_obs is known to be of dimensionality 5, and the dimension combination indicates that the data missing positions are {2, 4}. Then we know 2, 9, 40 correspond to the first, third, and fifth dimensions of X. If the specified imputation strategy is that the missing elements follow a certain distribution with given expectation and variance, then X_rv is a random vector (2, x_{i_1}, 9, x_{i_2}, 40) and X_mis = (x_{i_1}, x_{i_2}), where x_{i_1} and x_{i_2} are both random variables following the given distribution.

Obviously, there are C_{|X|}^{|X_mis|} possible dimension combinations for the missing data elements, each of which derives a recovery version X_rv. Since we have no prior knowledge about the dimension combination or the possible recovery result, we assume each recovery result is equally probable. Therefore, we have

P[δ(Q, X) < r] = Σ_{X_rv} P[δ(Q, X_rv) < r] / C_{|X|}^{|X_mis|}   (3)

In Equation 3, if X_rv is generated by imputing random variables that follow a given distribution, δ(Q, X_rv) is a new random variable and P[δ(Q, X_rv) < r] is a real value belonging to [0, 1].

We now discuss the imputation strategy ϕ and the distance function δ in more detail. Without loss of generality, we assume in this paper that all imputed random variables are mutually independent and follow a normal distribution. The mean of each random variable is decided according to the dimension to be imputed: if x_i in X is missing, we impute a random variable with expectation (x_{i−1} + x_{i+1})/2 (if x_{i−1} or x_{i+1} does not exist, the value of the nearest existing data element is used instead). For instance, if we impute data to X_obs = (1, 3, 5) on the first, third, and fourth dimensions of X to form a 6-dimensional data object, the recovery version of X will be (1, 1, 2, 2, 3, 5), where the values on the first, third, and fourth dimensions denote the specified mean values of the imputed random variables.
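The mean-setting policy can be stated directly in code. The following minimal Python sketch is our own reading of the policy (function and variable names are hypothetical), consistent with the worked example above: for each assumed missing dimension, the expectation is the average of the nearest observed elements on both sides, falling back to the single nearest one at the borders.

def imputation_means(x_obs, missing_dims, m):
    # Expectations for the imputed random variables: the mean of the
    # nearest observed neighbors on each side, or the single nearest
    # existing element at the borders.
    missing = set(missing_dims)
    obs_pos = [i for i in range(m) if i not in missing]
    values = dict(zip(obs_pos, x_obs))
    means = {}
    for i in sorted(missing):
        left = [p for p in obs_pos if p < i]
        right = [p for p in obs_pos if p > i]
        if left and right:
            means[i] = (values[left[-1]] + values[right[0]]) / 2.0
        else:  # border: use the nearest existing data element
            means[i] = values[right[0]] if right else values[left[-1]]
    return means

# The example above: X_obs = (1, 3, 5) imputed on dimensions 1, 3, 4
# (0-based positions 0, 2, 3) of a 6-dimensional object.
print(imputation_means((1, 3, 5), {0, 2, 3}, 6))
# {0: 1, 2: 2.0, 3: 2.0}, i.e. recovery means (1, 1, 2, 2, 3, 5).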
In this paper, all random variables in one dimension incomplete data object are assigned the same variance, denoted by σ^2. Specifically, we choose the variance of X_obs as the variance of the imputed random variables. For many multidimensional data sets this strategy is reasonable, since the value of a missing data element tends to be related to its neighboring elements [16], while the variance reflects a property of the whole data object. Other strategies for setting the mean value and variance can also be adopted in our approach; the imputation strategy depends on the specific application scenario and is independent of our method. For instance, we could use the mean value of X_obs as the expectation of the random variables. For the distance function δ, we adopt the Euclidean distance, which is widely used as a similarity metric in the literature. However, the approach proposed in this paper can easily be extended to handle other similarity measurements.

4 Efficient approach for probabilistic similarity query on dimension incomplete data

In this section, we introduce an efficient approach for probabilistic similarity query on dimension incomplete data. Recall from Section 3 that the key point of this task is how to evaluate P[δ(Q, X) < r] efficiently while avoiding the enumeration of all possible cases. In order to speed up the query process, we utilize a gradual refinement search strategy. Specifically, we propose two pruning methods: (1) lower/upper bounds of the confidence, and (2) a probability triangle inequality, both of which are computationally efficient and proved to be correct. The overall framework is shown in Figure 1.

[Figure 1. Overall Query Process]

We first use the probability triangle inequality to evaluate the data objects in the database. In this step, some data objects are judged as true query results and some are filtered out. Next, we use the confidence lower/upper bounds to further evaluate the remaining candidates, from which some are determined to be true query results or true dismissals. Only those data objects that cannot be judged in the former two steps are evaluated by a naive verification algorithm, which is relatively slow but guarantees both completeness and correctness of the query results.

4.1 Bounds of probability confidence

This section provides the definition of the lower and upper bounds of the probability confidence, the proof of their correctness, and an efficient algorithm for calculating them. To find the bounds of the confidence, we treat the missing part and the observed part of the dimension incomplete data separately. Given a query Q and a certain recovery version X_rv of a dimension incomplete data object X_obs, we have

δ^2(Q, X_rv) = δ^2(Q_obs, X_obs) + δ^2(Q_mis, X_mis)   (4)

where Q_obs and Q_mis are the values of Q on the same dimensions as those of X_obs and X_mis, respectively. Since r ≥ 0, we have

P[δ(Q, X_rv) < r] = P[δ^2(Q_mis, X_mis) < r^2 − δ^2(Q_obs, X_obs)]   (5)

Obviously, δ^2(Q_obs, X_obs) is a real value for a given X_rv, while δ^2(Q_mis, X_mis) is a random variable depending on the imputation method. From Q, in total C_{|Q|}^{|X_obs|} incomplete versions with dimensionality |X_obs| can be derived by removing values on some dimensions; these are denoted by Q_obs. Then we can find the lower and upper distance bounds between the observed elements of X and Q:

δ_LBobs(Q, X_obs) = min_{|Q_obs| = |X_obs|} δ(Q_obs, X_obs)   (6)
δ_UBobs(Q, X_obs) = max_{|Q_obs| = |X_obs|} δ(Q_obs, X_obs)   (7)

Similarly, we can find the lower bound and upper bound distances for the missing elements:

δ_LBmis(Q, X_mis) = δ(argmin_{Q_mis} {δ(Q_mis, E(X_mis)) : |Q_mis| = |X_mis|}, X_mis)   (8)
δ_UBmis(Q, X_mis) = δ(argmax_{Q_mis} {δ(Q_mis, E(X_mis)) : |Q_mis| = |X_mis|}, X_mis)   (9)

where E(X_mis) = (µ_1, µ_2, ..., µ_{|X_mis|}) and µ_k is the mean value assigned by the imputation method on the k-th dimension of X_mis.

Example 2: Given a dimension incomplete data object X_obs = (2, 8, 7) and a query Q = (1, 4, 5, 6, 7), δ^2_LBobs(Q, X_obs) is (2 − 1)^2 + (8 − 6)^2 + (7 − 7)^2 = 5, corresponding to the recovery version (2, ?, ?, 8, 7), and δ^2_UBobs(Q, X_obs) is (2 − 1)^2 + (8 − 4)^2 + (7 − 5)^2 = 21, corresponding to the recovery version (2, 8, 7, ?, ?), where ? denotes an imputed random variable. For the imputed random variables X_mis = {x_1, x_2}, according to our imputation policy, E(x_1) and E(x_2) depend on the dimensions to be imputed. Then δ^2_LBmis(Q, X_mis) = (4 − x_1)^2 + (5 − x_2)^2 (with E(x_1) = E(x_2) = 5), corresponding to X_rv = (2, 5, 5, 8, 7), and δ^2_UBmis(Q, X_mis) = (5 − x_1)^2 + (6 − x_2)^2 (with E(x_1) = E(x_2) = 7.5), corresponding to X_rv = (2, 8, 7.5, 7.5, 7).

Based on the distance bounds above, we can find the lower and upper bounds of P[δ(Q, X_rv) < r] according to the following theorem.

Theorem 1 (Confidence Lower and Upper Bounds). Given a query Q, thresholds r and c, and an incomplete multidimensional data object X_obs whose complete form is denoted by X, we have
(a) P[δ(Q, X) < r] ≤ P[δ^2_LBmis(Q, X_mis) + δ^2_LBobs(Q, X_obs) < r^2];
(b) P[δ(Q, X) < r] ≥ P[δ^2_UBmis(Q, X_mis) + δ^2_UBobs(Q, X_obs) < r^2].

Proof. (a) For any recovery version X_rv of X_obs, according to Eq. 5, we have

P[δ^2(Q, X_rv) < r^2] = P[δ^2(Q_mis, X_mis) < r^2 − δ^2(Q_obs, X_obs)]   (10)
According to Eq. 6, we know

δ^2(Q_obs, X_obs) ≥ δ^2_LBobs(Q, X_obs)   (11)

Thus

P[δ^2(Q_mis, X_mis) < r^2 − δ^2_LBobs(Q, X_obs)] ≥ P[δ^2(Q_mis, X_mis) < r^2 − δ^2(Q_obs, X_obs)]   (12)

δ^2(Q_mis, X_mis)/σ^2 obeys a noncentral chi-square distribution with noncentrality parameter λ_Xrv = δ^2(Q_mis, E(X_mis))/σ^2, and δ^2_LBmis(Q, X_mis)/σ^2 also obeys a noncentral chi-square distribution, with noncentrality parameter denoted by λ_LBmis. According to Eq. 8, we know λ_LBmis ≤ λ_Xrv. At the same time,

these two random variables have the same degrees of freedom, |X_mis|. According to the properties of the noncentral chi-square distribution, we know

P[δ^2_LBmis(Q, X_mis) < r^2 − δ^2_LBobs(Q, X_obs)] ≥ P[δ^2(Q_mis, X_mis) < r^2 − δ^2_LBobs(Q, X_obs)]   (13)

Also considering Eq. 12, we get

P[δ^2_LBmis(Q, X_mis) + δ^2_LBobs(Q, X_obs) < r^2] ≥ P[δ^2(Q, X_rv) < r^2]   (14)

(b) The proof is similar to that of (a).

For simplicity, we define:

δ_LB(Q, X) = [δ^2_LBmis(Q, X_mis) + δ^2_LBobs(Q, X_obs)]^{1/2}   (15)
δ_UB(Q, X) = [δ^2_UBmis(Q, X_mis) + δ^2_UBobs(Q, X_obs)]^{1/2}   (16)

According to Theorem 1, P[δ_LB(Q, X) < r] and P[δ_UB(Q, X) < r] are the upper bound and lower bound of P[δ(Q, X) < r], respectively. These two probability bounds can be used for filtering in the query process. Specifically, we can select data objects with P[δ_UB(Q, X) < r] > c as true results and filter out data objects with P[δ_LB(Q, X) < r] ≤ c as true dismissals. The correctness of this pruning process is guaranteed by the above theorem.

To utilize this pruning process, we need efficient algorithms for (1) calculating the probabilities P[δ_LB(Q, X) < r] and P[δ_UB(Q, X) < r], and (2) finding the two distance bounds for the observed part and the two distance bounds for the missing part. The first sub-problem can be solved easily: since δ^2_LBmis(Q, X_mis)/σ^2 and δ^2_UBmis(Q, X_mis)/σ^2 obey noncentral chi-square distributions, these two probabilities can be calculated with the cumulative distribution function (cdf) of the noncentral chi-square distribution or by a table lookup.

Consider the second sub-problem. A naive method to compute any of the four bounds is extremely time-consuming, since we would have to enumerate all C_{|Q|}^{|X_obs|} recovery versions. Below we introduce a dynamic programming algorithm that obtains these four bounds in O(|X_obs|(|Q| − |X_obs|)^2) time. Algorithm 1 calculates δ_LBobs and δ_LBmis. After the algorithm is executed, the minimum element in the 2n-th column of T is δ^2_LBobs(Q, X_obs), and δ_LBmis(Q, X_mis) can be inferred from the assistant array S. In order to calculate δ_UBobs and δ_UBmis, only a small modification is needed: replace min in lines 17 and 21 with max, and replace argmin in line 11 with argmax. Note that the algorithm does not require building the entire table T.

Algorithm 1: Calculate δ_LBobs and δ_LBmis
INPUT: query Q, |Q| = m, and dimension incomplete data object X_obs, |X_obs| = n (0 < n < m).
OUTPUT: δ_LBobs(Q, X_obs) and δ_LBmis(Q, X_mis) (inferred from the assistant array S).
Initialization: Extend X_obs to X' (|X'| = 2n + 1), where
  X'_i = X_obs[1] if i = 1,
  X'_i = X_obs[n] if i = 2n + 1,
  X'_i = X_obs[i/2] if i mod 2 = 0,
  X'_i = (X_obs[(i−1)/2] + X_obs[(i+1)/2])/2 if i mod 2 = 1 and 1 < i < 2n + 1.
Construct two m × (2n + 1) matrices T and S, where element (i, j) of T is initialized to (Q_i − X'_j)^2 and S is an assistant array initialized with (0, 0) in each element.
1: for j = 1 to 2n + 1 do
2:   if j = 1 then
3:     for i = 1 to m − n do
4:       S[i][j] ← (i − 1, 1)
5:       if i > 1 then
6:         T[i][j] ← T[i][j] + T[i − 1][j]
7:       end if
8:     end for
9:   else if j > 2 and j mod 2 = 1 then
10:    for i = (j + 1)/2 + 1 to (j + 1)/2 + m − n − 1 do
11:      p ← argmin_{1 ≤ k ≤ (j + 1)/2} T[i − k][j − 2(k − 1)]
12:      T[i][j] ← T[i][j] + T[i − p][j − 2(p − 1)]
13:      S[i][j] ← (i − p, j − 2(p − 1))
14:    end for
15:  else if j > 2 and j mod 2 = 0 then
16:    for i = j/2 to j/2 + m − n do
17:      T[i][j] ← T[i][j] + min_{(j − 2)/2 ≤ k ≤ i} T[k][j − 2]
18:    end for
19:  end if
20: end for
21: return (min_{n ≤ k ≤ m} T[k][2n])^{1/2}

Thus its computational complexity is O(|X_obs|(|Q| − |X_obs|)^2). Compared with the naive method, which has to enumerate all C_{|Q|}^{|X_obs|} dimension combinations, this achieves a significant improvement.
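To make the two sub-problems above concrete, here is a small Python sketch of ours (it assumes SciPy is available and is not code from the paper). It computes the observed-part bounds of Eqs. 6-7 by the naive enumeration that Algorithm 1 is designed to avoid, so it is only usable for tiny m but handy for checking the dynamic program; and it evaluates a probability bound via the noncentral chi-square cdf, with the noncentrality parameter lam assumed to be precomputed from the minimizing (or maximizing) alignment.

import itertools
from scipy.stats import ncx2  # noncentral chi-square distribution

def bounds_obs_naive(q, x_obs):
    # delta^2_LBobs and delta^2_UBobs (Eqs. 6-7) by enumerating all
    # C(|Q|, |X_obs|) alignments of the observed values with Q.
    d2 = [sum((q[i] - x) ** 2 for i, x in zip(dims, x_obs))
          for dims in itertools.combinations(range(len(q)), len(x_obs))]
    return min(d2), max(d2)

def prob_bound(r, d2_obs_bound, lam, df, sigma2):
    # P[delta^2_mis + d2_obs_bound < r^2], where
    # delta^2_mis / sigma^2 ~ noncentral chi-square(df, lam).
    t = (r ** 2 - d2_obs_bound) / sigma2
    return ncx2.cdf(t, df, lam) if t > 0 else 0.0

# The observed part of Example 2: Q = (1, 4, 5, 6, 7), X_obs = (2, 8, 7).
print(bounds_obs_naive((1, 4, 5, 6, 7), (2, 8, 7)))  # (5, 21)

Algorithm 1 obtains the same two values (together with the matched means needed for the noncentrality parameter) without the exponential enumeration.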
Example 3: Given a query Q = (3, 7, 1, 6, 5) and X_obs = (2, 4, 8), we have X' = (2, 2, 3, 4, 6, 8, 8), where the elements at odd positions are added by the imputation strategy. The initialized T is shown in Figure 2(a). The algorithm starts the calculation from the bottom of the first column and proceeds toward the top right. In Step 1.1, T[1][1] = 1 remains unchanged, and T[2][1] = 25 is replaced with T[2][1] + T[1][1] = 25 + 1 = 26. In the second column, we do nothing.

In Step 1.2, we deal with the third column of T: T[3][3] = 4 is replaced with T[3][3] + min{T[2][3], T[1][1]} = 4 + min{16, 1} = 5. The remaining steps are shown in Figure 2. Finally, from the 6th column in Figure 2(g), we find that δ^2_LBobs(Q, X_obs) is 4, the minimal element in that column. In order to find δ^2_LBmis(Q, X_mis), we first find the minimal value among the top elements of columns 1, 3, 5, 7 of T, i.e., min{26, 5, 1, 10}; T[4][5] = 1 is the minimum. Thus the imputed variable with mean value 6 is matched to the value 6 in Q. Then S[4][5] = (1, 1) indicates that the previous match is in the first row and first column of T, where the corresponding imputed variable has mean value 2 and is matched to the value 3 in Q. Thus δ^2_LBmis(Q, X_mis) = (3 − x_1)^2 + (6 − x_2)^2, where x_1 and x_2 are imputed random variables with mean values 2 and 6, respectively.

[Figure 2. Example of getting δ_LBobs and δ_LBmis: (a) initialized T, (b)-(g) the table T after each step, (h) the assistant array S]

4.2 Probability triangle inequality

This section presents a probability triangle inequality, which is also used for pruning results during the query process.

Theorem 2 (Probability Triangle Inequality). Given a query Q and a complete multidimensional data object R (|Q| = |R|), for a dimension incomplete data object X_obs whose underlying complete version is X, we have:
(a) P[δ(Q, X) < r] < P[δ_LB(R, X) − δ(Q, R) < r];
(b) P[δ(Q, X) < r] > P[δ_UB(R, X) + δ(Q, R) < r].

Proof. (a) From Theorem 1, we have

P[δ_LB(R, X) − δ(Q, R) < r] ≥ P[δ(R, X) < δ(Q, R) + r]   (17)

Thus, for any recovery version X_rv of X, we have

P[δ(R, X_rv) − δ(Q, R) < r] ≤ P[δ_LB(R, X) − δ(Q, R) < r]   (18)

In a metric space the triangle inequality holds, thus

δ(R, X_rv) − δ(Q, R) < δ(Q, X_rv)   (19)

Thus we have

P[δ(Q, X_rv) < r] < P[δ(R, X_rv) − δ(Q, R) < r]   (20)

Therefore,

P[δ_LB(R, X) − δ(Q, R) < r] > P[δ(Q, X) < r]   (21)

(b) The proof is similar to that of (a).

Based on the theorem above, with the help of an assistant data object R, some data objects in the database can be determined to be true results (when P[δ_UB(R, X) + δ(Q, R) < r] ≥ c) or true dismissals (when P[δ_LB(R, X) − δ(Q, R) < r] ≤ c). Since the required dynamic programming computation can be finished in advance, without knowing the query, this evaluation can be done in O(1) time per data object.

Example 4: Given Q = (2, 6, 2, 5, 5), R = (3, 7, 1, 6, 5), X_obs = (2, 4, 8), r = 2, and c = 0.2. Referring to Example 3, we know δ^2_LBobs(R, X_obs) = 4 and δ^2_LBmis(R, X_mis) = (3 − x_1)^2 + (6 − x_2)^2, where x_1 and x_2 are imputed random variables with mean values 2 and 6, respectively. Their variances (σ^2) are both 6.222 (the variance of X_obs). Thus δ^2_LBmis(R, X_mis)/σ^2 obeys a noncentral chi-square distribution with 2 degrees of freedom and noncentrality parameter [(3 − 2)^2 + (6 − 6)^2]/6.222 ≈ 0.16. Then P[δ_LB(R, X) − δ(Q, R) < r] = P{δ^2_LBmis(R, X_mis)/σ^2 < [(r + 2)^2 − 4]/σ^2} = P[δ^2_LBmis(R, X_mis)/σ^2 < 0.324] = 0.1379 < c. Therefore, based on Theorem 2, we know P[δ(Q, X) < r] < c, indicating that X_obs is not a result of query Q.

In order to get P[δ_LB(R, X) − δ(Q, R) < r] and P[δ_UB(R, X) + δ(Q, R) < r] efficiently, two sets of values need to be stored for each (R, X_obs) pair: one is δ_LBobs(R, X_obs) and δ_UBobs(R, X_obs); the other is the noncentrality parameters of the random variables δ^2_LBmis(R, X_mis)/σ^2 and δ^2_UBmis(R, X_mis)/σ^2. Thus the number of assistant data objects controls the tradeoff between query processing time and storage.
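At query time, the triangle-inequality test of Theorem 2 combines only these precomputed per-pair statistics with δ(Q, R). The following Python sketch is our own illustration of that per-object decision (hypothetical names; ncx2 as in the earlier sketch; pre holds the values stored offline for the pair (R, X_obs)):

from math import dist  # Euclidean distance (Python 3.8+)
from scipy.stats import ncx2

def triangle_filter(q, r_assist, r_thr, c, pre):
    # pre: dict with d2_lb_obs, d2_ub_obs, lam_lb, lam_ub, df, sigma2,
    # all precomputed offline for the pair (R, X_obs).
    d_qr = dist(q, r_assist)
    # (a) dismiss X if even the optimistic bound misses the threshold:
    # P[delta_LB(R, X) - delta(Q, R) < r] <= c.
    t = ((r_thr + d_qr) ** 2 - pre["d2_lb_obs"]) / pre["sigma2"]
    p_lb = ncx2.cdf(t, pre["df"], pre["lam_lb"]) if t > 0 else 0.0
    if p_lb <= c:
        return "dismiss"
    # (b) accept X if even the pessimistic bound stays within it:
    # P[delta_UB(R, X) + delta(Q, R) < r] >= c.
    if r_thr > d_qr:
        t = ((r_thr - d_qr) ** 2 - pre["d2_ub_obs"]) / pre["sigma2"]
        p_ub = ncx2.cdf(t, pre["df"], pre["lam_ub"]) if t > 0 else 0.0
        if p_ub >= c:
            return "accept"
    return "undecided"  # fall through to the confidence-bound pruners

Since δ(Q, R) is computed once per assistant object and shared across all data objects, each additional data object costs only the two cdf evaluations.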
4.3 Our similarity query method

Our approach (termed PSQ in this work) for handling the problem of probabilistic similarity query on dimension incomplete data is described in Algorithm 2. In this algorithm, we utilize a gradual refinement search strategy with the aforementioned pruners to speed up the query process. Specifically, we first use the assistant objects in S_R to examine data objects in the database based on the probability triangle inequality. Then the confidence lower and upper bounds are used to further evaluate the remaining candidates. Only those data objects that the former two steps

cannot judge are evaluated by the naive verification algorithm.

Algorithm 2: Probabilistic Similarity Query
INPUT: the dimension incomplete database D, query Q, the set of assistant data objects S_R.
OUTPUT: the result set S_result.
1: for all X in D do
2:   for all R in S_R do
3:     if P[δ_LB(R, X) − δ(Q, R) < r] ≤ c then
4:       discard X
5:       go to the next X in D
6:     else if P[δ_UB(R, X) + δ(Q, R) < r] ≥ c then
7:       add X to S_result
8:       go to the next X in D
9:     end if
10:  end for
11:  if P[δ_LB(Q, X) < r] ≤ c then
12:    discard X
13:  else if P[δ_UB(Q, X) < r] > c then
14:    add X to S_result
15:  else
16:    do naive confidence evaluation
17:  end if
18: end for

Recall from Eq. 3 in Section 3 that the straightforward way for confidence evaluation is to examine all possible recovery versions of the incomplete data. Here we also provide an optimized enumeration process that prunes some cases safely. From Eq. 4, we find that if δ(Q_obs, X_obs) ≥ r, then P[δ^2(Q_mis, X_mis) < r^2 − δ^2(Q_obs, X_obs)] will be 0. Intuitively, if we can judge that δ_LBobs(Q, X_obs) ≥ r, and thus δ(Q_obs, X_obs) ≥ r, there is no need to examine X_obs. Furthermore, according to Eq. 6, for a given query Q and an incomplete data object X_obs (|X_obs| < |Q|), if there is a Q' derived from Q (|Q'| ≥ |X_obs|) that satisfies δ_LBobs(Q', X_obs) ≥ r, then for all Q'_obs derived from Q' (|Q'_obs| ≥ |X_obs|) we have δ_LBobs(Q'_obs, X_obs) ≥ r. Thus we only need to evaluate the part of the C_{|Q|}^{|X_obs|} recovery versions that yields confidence larger than 0, in the following way: enumerate the C_{|Q|}^{|X_obs|} recovery versions with a recursive procedure; for a Q' derived from Q (|Q'| ≥ |X_obs|), if δ_LBobs(Q', X_obs) ≥ r, none of the Q'_obs derived from Q' are evaluated. Due to space limitations, we do not discuss the details of this recursive procedure.

When our algorithm is used, usually a large portion of the time is consumed by the naive verification process. For higher efficiency, we can skip the naive verification process and simply regard the remaining candidates as query results (or dismissals, depending on the precision and recall requirements of the query). Such a strategy is reasonable for applications where the two probability confidence bounds are effective for pruning; in this case, the simplified algorithm does not cause a remarkable decrease in query result quality. Our experimental results in Section 5 justify the effectiveness of this simplified strategy.

5 Experimental study

In this section, we present the experimental results. The goals of our experiments are to (a) evaluate the effectiveness and the efficiency of the overall method for probabilistic similarity query on dimension incomplete data and the various key techniques proposed, (b) study the influence of different parameters and data sets on our method, and (c) compare the performance of our approach with other solutions for handling dimension incomplete data.

5.1 Data sets

Two real data sets are used in our experiments. The first one is the Standard & Poor's 500 index historical stock data. This data set contains stock prices of 541 companies collected over one year. We use the opening price data of each stock, which is a vector of 251 dimensions. Since the final step of the query process is very time-consuming, we need to sample the original data to obtain a lower-dimensional data set. Thus, we construct a new data set with 30 dimensions by segmenting the data in the Standard & Poor's 500 index historical stock data set, resulting in a total of 541 × 8 = 4,328 data objects with 30 dimensions (denoted by S&P500).
The other data set contains 32-dimensional image features extracted from 68,040 images (denoted by IMAGE). For both data sets, the original data objects are complete, and similarity query results on the complete data are used as ground truth for evaluating the precision and recall of our approach. We construct the dimension incomplete data sets by randomly removing some dimensions of each data object. The number of missing data elements is controlled by the missing ratio, which is the ratio of the number of missing dimensions over the original dimensionality. In total, 100 data objects, randomly sampled from the data set, are used as queries.

5.2 Results and analysis

5.2.1 Effectiveness of probabilistic similarity query on dimension incomplete data

The experiments in this section evaluate the effectiveness of our method on various data sets and with various

parameter settings. Figures 3, 4, 5, and 6 show the quality of the query results, measured by precision and recall.

[Figure 3. Query precision on the S&P500 data set: (a) 5%, (b) 10%, (c) 15% missing ratio]
[Figure 4. Query recall on the S&P500 data set: (a) 5%, (b) 10%, (c) 15% missing ratio]
[Figure 5. Query precision on the IMAGE data set: (a) 5%, (b) 10%, (c) 15% missing ratio]
[Figure 6. Query recall on the IMAGE data set: (a) 5%, (b) 10%, (c) 15% missing ratio]

It can be observed from the results that our method (PSQ) achieves satisfactory performance in querying dimension incomplete data. In particular, although both precision and recall decrease as the missing ratio increases, even when 15% of the dimensions are missing our method still achieves high precision and recall on the S&P500 data set. For the image histogram data set, if the distance threshold and the confidence threshold are well chosen, good query quality can also be achieved. This justifies the usefulness of our approach in real applications. We also compare our approach with a simple method, which (1) randomly removes some elements of the query to construct a new query with the same dimensionality as the dimension incomplete candidates in the database, and (2) uses the Euclidean distance to measure whether their distance is lower than the threshold r. From the results, we can see that our method better reflects the distance and thus achieves better query quality. Moreover, precision and recall on the S&P500 data set are higher than those on the image histogram data set. This is due to the intrinsic characteristics of the two data sets: the S&P500 data set has the typical characteristics of time series, with excellent correlation between consecutive data elements, while the image histogram data does not have this property. Therefore, the imputation method used in our experiments fits the S&P500 data set better. This also shows the importance of choosing a suitable imputation method when handling dimension incomplete data.

5.2.2 Effect of the confidence threshold

It can be observed from Figures 3, 4, 5, and 6 that recall decreases and precision increases as the confidence threshold grows. To make this clearer, Figure 7 shows the relationship between the confidence threshold c and precision/recall (missing ratio = 0.1; r = 60 for S&P500). This experiment also indicates that, by setting a proper c, our method is able to achieve both good precision and good recall on real data sets.

[Figure 7. Confidence threshold vs. precision/recall: (a) S&P500 data set, (b) IMAGE data set]

5.2.3 Effectiveness of different pruners

In this section, we study the usefulness of the four pruners proposed in this paper by examining their pruning power, measured by N_definite/N_processed, where N_processed is the number of data objects in the database and N_definite is the number of data objects judged as dismissals or search results by the pruner. Figure 8 shows the pruning power of the probability triangle inequality with various numbers of assistant data objects (c = 0.2). For the S&P500 data, even when only 10 assistant data objects are used, the pruning power is more than 60%. For the image histogram data, in most cases the pruning power of the probability triangle inequality is more than 20% with 20 assistant data objects. The results show that the probability triangle inequality has good pruning power in the query process while involving only

a few assistant data objects. Moreover, the performance improves when more assistant data objects are available; but after the pruning power reaches a certain level, increasing the number of assistant data objects has no significant further impact.

[Figure 8. Pruning power of the probability triangle inequality vs. number of assistant objects, with curves for different distance thresholds r: (a)-(c) S&P500 at 5%, 10%, 15% missing ratio; (d)-(f) IMAGE at 5%, 10%, 15% missing ratio]

We also examined the pruning power of the four pruners proposed in this paper: the probability triangle inequality using the confidence lower bound and the confidence upper bound (denoted by pruner1 and pruner2, respectively), and the confidence lower and upper bounds (denoted by pruner3 and pruner4, respectively). Figure 9 shows the pruning power of each pruner for various r (missing ratio = 10%, c = 0.1, 20 assistant objects). Firstly, this justifies the usefulness of each pruner proposed in this work. Secondly, we find that for the S&P500 data set about 90% of the data in total can be pruned, which means only a small part of the data needs naive verification; for the image histogram data set, in the worst case 50% of the data needs naive verification. Thirdly, the pruning power of these four pruners is influenced significantly by the threshold r. When a smaller r is specified (i.e., the user wants a relatively small number of query results), more data are pruned by the two lower-bound-based pruners, pruner1 and pruner3. In contrast, a larger distance threshold produces larger pruning power for pruner2 and pruner4.

[Figure 9. Pruning power of the four pruners: (a) S&P500 data set (r = 40), (b) IMAGE data set]

Since the time complexity of naive verification is poor, we study whether naive verification is necessary. We try two simplified verification strategies: for data objects that the former four pruners cannot judge, strategy Pos simply outputs them as query results, while strategy Neg, by contrast, judges them as dismissals. Obviously, Pos may produce more false positives, while Neg may produce more false negatives. The query quality of these two strategies and of doing naive verification (denoted by DoN) is shown in Table 1 (c = 0.1).

[Table 1. Comparison of query quality: Neg, Pos, and DoN at missing ratios 5%, 10%, and 15%]

From the results, we find that for the S&P500 data set the precision and recall without naive verification are very close to those with naive verification. For the image histogram data, however, the query quality depends more heavily on naive verification. It can be concluded that, for some data sets, high query quality can be achieved without the slow naive verification process.

5.2.4 Performance analysis

There are mainly three steps in our approach: (1) pruning with the probability triangle inequality; (2) pruning with the confidence lower and upper bounds; (3) naive confidence verification. We test the time costs of these steps using the S&P500 data set on a computer with a 3.0 GHz CPU and 1.0 GB RAM, and average them over all queries. The results are shown in Figure 10.
[Figure 10. Time cost (microseconds) of the three steps (confidence bounds pruning, probability triangle inequality, naive confidence verification) vs. number of missing elements]

In particular, naive confidence verification takes much longer than the other two steps. However, Table 1 in Section 5.2.3 indicates that the naive confidence verification process is not really necessary for the S&P500 data set, which means the efficiency of the overall query processing system can be improved significantly without losing much performance.

6 Conclusions

This paper addresses the similarity query problem on dimension incomplete data, which is of both practical importance and technical challenge. We adopt a probabilistic framework to model this problem. In order to solve it efficiently, an approach is introduced based on the proposed lower/upper confidence bounds and the probability triangle inequality. The proposed methods are proved to be theoretically correct. Given a query Q and a database containing dimension incomplete data objects X_obs, compared with the brute-force method whose time complexity is O(C_{|Q|}^{|X_obs|}), our method achieves a significant improvement: most data objects can be handled in O(|X_obs|(|Q| − |X_obs|)^2) or even O(1) time. Experiments are conducted on two real data sets. The results indicate that (1) our approach achieves satisfactory performance in querying dimension incomplete data, and (2) both the probability triangle inequality and the confidence bounds have good pruning power and improve query efficiency significantly. This verifies that our method is promising for handling dimension incomplete data. Our future work will focus on the following aspects: since a probability triangle inequality holds, we plan to develop an index structure to make the query process faster; besides, it will be interesting and useful to extend our query strategy to other similarity measurements.

Acknowledgment

The work was supported by NSFC and 863 funding 2007AA01Z156.

References

[1] C. C. Aggarwal and S. Parthasarathy. Mining massively incomplete data sets by conceptual reconstruction. In Proceedings of ACM SIGKDD '01, 2001.
[2] O. Benjelloun, A. Das Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In Proceedings of VLDB '06, 2006.
[3] B. Bollobas, G. Das, D. Gunopulos, and H. Mannila. Time-series similarity problems and well-separated geometric sets. In Proceedings of SCG '97, 1997.
[4] D. Burdick, P. M. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. In Proceedings of VLDB '05, 2005.
[5] G. Canahuate, M. Gibas, and H. Ferhatosmanoglu. Indexing incomplete databases. In Proceedings of EDBT '06, 2006.
[6] R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In Proceedings of ACM SIGMOD '03, 2003.
[7] R. Cheng and S. Prabhakar. Managing uncertainty in sensor databases. ACM SIGMOD Record, 32(4):41-46, 2003.
[8] D. Gu and Y. Gao. Incremental gradient descent imputation method for missing data in learning classifier systems. In Proceedings of GECCO '05, pages 72-73, 2005.
[9] J. Gu and X. Jin. Similarity search over incomplete symbolic sequences. In Proceedings of DEXA '07, 2007.
[10] M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking queries on uncertain data: a probabilistic threshold approach. In Proceedings of ACM SIGMOD '08, 2008.
[11] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. MCDB: a Monte Carlo approach to managing uncertain data. In Proceedings of ACM SIGMOD '08, 2008.
[12] E. Keogh. Exact indexing of dynamic time warping. In Proceedings of VLDB '02, 2002.
[13] E. Keogh and M. Pazzani. Scaling up dynamic time warping to massive datasets.
In Proceedings of ECML/PKDD '99, 1999.
[14] S. Khanna and W.-C. Tan. On computing functions with uncertainty. In Proceedings of ACM PODS '01, 2001.
[15] K. Lakshminarayan, S. A. Harp, and T. Samad. Imputation of missing data in industrial databases. Applied Intelligence, 1999.
[16] R. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley Series in Probability and Statistics, 1st edition, 1987.
[17] B. C. Ooi, C. H. Goh, and K.-L. Tan. Fast high-dimensional data search in incomplete databases. In Proceedings of VLDB '98, 1998.
[18] R. K. Pearson. The problem of disguised missing data. ACM SIGKDD Explorations Newsletter, 8(1):83-92, 2006.
[19] J. Pei, M. Hua, Y. Tao, and X. Lin. Query answering techniques on uncertain and probabilistic data: tutorial summary. In Proceedings of ACM SIGMOD '08, 2008.
[20] J. Pei, B. Jiang, X. Lin, and Y. Yuan. Probabilistic skylines on uncertain data. In Proceedings of VLDB '07, pages 15-26, 2007.
[21] A. D. Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In Proceedings of ICDE '06, page 7, 2006.
[22] I. Wasito and B. Mirkin. Nearest neighbour approach in the least-squares data imputation algorithms. Information Sciences, 169:1-25, 2005.
[23] D. Williams, X. Liao, Y. Xue, and L. Carin. Incomplete-data classification using logistic regression. In Proceedings of ICML '05, 2005.


More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Recent advances in Time Series Classification

Recent advances in Time Series Classification Distance Shapelet BoW Kernels CCL Recent advances in Time Series Classification Simon Malinowski, LinkMedia Research Team Classification day #3 S. Malinowski Time Series Classification 21/06/17 1 / 55

More information

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT) Metric Embedding of Task-Specific Similarity Greg Shakhnarovich Brown University joint work with Trevor Darrell (MIT) August 9, 2006 Task-specific similarity A toy example: Task-specific similarity A toy

More information

High-Dimensional Indexing by Distributed Aggregation

High-Dimensional Indexing by Distributed Aggregation High-Dimensional Indexing by Distributed Aggregation Yufei Tao ITEE University of Queensland In this lecture, we will learn a new approach for indexing high-dimensional points. The approach borrows ideas

More information

Fast Nonnegative Matrix Factorization with Rank-one ADMM

Fast Nonnegative Matrix Factorization with Rank-one ADMM Fast Nonnegative Matrix Factorization with Rank-one Dongjin Song, David A. Meyer, Martin Renqiang Min, Department of ECE, UCSD, La Jolla, CA, 9093-0409 dosong@ucsd.edu Department of Mathematics, UCSD,

More information

Gaussian Mixture Distance for Information Retrieval

Gaussian Mixture Distance for Information Retrieval Gaussian Mixture Distance for Information Retrieval X.Q. Li and I. King fxqli, ingg@cse.cuh.edu.h Department of omputer Science & Engineering The hinese University of Hong Kong Shatin, New Territories,

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Tighter Low-rank Approximation via Sampling the Leveraged Element

Tighter Low-rank Approximation via Sampling the Leveraged Element Tighter Low-rank Approximation via Sampling the Leveraged Element Srinadh Bhojanapalli The University of Texas at Austin bsrinadh@utexas.edu Prateek Jain Microsoft Research, India prajain@microsoft.com

More information

Leverage Sparse Information in Predictive Modeling

Leverage Sparse Information in Predictive Modeling Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Nearest Neighbor Search for Relevance Feedback

Nearest Neighbor Search for Relevance Feedback Nearest Neighbor earch for Relevance Feedbac Jelena Tešić and B.. Manjunath Electrical and Computer Engineering Department University of California, anta Barbara, CA 93-9 {jelena, manj}@ece.ucsb.edu Abstract

More information

Probabilistic Frequent Itemset Mining in Uncertain Databases

Probabilistic Frequent Itemset Mining in Uncertain Databases Proc. 5th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'9), Paris, France, 29. Probabilistic Frequent Itemset Mining in Uncertain Databases Thomas Bernecker, Hans-Peter Kriegel, Matthias

More information

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018 From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction

More information

The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization

The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization Jia Wang, Shiyan Hu Department of Electrical and Computer Engineering Michigan Technological University Houghton, Michigan

More information

Hybrid particle swarm algorithm for solving nonlinear constraint. optimization problem [5].

Hybrid particle swarm algorithm for solving nonlinear constraint. optimization problem [5]. Hybrid particle swarm algorithm for solving nonlinear constraint optimization problems BINGQIN QIAO, XIAOMING CHANG Computers and Software College Taiyuan University of Technology Department of Economic

More information

Collaborative Filtering. Radek Pelánek

Collaborative Filtering. Radek Pelánek Collaborative Filtering Radek Pelánek 2017 Notes on Lecture the most technical lecture of the course includes some scary looking math, but typically with intuitive interpretation use of standard machine

More information

Nearest Neighbor Search with Keywords in Spatial Databases

Nearest Neighbor Search with Keywords in Spatial Databases 776 Nearest Neighbor Search with Keywords in Spatial Databases 1 Sphurti S. Sao, 2 Dr. Rahila Sheikh 1 M. Tech Student IV Sem, Dept of CSE, RCERT Chandrapur, MH, India 2 Head of Department, Dept of CSE,

More information

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression

More information

Estimating the Selectivity of tf-idf based Cosine Similarity Predicates

Estimating the Selectivity of tf-idf based Cosine Similarity Predicates Estimating the Selectivity of tf-idf based Cosine Similarity Predicates Sandeep Tata Jignesh M. Patel Department of Electrical Engineering and Computer Science University of Michigan 22 Hayward Street,

More information

An Approach to Classification Based on Fuzzy Association Rules

An Approach to Classification Based on Fuzzy Association Rules An Approach to Classification Based on Fuzzy Association Rules Zuoliang Chen, Guoqing Chen School of Economics and Management, Tsinghua University, Beijing 100084, P. R. China Abstract Classification based

More information

Factor Analysis (FA) Non-negative Matrix Factorization (NMF) CSE Artificial Intelligence Grad Project Dr. Debasis Mitra

Factor Analysis (FA) Non-negative Matrix Factorization (NMF) CSE Artificial Intelligence Grad Project Dr. Debasis Mitra Factor Analysis (FA) Non-negative Matrix Factorization (NMF) CSE 5290 - Artificial Intelligence Grad Project Dr. Debasis Mitra Group 6 Taher Patanwala Zubin Kadva Factor Analysis (FA) 1. Introduction Factor

More information

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Data Analytics Beyond OLAP. Prof. Yanlei Diao Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of

More information

Prediction of Citations for Academic Papers

Prediction of Citations for Academic Papers 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Mahsa Orang Nematollaah Shiri 27th International Conference on Scientific and Statistical Database

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

M chi h n i e n L e L arni n n i g Decision Trees Mac a h c i h n i e n e L e L a e r a ni n ng

M chi h n i e n L e L arni n n i g Decision Trees Mac a h c i h n i e n e L e L a e r a ni n ng 1 Decision Trees 2 Instances Describable by Attribute-Value Pairs Target Function Is Discrete Valued Disjunctive Hypothesis May Be Required Possibly Noisy Training Data Examples Equipment or medical diagnosis

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

Confidence Intervals for the Sample Mean

Confidence Intervals for the Sample Mean Confidence Intervals for the Sample Mean As we saw before, parameter estimators are themselves random variables. If we are going to make decisions based on these uncertain estimators, we would benefit

More information

Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval

Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval Scribes: Ellis Weng, Andrew Owens February 11, 2010 1 Introduction In this lecture, we will introduce our second paradigm for

More information

cient and E ective Similarity Search based on Earth Mover s Distance

cient and E ective Similarity Search based on Earth Mover s Distance 36th International Conference on Very Large Data Bases E cient and E ective Similarity Search over Probabilistic Data based on Earth Mover s Distance Jia Xu 1, Zhenjie Zhang 2, Anthony K.H. Tung 2, Ge

More information

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Mahsa Orang Nematollaah Shiri 27th International Conference on Scientific and Statistical Database

More information

IBM Research Report. A Lower Bound on the Euclidean Distance for Fast Nearest Neighbor Retrieval in High-dimensional Spaces

IBM Research Report. A Lower Bound on the Euclidean Distance for Fast Nearest Neighbor Retrieval in High-dimensional Spaces RC24859 (W0909-03) September 0, 2009 Computer Science IBM Research Report A Lower Bound on the Euclidean Distance for Fast Nearest Neighbor Retrieval in High-dimensional Spaces George Saon, Peder Olsen

More information

Group Pattern Mining Algorithm of Moving Objects Uncertain Trajectories

Group Pattern Mining Algorithm of Moving Objects Uncertain Trajectories INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL ISSN 1841-9836, 10(3):428-440, June, 2015. Group Pattern Mining Algorithm of Moving Objects Uncertain Trajectories S. Wang, L. Wu, F. Zhou, C.

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

DECISION TREE BASED QUALITATIVE ANALYSIS OF OPERATING REGIMES IN INDUSTRIAL PRODUCTION PROCESSES *

DECISION TREE BASED QUALITATIVE ANALYSIS OF OPERATING REGIMES IN INDUSTRIAL PRODUCTION PROCESSES * HUNGARIAN JOURNAL OF INDUSTRIAL CHEMISTRY VESZPRÉM Vol. 35., pp. 95-99 (27) DECISION TREE BASED QUALITATIVE ANALYSIS OF OPERATING REGIMES IN INDUSTRIAL PRODUCTION PROCESSES * T. VARGA, F. SZEIFERT, J.

More information

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Indexes for Multimedia Data 14 Indexes for Multimedia

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

Guaranteeing the Accuracy of Association Rules by Statistical Significance

Guaranteeing the Accuracy of Association Rules by Statistical Significance Guaranteeing the Accuracy of Association Rules by Statistical Significance W. Hämäläinen Department of Computer Science, University of Helsinki, Finland Abstract. Association rules are a popular knowledge

More information

Finding Top-k Preferable Products

Finding Top-k Preferable Products JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 Finding Top-k Preferable Products Yu Peng, Raymond Chi-Wing Wong and Qian Wan Abstract The importance of dominance and skyline analysis has been

More information

On Improving the k-means Algorithm to Classify Unclassified Patterns

On Improving the k-means Algorithm to Classify Unclassified Patterns On Improving the k-means Algorithm to Classify Unclassified Patterns Mohamed M. Rizk 1, Safar Mohamed Safar Alghamdi 2 1 Mathematics & Statistics Department, Faculty of Science, Taif University, Taif,

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Advances in Locally Varying Anisotropy With MDS

Advances in Locally Varying Anisotropy With MDS Paper 102, CCG Annual Report 11, 2009 ( 2009) Advances in Locally Varying Anisotropy With MDS J.B. Boisvert and C. V. Deutsch Often, geology displays non-linear features such as veins, channels or folds/faults

More information

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

P leiades: Subspace Clustering and Evaluation

P leiades: Subspace Clustering and Evaluation P leiades: Subspace Clustering and Evaluation Ira Assent, Emmanuel Müller, Ralph Krieger, Timm Jansen, and Thomas Seidl Data management and exploration group, RWTH Aachen University, Germany {assent,mueller,krieger,jansen,seidl}@cs.rwth-aachen.de

More information

UAPD: Predicting Urban Anomalies from Spatial-Temporal Data

UAPD: Predicting Urban Anomalies from Spatial-Temporal Data UAPD: Predicting Urban Anomalies from Spatial-Temporal Data Xian Wu, Yuxiao Dong, Chao Huang, Jian Xu, Dong Wang and Nitesh V. Chawla* Department of Computer Science and Engineering University of Notre

More information

Day 5: Generative models, structured classification

Day 5: Generative models, structured classification Day 5: Generative models, structured classification Introduction to Machine Learning Summer School June 18, 2018 - June 29, 2018, Chicago Instructor: Suriya Gunasekar, TTI Chicago 22 June 2018 Linear regression

More information

PROBABILISTIC SKYLINE QUERIES OVER UNCERTAIN MOVING OBJECTS. Xiaofeng Ding, Hai Jin, Hui Xu. Wei Song

PROBABILISTIC SKYLINE QUERIES OVER UNCERTAIN MOVING OBJECTS. Xiaofeng Ding, Hai Jin, Hui Xu. Wei Song Computing and Informatics, Vol. 32, 2013, 987 1012 PROBABILISTIC SKYLINE QUERIES OVER UNCERTAIN MOVING OBJECTS Xiaofeng Ding, Hai Jin, Hui Xu Services Computing Technology and System Lab Cluster and Grid

More information

More on Unsupervised Learning

More on Unsupervised Learning More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data

More information

Evaluation of probabilistic queries over imprecise data in. constantly-evolving environments

Evaluation of probabilistic queries over imprecise data in. constantly-evolving environments Information Systems 32 (2007) 104 130 www.elsevier.com/locate/infosys Evaluation of probabilistic queries over imprecise data in $, $$ constantly-evolving environments Reynold Cheng a,b,, Dmitri V. Kalashnikov

More information