Probabilistic Similarity Query on Dimension Incomplete Data


2009 Ninth IEEE International Conference on Data Mining

Probabilistic Similarity Query on Dimension Incomplete Data

Wei Cheng, School of Software, Tsinghua University
Xiaoming Jin, School of Software, Tsinghua University
Jian-Tao Sun, Microsoft Research Asia

Abstract

Retrieving similar data has drawn many research efforts in the literature due to its importance in data mining, databases, and information retrieval. The problem is challenging when the data is incomplete. In previous research, data incompleteness refers to the fact that data values for some dimensions are unknown. However, in many practical applications (e.g., data collected by a sensor network in a harsh environment), not only data values but even data dimension information may be missing, which makes most similarity query algorithms infeasible. In this work, we propose the novel problem of similarity query on dimension incomplete data and adopt a probabilistic framework to model it. Users can give a distance threshold and a probability threshold to specify their retrieval requirements: the distance threshold specifies the allowed distance between query and data objects, and the probability threshold requires that the retrieval results satisfy the distance condition at least with the given probability. Instead of enumerating all possible cases to recover the missing dimensions, we propose an efficient approach that speeds up the retrieval process by leveraging the inherent relations between the query and dimension incomplete data objects. During the query process, we estimate lower/upper bounds of the probability that the query is satisfied by a given data object, and utilize these bounds to filter irrelevant data objects efficiently. Furthermore, a probability triangle inequality is proposed to further speed up query processing. Experiments on real data sets verify that the proposed similarity query method is effective and efficient on dimension incomplete data.

1 Introduction

Multidimensional data, such as time series and feature vectors extracted from images, are widely used in various applications. Similarity query on multidimensional data (i.e., retrieving similar data objects from a multidimensional database given a data object as an input query) has attracted much research interest, as it plays an important role in many data mining, database, and information retrieval tasks. This problem is challenging when the data is incomplete [1, 4, 23], which may be caused by various reasons. For example, in sensor network applications, the collected data may become incomplete when sensors do not work properly or when errors occur during the data transfer process. In the literature, the data incompleteness problem has been well researched (e.g., see [17, 5]). In these works, data incompleteness usually refers to missing values: data values for some dimensions are unknown (or uncertain), but it is known, for each dimension, whether the corresponding data value is missing or not. However, in practice, it is quite common that we do not know which dimensions (or positions) have data loss [9]. In other words, the dimensionality of the collected data is lower than its actual dimensionality, and we lose the correspondence between data dimensions and their associated values. This is regarded as dimension incompleteness in this work.
Take a sensor network application as an example. The database usually contains time series data objects, each of which is represented by a sequence of values x_1, x_2, ..., x_n. The dimension information (e.g., the time stamp) associated with data elements is implicitly inferred from the order of data arrival. This schema of data collection and storage is very common in resource-constrained applications, since explicitly maintaining dimension information causes additional costs. Therefore, missing even a single data element destroys the dimension information of the entire data object. Besides, in some applications where dimension information is explicitly maintained, the dimension indicator itself may also be lost, which causes the data to become dimension incomplete. More generally, for data sets containing time series of various lengths, it can be assumed that the data were originally generated with the same dimensionality and then some dimensions were lost. So dimension incomplete data is quite common in practical applications.
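To make the loss of correspondence concrete, the following small Python sketch (our own illustration, not part of the original paper; the variable names are hypothetical) enumerates the candidate position assignments for an observed series once dimension information is gone.

import itertools
import math

# An observed series of n = 3 values whose true dimensionality is m = 5:
# we no longer know which 3 of the 5 original positions they occupied.
observed = (2.0, 9.0, 40.0)
m = 5

# Every size-3 subset of {0, ..., 4} is a possible position assignment,
# so there are C(5, 3) = 10 candidate alignments to consider.
alignments = list(itertools.combinations(range(m), len(observed)))
print(len(alignments), math.comb(m, len(observed)))  # 10 10

# The count grows combinatorially with m and n, e.g. C(30, 27) = 4060
# candidate alignments per data object, which is why enumerating all
# possible recovery cases quickly becomes infeasible.
print(math.comb(30, 27))  # 4060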

The incompleteness of data dimension brings challenges to the similarity query task, as dimension information is essential for existing techniques used to handle uncertain data [6, 10]. Given a query object and a dimension incomplete data object, the similarity measurement between them is fundamental for the similarity query task but becomes impractical because their dimensions do not match. For instance, the widely used L_p-norm distance, L_p(x, y) = (Σ_{i=1}^{n} |x_i − y_i|^p)^{1/p}, cannot be calculated in this case because it does not allow shifting of data dimensions. Other similarity computation methods like DTW [13] and LCSS [3] are mainly proposed to better measure the similarity between two data objects without considering dimension incompleteness. They cannot be used to deal with dimension incomplete data, as their underlying matching strategies are not designed to capture the characteristics of data in the dimension incomplete scenario. One straightforward solution is to consider all possible dimension missing cases and calculate the similarity accordingly, but this procedure may become extremely time consuming. Assume the original dimensionality of the data objects and the query is m, but only n dimensions of a data object are observed (n < m); then there are C_m^n possible dimension combinations to be examined when the similarity is computed. Thus better solutions are needed for this problem.

In this paper, we model the problem of similarity query on dimension incomplete data with a probabilistic framework. Based on our framework, users can give a distance threshold to specify the allowed distance between the query and dimension incomplete data objects, and a probability confidence threshold to specify the requirement that the retrieval results should satisfy the distance condition at least with the given probability. Our query approach is based on the fact that the relationship between the query and dimension incomplete data objects can be inferred. Such relationship information provides helpful guidance for performing the similarity query task. An efficient method is proposed to find lower and upper bounds of the probability that a data object satisfies the query. These bounds can be used to (1) eliminate data objects that are judged as dismissals, and (2) keep qualified ones, in O(n(m − n)^2) time. Furthermore, based on the proposed probability triangle inequality, an approach with time complexity O(m) is introduced to further speed up the similarity query process. Our proposed method is proved to be theoretically correct. Experiments on two real data sets indicate that our method is promising for similarity query on dimension incomplete data.

2 Related work

The problem of missing data values has been well researched in the literature [5, 8, 18, 22]. In this research, the dimensions with missing data are known and their corresponding values are estimated [15]. There are also works that deal with data uncertainty (e.g., [20, 6, 7, 4, 21, 11, 19, 2]), which is related to but different from the dimension incompleteness problem. These works consider the uncertainty of data values and estimate a probability density function (pdf) to model the uncertainty of the values of data elements. For example, it is addressed in [7] that the recorded value is likely to be different from the actual value. In such cases, queries may also be uncertain. These works differ from ours in that they only consider the uncertainty of data values, while we consider the uncertainty of dimension as well.
In [9], missing data elements in symbolic sequences are addressed. In this work, a more general and more challenging problem is studied: we consider real-valued multidimensional data and address the probabilistic query task, both of which differ essentially from [9]. The problem of similarity query on dimension incomplete data cannot be solved by algorithms like Dynamic Time Warping (DTW) [13, 12] or the Longest Common Subsequence distance (LCSS) [3]. DTW and LCSS are designed for a goal different from ours: they match two multidimensional data objects focusing on what they have in common, not on measuring the similarity between objects with missing dimension information.

3 Problem description and analysis

D = {X_1, X_2, ..., X_N} is a database containing multidimensional data. A data object X from D is a real-valued vector (x_1, x_2, ..., x_M), where x_m (1 ≤ m ≤ M) is the data value for the m-th dimension of X. |X| = M denotes the dimensionality of X. D is said to be incomplete if its data objects are allowed to have missing values or dimensions; otherwise, D is complete. In this work, a data object is regarded as dimension incomplete if (a) at least one of its data elements is missing, and (b) the dimensions of the missing data elements cannot be determined. For example, given a complete data object X, if k of its data elements are missing, the resulting dimension incomplete data is of the form X_obs = (x_{n_1}, x_{n_2}, ..., x_{n_{M'}}), where n_j < n_{j+1} and M' = |X| − k. A conventional range query in a multidimensional database is defined as follows: given a database D containing N data objects of M dimensions, an M-dimensional query Q, a query threshold r, and a distance function δ, retrieve all the data objects in D whose distance from Q is less than r:

RangeQuery_δ(D, Q, r) = {X ∈ D | δ(Q, X) < r}   (1)

Apparently, it is not practical to measure the exact similarity between a dimension incomplete data object and the query, because the associated dimensions cannot be well aligned.

Thus the similarity score is uncertain and depends on both the dimension alignment and the values estimated for the missing dimensions. In this work, we address the probabilistic similarity query problem on a dimension incomplete database: to retrieve data objects from the database with high probability of satisfying the input query. This problem can be formulated as follows:

Definition 1 (Probabilistic Similarity Query on Dimension Incomplete Data (PSQ-DID)). Given a database D containing dimension incomplete multidimensional data objects X_obs whose underlying complete versions are denoted by X, a query Q that is complete, a distance threshold r, a confidence threshold c, an imputation method ϕ indicating the distribution of the missing data values, and a distance function δ,

PSQ-DID_{δ,ϕ}(D, Q, r, c) = {X_obs ∈ D | P[δ(Q, X) < r] > c}   (2)

P[δ(Q, X) < r], termed confidence in this paper, indicates the probability that the underlying complete data object X satisfies the requirement of the query. Its calculation depends on both the imputation strategy ϕ and the distance function δ. Consider a dimension incomplete data object X_obs. Without knowing the corresponding complete data object X, we can construct a possible complete version of X_obs by

1. assigning a dimension combination {n_1, ..., n_{|X_mis|}} indicating on which dimensions the data elements are lost;
2. imputing the assigned dimensions according to the imputation strategy ϕ.

In this work, we use X_rv and X_mis to represent the recovery version (i.e., the constructed complete version) and the imputed part, respectively.

Example 1: Given X_obs = (2, 9, 40), assume the complete form of X_obs is known to be of dimensionality 5, and the dimension combination indicates that the data missing positions are {2, 4}. Then we know 2, 9, 40 correspond to the first, third, and fifth dimensions of X. If the specified imputation strategy is that the missing elements follow a certain distribution with given expectation and variance, then X_rv is a random vector (2, x_{i_1}, 9, x_{i_2}, 40) and X_mis = (x_{i_1}, x_{i_2}), where x_{i_1} and x_{i_2} are both random variables following the given distribution.

Obviously, there are C_{|X|}^{|X_mis|} possible dimension combinations for the missing data elements, each of which derives a recovery version X_rv. Since we have no prior knowledge about the dimension combination or the possible recovery result, we assume each recovery result is equally probable. Therefore, we have

P[δ(Q, X) < r] = Σ_{X_rv} P[δ(Q, X_rv) < r] / C_{|X|}^{|X_mis|}   (3)

In Equation 3, if X_rv is generated by imputing random variables that follow a given distribution, δ(Q, X_rv) is a new random variable and P[δ(Q, X_rv) < r] is a real value belonging to [0, 1].

We now discuss the imputation strategy ϕ and the distance function δ in more detail. Without loss of generality, we assume in this paper that all imputed random variables are mutually independent and follow a normal distribution. The mean of each random variable is decided according to the dimension to be imputed: if x_i in X is missing, we impute a random variable with expectation (x_{i−1} + x_{i+1})/2 (if x_{i−1} or x_{i+1} does not exist, the value of the nearest existing data element is used instead). For instance, if we impute data to X_obs = (1, 3, 5) on the first, third, and fourth dimensions of X to form a 6-dimensional data object, the recovery version of X will be (1, 1, 2, 2, 3, 5), where the values on the first, third, and fourth dimensions denote the specified mean values of the imputed random variables.
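The mean-setting policy can be stated directly in code. The following minimal Python sketch is our own reading of the policy (function and variable names are hypothetical), consistent with the worked example above: for each assumed missing dimension, the expectation is the average of the nearest observed elements on both sides, falling back to the single nearest one at the borders.

def imputation_means(x_obs, missing_dims, m):
    # Expectations for the imputed random variables: the mean of the
    # nearest observed neighbors on each side, or the single nearest
    # existing element at the borders.
    missing = set(missing_dims)
    obs_pos = [i for i in range(m) if i not in missing]
    values = dict(zip(obs_pos, x_obs))
    means = {}
    for i in sorted(missing):
        left = [p for p in obs_pos if p < i]
        right = [p for p in obs_pos if p > i]
        if left and right:
            means[i] = (values[left[-1]] + values[right[0]]) / 2.0
        else:  # border: use the nearest existing data element
            means[i] = values[right[0]] if right else values[left[-1]]
    return means

# The example above: X_obs = (1, 3, 5) imputed on dimensions 1, 3, 4
# (0-based positions 0, 2, 3) of a 6-dimensional object.
print(imputation_means((1, 3, 5), {0, 2, 3}, 6))
# {0: 1, 2: 2.0, 3: 2.0}, i.e. recovery means (1, 1, 2, 2, 3, 5).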
In this paper, all random variables in one dimension incomplete data object are assigned the same variance, denoted by σ^2. Specifically, we choose the variance of X_obs as the variance of the imputed random variables. For many multidimensional data sets this strategy is reasonable, since the value of a missing data element tends to be related to its neighboring elements [16], while the variance reflects a property of the whole data object. Other strategies for setting the mean value and variance can also be adopted in our approach; the imputation strategy depends on the specific application scenario and is independent of our method. For instance, we could use the mean value of X_obs as the expectation of the random variables. For the distance function δ, we adopt the Euclidean distance, which is widely used as a similarity metric in the literature. However, the approach proposed in this paper can easily be extended to handle other similarity measurements.

4 Efficient approach for probabilistic similarity query on dimension incomplete data

In this section, we introduce an efficient approach for probabilistic similarity query on dimension incomplete data. Recall from Section 3 that the key point of this task is how to evaluate P[δ(Q, X) < r] efficiently while avoiding the enumeration of all possible cases. In order to speed up the query process, we utilize a gradual refinement search strategy. Specifically, we propose two pruning methods: (1) lower/upper bounds of the confidence, and (2) a probability triangle inequality, both of which are computationally efficient and proved to be correct. The overall framework is shown in Figure 1.

[Figure 1. Overall Query Process]

We first use the probability triangle inequality to evaluate the data objects in the database. In this step, some data objects are judged as true query results and some are filtered out. Next, we use the confidence lower/upper bounds to further evaluate the remaining candidates, from which some are determined to be true query results or true dismissals. Only those data objects that cannot be judged in the former two steps are evaluated by a naive verification algorithm, which is relatively slow but guarantees both completeness and correctness of the query results.

4.1 Bounds of probability confidence

This section provides the definition of the lower and upper bounds of the probability confidence, the proof of their correctness, and an efficient algorithm for calculating them. To find the bounds of the confidence, we treat the missing part and the observed part of the dimension incomplete data separately. Given a query Q and a certain recovery version X_rv of a dimension incomplete data object X_obs, we have

δ^2(Q, X_rv) = δ^2(Q_obs, X_obs) + δ^2(Q_mis, X_mis)   (4)

where Q_obs and Q_mis are the values of Q on the same dimensions as those of X_obs and X_mis, respectively. Since r ≥ 0, we have

P[δ(Q, X_rv) < r] = P[δ^2(Q_mis, X_mis) < r^2 − δ^2(Q_obs, X_obs)]   (5)

Obviously, δ^2(Q_obs, X_obs) is a real value for a given X_rv, while δ^2(Q_mis, X_mis) is a random variable depending on the imputation method. From Q, in total C_{|Q|}^{|X_obs|} incomplete versions with dimensionality |X_obs| can be derived by removing values on some dimensions; these are denoted by Q_obs. Then we can find the lower and upper distance bounds between the observed elements of X and Q:

δ_LBobs(Q, X_obs) = min_{|Q_obs| = |X_obs|} δ(Q_obs, X_obs)   (6)
δ_UBobs(Q, X_obs) = max_{|Q_obs| = |X_obs|} δ(Q_obs, X_obs)   (7)

Similarly, we can find the lower bound and upper bound distances for the missing elements:

δ_LBmis(Q, X_mis) = δ(argmin_{Q_mis} {δ(Q_mis, E(X_mis)) : |Q_mis| = |X_mis|}, X_mis)   (8)
δ_UBmis(Q, X_mis) = δ(argmax_{Q_mis} {δ(Q_mis, E(X_mis)) : |Q_mis| = |X_mis|}, X_mis)   (9)

where E(X_mis) = (µ_1, µ_2, ..., µ_{|X_mis|}) and µ_k is the mean value assigned by the imputation method on the k-th dimension of X_mis.

Example 2: Given a dimension incomplete data object X_obs = (2, 8, 7) and a query Q = (1, 4, 5, 6, 7), δ^2_LBobs(Q, X_obs) is (2 − 1)^2 + (8 − 6)^2 + (7 − 7)^2 = 5, corresponding to the recovery version (2, ?, ?, 8, 7), and δ^2_UBobs(Q, X_obs) is (2 − 1)^2 + (8 − 4)^2 + (7 − 5)^2 = 21, corresponding to the recovery version (2, 8, 7, ?, ?), where ? denotes an imputed random variable. For the imputed random variables X_mis = {x_1, x_2}, according to our imputation policy, E(x_1) and E(x_2) depend on the dimensions to be imputed. Then δ^2_LBmis(Q, X_mis) = (4 − x_1)^2 + (5 − x_2)^2 (with E(x_1) = E(x_2) = 5), corresponding to X_rv = (2, 5, 5, 8, 7), and δ^2_UBmis(Q, X_mis) = (5 − x_1)^2 + (6 − x_2)^2 (with E(x_1) = E(x_2) = 7.5), corresponding to X_rv = (2, 8, 7.5, 7.5, 7).

Based on the distance bounds above, we can find the lower and upper bounds of P[δ(Q, X_rv) < r] according to the following theorem.

Theorem 1 (Confidence Lower and Upper Bounds). Given a query Q, thresholds r and c, and an incomplete multidimensional data object X_obs whose complete form is denoted by X, we have
(a) P[δ(Q, X) < r] ≤ P[δ^2_LBmis(Q, X_mis) + δ^2_LBobs(Q, X_obs) < r^2];
(b) P[δ(Q, X) < r] ≥ P[δ^2_UBmis(Q, X_mis) + δ^2_UBobs(Q, X_obs) < r^2].

Proof. (a) For any recovery version X_rv of X_obs, according to Eq. 5, we have

P[δ^2(Q, X_rv) < r^2] = P[δ^2(Q_mis, X_mis) < r^2 − δ^2(Q_obs, X_obs)]   (10)
According to Eq. 6, we know

δ^2(Q_obs, X_obs) ≥ δ^2_LBobs(Q, X_obs)   (11)

Thus

P[δ^2(Q_mis, X_mis) < r^2 − δ^2_LBobs(Q, X_obs)] ≥ P[δ^2(Q_mis, X_mis) < r^2 − δ^2(Q_obs, X_obs)]   (12)

δ^2(Q_mis, X_mis)/σ^2 obeys a noncentral chi-square distribution with noncentrality parameter λ_Xrv = δ^2(Q_mis, E(X_mis))/σ^2, and δ^2_LBmis(Q, X_mis)/σ^2 also obeys a noncentral chi-square distribution, with noncentrality parameter denoted by λ_LBmis. According to Eq. 8, we know λ_LBmis ≤ λ_Xrv. At the same time,

these two random variables have the same degrees of freedom, |X_mis|. According to the properties of the noncentral chi-square distribution, we know

P[δ^2_LBmis(Q, X_mis) < r^2 − δ^2_LBobs(Q, X_obs)] ≥ P[δ^2(Q_mis, X_mis) < r^2 − δ^2_LBobs(Q, X_obs)]   (13)

Also considering Eq. 12, we get

P[δ^2_LBmis(Q, X_mis) + δ^2_LBobs(Q, X_obs) < r^2] ≥ P[δ^2(Q, X_rv) < r^2]   (14)

(b) The proof is similar to that of (a).

For simplicity, we define:

δ_LB(Q, X) = [δ^2_LBmis(Q, X_mis) + δ^2_LBobs(Q, X_obs)]^{1/2}   (15)
δ_UB(Q, X) = [δ^2_UBmis(Q, X_mis) + δ^2_UBobs(Q, X_obs)]^{1/2}   (16)

According to Theorem 1, P[δ_LB(Q, X) < r] and P[δ_UB(Q, X) < r] are the upper bound and lower bound of P[δ(Q, X) < r], respectively. These two probability bounds can be used for filtering in the query process. Specifically, we can select data objects with P[δ_UB(Q, X) < r] > c as true results and filter out data objects with P[δ_LB(Q, X) < r] ≤ c as true dismissals. The correctness of this pruning process is guaranteed by the above theorem.

To utilize this pruning process, we need efficient algorithms for (1) calculating the probabilities P[δ_LB(Q, X) < r] and P[δ_UB(Q, X) < r], and (2) finding the two distance bounds for the observed part and the two distance bounds for the missing part. The first sub-problem can be solved easily: since δ^2_LBmis(Q, X_mis)/σ^2 and δ^2_UBmis(Q, X_mis)/σ^2 obey noncentral chi-square distributions, these two probabilities can be calculated with the cumulative distribution function (cdf) of the noncentral chi-square distribution or by a table lookup.

Consider the second sub-problem. A naive method to compute any of the four bounds is extremely time-consuming, since we would have to enumerate all C_{|Q|}^{|X_obs|} recovery versions. Below we introduce a dynamic programming algorithm that obtains these four bounds in O(|X_obs|(|Q| − |X_obs|)^2) time. Algorithm 1 calculates δ_LBobs and δ_LBmis. After the algorithm is executed, the minimum element in the 2n-th column of T is δ^2_LBobs(Q, X_obs), and δ_LBmis(Q, X_mis) can be inferred from the assistant array S. In order to calculate δ_UBobs and δ_UBmis, only a small modification is needed: replace min in lines 17 and 21 with max, and replace argmin in line 11 with argmax. Note that the algorithm does not require building the entire table T.

Algorithm 1: Calculate δ_LBobs and δ_LBmis
INPUT: query Q, |Q| = m, and dimension incomplete data object X_obs, |X_obs| = n (0 < n < m).
OUTPUT: δ_LBobs(Q, X_obs) and δ_LBmis(Q, X_mis) (inferred from the assistant array S).
Initialization: Extend X_obs to X' (|X'| = 2n + 1), where
  X'_i = X_obs[1] if i = 1,
  X'_i = X_obs[n] if i = 2n + 1,
  X'_i = X_obs[i/2] if i mod 2 = 0,
  X'_i = (X_obs[(i−1)/2] + X_obs[(i+1)/2])/2 if i mod 2 = 1 and 1 < i < 2n + 1.
Construct two m × (2n + 1) matrices T and S, where element (i, j) of T is initialized to (Q_i − X'_j)^2 and S is an assistant array initialized with (0, 0) in each element.
1: for j = 1 to 2n + 1 do
2:   if j = 1 then
3:     for i = 1 to m − n do
4:       S[i][j] ← (i − 1, 1)
5:       if i > 1 then
6:         T[i][j] ← T[i][j] + T[i − 1][j]
7:       end if
8:     end for
9:   else if j > 2 and j mod 2 = 1 then
10:    for i = (j + 1)/2 + 1 to (j + 1)/2 + m − n − 1 do
11:      p ← argmin_{1 ≤ k ≤ (j + 1)/2} T[i − k][j − 2(k − 1)]
12:      T[i][j] ← T[i][j] + T[i − p][j − 2(p − 1)]
13:      S[i][j] ← (i − p, j − 2(p − 1))
14:    end for
15:  else if j > 2 and j mod 2 = 0 then
16:    for i = j/2 to j/2 + m − n do
17:      T[i][j] ← T[i][j] + min_{(j − 2)/2 ≤ k ≤ i} T[k][j − 2]
18:    end for
19:  end if
20: end for
21: return (min_{n ≤ k ≤ m} T[k][2n])^{1/2}

Thus its computational complexity is O(|X_obs|(|Q| − |X_obs|)^2). Compared with the naive method, which has to enumerate all C_{|Q|}^{|X_obs|} dimension combinations, this achieves a significant improvement.
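To make the two sub-problems above concrete, here is a small Python sketch of ours (it assumes SciPy is available and is not code from the paper). It computes the observed-part bounds of Eqs. 6-7 by the naive enumeration that Algorithm 1 is designed to avoid, so it is only usable for tiny m but handy for checking the dynamic program; and it evaluates a probability bound via the noncentral chi-square cdf, with the noncentrality parameter lam assumed to be precomputed from the minimizing (or maximizing) alignment.

import itertools
from scipy.stats import ncx2  # noncentral chi-square distribution

def bounds_obs_naive(q, x_obs):
    # delta^2_LBobs and delta^2_UBobs (Eqs. 6-7) by enumerating all
    # C(|Q|, |X_obs|) alignments of the observed values with Q.
    d2 = [sum((q[i] - x) ** 2 for i, x in zip(dims, x_obs))
          for dims in itertools.combinations(range(len(q)), len(x_obs))]
    return min(d2), max(d2)

def prob_bound(r, d2_obs_bound, lam, df, sigma2):
    # P[delta^2_mis + d2_obs_bound < r^2], where
    # delta^2_mis / sigma^2 ~ noncentral chi-square(df, lam).
    t = (r ** 2 - d2_obs_bound) / sigma2
    return ncx2.cdf(t, df, lam) if t > 0 else 0.0

# The observed part of Example 2: Q = (1, 4, 5, 6, 7), X_obs = (2, 8, 7).
print(bounds_obs_naive((1, 4, 5, 6, 7), (2, 8, 7)))  # (5, 21)

Algorithm 1 obtains the same two values (together with the matched means needed for the noncentrality parameter) without the exponential enumeration.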
Example 3: Given a query Q = (3, 7, 1, 6, 5) and X_obs = (2, 4, 8), we have X' = (2, 2, 3, 4, 6, 8, 8), where the elements at odd positions are added by the imputation strategy. The initialized T is shown in Figure 2(a). The algorithm starts the calculation from the bottom of the first column and proceeds toward the top right. In Step 1.1, T[1][1] = 1 remains unchanged, and T[2][1] = 25 is replaced with T[2][1] + T[1][1] = 25 + 1 = 26. In the second column, we do nothing.

In Step 1.2, we deal with the third column of T: T[3][3] = 4 is replaced with T[3][3] + min{T[2][3], T[1][1]} = 4 + min{16, 1} = 5. The remaining steps are shown in Figure 2. Finally, from the 6th column in Figure 2(g), we find that δ^2_LBobs(Q, X_obs) is 4, the minimal element in that column. In order to find δ^2_LBmis(Q, X_mis), we first find the minimal value among the top elements of columns 1, 3, 5, 7 of T, i.e., min{26, 5, 1, 10}; T[4][5] = 1 is the minimum. Thus the imputed variable with mean value 6 is matched to the value 6 in Q. Then S[4][5] = (1, 1) indicates that the previous match is in the first row and first column of T, where the corresponding imputed variable has mean value 2 and is matched to the value 3 in Q. Thus δ^2_LBmis(Q, X_mis) = (3 − x_1)^2 + (6 − x_2)^2, where x_1 and x_2 are imputed random variables with mean values 2 and 6, respectively.

[Figure 2. Example of getting δ_LBobs and δ_LBmis: (a) initialized T, (b)-(g) the table T after each step, (h) the assistant array S]

4.2 Probability triangle inequality

This section presents a probability triangle inequality, which is also used for pruning results during the query process.

Theorem 2 (Probability Triangle Inequality). Given a query Q and a complete multidimensional data object R (|Q| = |R|), for a dimension incomplete data object X_obs whose underlying complete version is X, we have:
(a) P[δ(Q, X) < r] < P[δ_LB(R, X) − δ(Q, R) < r];
(b) P[δ(Q, X) < r] > P[δ_UB(R, X) + δ(Q, R) < r].

Proof. (a) From Theorem 1, we have

P[δ_LB(R, X) − δ(Q, R) < r] ≥ P[δ(R, X) < δ(Q, R) + r]   (17)

Thus, for any recovery version X_rv of X, we have

P[δ(R, X_rv) − δ(Q, R) < r] ≤ P[δ_LB(R, X) − δ(Q, R) < r]   (18)

In a metric space the triangle inequality holds, thus

δ(R, X_rv) − δ(Q, R) < δ(Q, X_rv)   (19)

Thus we have

P[δ(Q, X_rv) < r] < P[δ(R, X_rv) − δ(Q, R) < r]   (20)

Therefore,

P[δ_LB(R, X) − δ(Q, R) < r] > P[δ(Q, X) < r]   (21)

(b) The proof is similar to that of (a).

Based on the theorem above, with the help of an assistant data object R, some data objects in the database can be determined to be true results (when P[δ_UB(R, X) + δ(Q, R) < r] ≥ c) or true dismissals (when P[δ_LB(R, X) − δ(Q, R) < r] ≤ c). Since the required dynamic programming computation can be finished in advance, without knowing the query, this evaluation can be done in O(1) time per data object.

Example 4: Given Q = (2, 6, 2, 5, 5), R = (3, 7, 1, 6, 5), X_obs = (2, 4, 8), r = 2, and c = 0.2. Referring to Example 3, we know δ^2_LBobs(R, X_obs) = 4 and δ^2_LBmis(R, X_mis) = (3 − x_1)^2 + (6 − x_2)^2, where x_1 and x_2 are imputed random variables with mean values 2 and 6, respectively. Their variances (σ^2) are both 6.222 (the variance of X_obs). Thus δ^2_LBmis(R, X_mis)/σ^2 obeys a noncentral chi-square distribution with 2 degrees of freedom and noncentrality parameter [(3 − 2)^2 + (6 − 6)^2]/6.222 ≈ 0.16. Then P[δ_LB(R, X) − δ(Q, R) < r] = P{δ^2_LBmis(R, X_mis)/σ^2 < [(r + 2)^2 − 4]/σ^2} = P[δ^2_LBmis(R, X_mis)/σ^2 < 0.324] = 0.1379 < c. Therefore, based on Theorem 2, we know P[δ(Q, X) < r] < c, indicating that X_obs is not a result of query Q.

In order to get P[δ_LB(R, X) − δ(Q, R) < r] and P[δ_UB(R, X) + δ(Q, R) < r] efficiently, two sets of values need to be stored for each (R, X_obs) pair: one is δ_LBobs(R, X_obs) and δ_UBobs(R, X_obs); the other is the noncentrality parameters of the random variables δ^2_LBmis(R, X_mis)/σ^2 and δ^2_UBmis(R, X_mis)/σ^2. Thus the number of assistant data objects controls the tradeoff between query processing time and storage.
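At query time, the triangle-inequality test of Theorem 2 combines only these precomputed per-pair statistics with δ(Q, R). The following Python sketch is our own illustration of that per-object decision (hypothetical names; ncx2 as in the earlier sketch; pre holds the values stored offline for the pair (R, X_obs)):

from math import dist  # Euclidean distance (Python 3.8+)
from scipy.stats import ncx2

def triangle_filter(q, r_assist, r_thr, c, pre):
    # pre: dict with d2_lb_obs, d2_ub_obs, lam_lb, lam_ub, df, sigma2,
    # all precomputed offline for the pair (R, X_obs).
    d_qr = dist(q, r_assist)
    # (a) dismiss X if even the optimistic bound misses the threshold:
    # P[delta_LB(R, X) - delta(Q, R) < r] <= c.
    t = ((r_thr + d_qr) ** 2 - pre["d2_lb_obs"]) / pre["sigma2"]
    p_lb = ncx2.cdf(t, pre["df"], pre["lam_lb"]) if t > 0 else 0.0
    if p_lb <= c:
        return "dismiss"
    # (b) accept X if even the pessimistic bound stays within it:
    # P[delta_UB(R, X) + delta(Q, R) < r] >= c.
    if r_thr > d_qr:
        t = ((r_thr - d_qr) ** 2 - pre["d2_ub_obs"]) / pre["sigma2"]
        p_ub = ncx2.cdf(t, pre["df"], pre["lam_ub"]) if t > 0 else 0.0
        if p_ub >= c:
            return "accept"
    return "undecided"  # fall through to the confidence-bound pruners

Since δ(Q, R) is computed once per assistant object and shared across all data objects, each additional data object costs only the two cdf evaluations.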
4.3 Our similarity query method

Our approach (termed PSQ in this work) for handling the problem of probabilistic similarity query on dimension incomplete data is described in Algorithm 2. In this algorithm, we utilize a gradual refinement search strategy with the aforementioned pruners to speed up the query process. Specifically, we first use the assistant objects in S_R to examine data objects in the database based on the probability triangle inequality. Then the confidence lower and upper bounds are used to further evaluate the remaining candidates. Only those data objects that the former two steps

cannot judge are evaluated by the naive verification algorithm.

Algorithm 2: Probabilistic Similarity Query
INPUT: the dimension incomplete database D, query Q, the set of assistant data objects S_R.
OUTPUT: the result set S_result.
1: for all X in D do
2:   for all R in S_R do
3:     if P[δ_LB(R, X) − δ(Q, R) < r] ≤ c then
4:       discard X
5:       go to the next X in D
6:     else if P[δ_UB(R, X) + δ(Q, R) < r] ≥ c then
7:       add X to S_result
8:       go to the next X in D
9:     end if
10:  end for
11:  if P[δ_LB(Q, X) < r] ≤ c then
12:    discard X
13:  else if P[δ_UB(Q, X) < r] > c then
14:    add X to S_result
15:  else
16:    do naive confidence evaluation
17:  end if
18: end for

Recall from Eq. 3 in Section 3 that the straightforward way for confidence evaluation is to examine all possible recovery versions of the incomplete data. Here we also provide an optimized enumeration process that prunes some cases safely. From Eq. 4, we find that if δ(Q_obs, X_obs) ≥ r, then P[δ^2(Q_mis, X_mis) < r^2 − δ^2(Q_obs, X_obs)] will be 0. Intuitively, if we can judge that δ_LBobs(Q, X_obs) ≥ r, and thus δ(Q_obs, X_obs) ≥ r, there is no need to examine X_obs. Furthermore, according to Eq. 6, for a given query Q and an incomplete data object X_obs (|X_obs| < |Q|), if there is a Q' derived from Q (|Q'| ≥ |X_obs|) that satisfies δ_LBobs(Q', X_obs) ≥ r, then for all Q'_obs derived from Q' (|Q'_obs| ≥ |X_obs|) we have δ_LBobs(Q'_obs, X_obs) ≥ r. Thus we only need to evaluate the part of the C_{|Q|}^{|X_obs|} recovery versions that yields confidence larger than 0, in the following way: enumerate the C_{|Q|}^{|X_obs|} recovery versions with a recursive procedure; for a Q' derived from Q (|Q'| ≥ |X_obs|), if δ_LBobs(Q', X_obs) ≥ r, none of the Q'_obs derived from Q' are evaluated. Due to space limitations, we do not discuss the details of this recursive procedure.

When our algorithm is used, usually a large portion of the time is consumed by the naive verification process. For higher efficiency, we can skip the naive verification process and simply regard the remaining candidates as query results (or dismissals, depending on the precision and recall requirements of the query). Such a strategy is reasonable for applications where the two probability confidence bounds are effective for pruning; in this case, the simplified algorithm does not cause a remarkable decrease in query result quality. Our experimental results in Section 5 justify the effectiveness of this simplified strategy.

5 Experimental study

In this section, we present the experimental results. The goals of our experiments are to (a) evaluate the effectiveness and the efficiency of the overall method for probabilistic similarity query on dimension incomplete data and the various key techniques proposed, (b) study the influence of different parameters and data sets on our method, and (c) compare the performance of our approach with other solutions for handling dimension incomplete data.

5.1 Data sets

Two real data sets are used in our experiments. The first one is the Standard & Poor's 500 index historical stock data. This data set contains stock prices of 541 companies collected over one year. We use the opening price data of each stock, which is a vector of 251 dimensions. Since the final step of the query process is very time-consuming, we need to sample the original data to obtain a lower-dimensional data set. Thus, we construct a new data set with 30 dimensions by segmenting the data in the Standard & Poor's 500 index historical stock data set, resulting in a total of 541 × 8 = 4,328 data objects with 30 dimensions (denoted by S&P500).
The other data set contains 32-dimensional image features extracted from 68,040 images (denoted by IMAGE). For both data sets, the original data objects are complete, and similarity query results on the complete data are used as ground truth for evaluating the precision and recall of our approach. We construct the dimension incomplete data sets by randomly removing some dimensions of each data object. The number of missing data elements is controlled by the missing ratio, which is the ratio of the number of missing dimensions over the original dimensionality. In total, 100 data objects, randomly sampled from the data set, are used as queries.

5.2 Results and analysis

5.2.1 Effectiveness of probabilistic similarity query on dimension incomplete data

The experiments in this section evaluate the effectiveness of our method on various data sets and with various

parameter settings. Figures 3, 4, 5, and 6 show the quality of the query results, measured by precision and recall.

[Figure 3. Query precision on the S&P500 data set: (a) 5%, (b) 10%, (c) 15% missing ratio]
[Figure 4. Query recall on the S&P500 data set: (a) 5%, (b) 10%, (c) 15% missing ratio]
[Figure 5. Query precision on the IMAGE data set: (a) 5%, (b) 10%, (c) 15% missing ratio]
[Figure 6. Query recall on the IMAGE data set: (a) 5%, (b) 10%, (c) 15% missing ratio]

It can be observed from the results that our method (PSQ) achieves satisfactory performance in querying dimension incomplete data. In particular, although both precision and recall decrease as the missing ratio increases, even when 15% of the dimensions are missing our method still achieves high precision and recall on the S&P500 data set. For the image histogram data set, if the distance threshold and the confidence threshold are well chosen, good query quality can also be achieved. This justifies the usefulness of our approach in real applications. We also compare our approach with a simple method, which (1) randomly removes some elements of the query to construct a new query with the same dimensionality as the dimension incomplete candidates in the database, and (2) uses the Euclidean distance to measure whether their distance is lower than the threshold r. From the results, we can see that our method better reflects the distance and thus achieves better query quality. Moreover, precision and recall on the S&P500 data set are higher than those on the image histogram data set. This is due to the intrinsic characteristics of the two data sets: the S&P500 data set has the typical characteristics of time series, with excellent correlation between consecutive data elements, while the image histogram data does not have this property. Therefore, the imputation method used in our experiments fits the S&P500 data set better. This also shows the importance of choosing a suitable imputation method when handling dimension incomplete data.

5.2.2 Effect of the confidence threshold

It can be observed from Figures 3, 4, 5, and 6 that recall decreases and precision increases as the confidence threshold grows. To make this clearer, Figure 7 shows the relationship between the confidence threshold c and precision/recall (missing ratio = 0.1; r = 60 for S&P500). This experiment also indicates that, by setting a proper c, our method is able to achieve both good precision and good recall on real data sets.

[Figure 7. Confidence threshold vs. precision/recall: (a) S&P500 data set, (b) IMAGE data set]

5.2.3 Effectiveness of different pruners

In this section, we study the usefulness of the four pruners proposed in this paper by examining their pruning power, measured by N_definite/N_processed, where N_processed is the number of data objects in the database and N_definite is the number of data objects judged as dismissals or search results by the pruner. Figure 8 shows the pruning power of the probability triangle inequality with various numbers of assistant data objects (c = 0.2). For the S&P500 data, even when only 10 assistant data objects are used, the pruning power is more than 60%. For the image histogram data, in most cases the pruning power of the probability triangle inequality is more than 20% with 20 assistant data objects. The results show that the probability triangle inequality has good pruning power in the query process while involving only

a few assistant data objects. Moreover, the performance improves when more assistant data objects are available; but after the pruning power reaches a certain level, increasing the number of assistant data objects has no significant further impact.

[Figure 8. Pruning power of the probability triangle inequality vs. number of assistant objects, with curves for different distance thresholds r: (a)-(c) S&P500 at 5%, 10%, 15% missing ratio; (d)-(f) IMAGE at 5%, 10%, 15% missing ratio]

We also examined the pruning power of the four pruners proposed in this paper: the probability triangle inequality using the confidence lower bound and the confidence upper bound (denoted by pruner1 and pruner2, respectively), and the confidence lower and upper bounds (denoted by pruner3 and pruner4, respectively). Figure 9 shows the pruning power of each pruner for various r (missing ratio = 10%, c = 0.1, 20 assistant objects). Firstly, this justifies the usefulness of each pruner proposed in this work. Secondly, we find that for the S&P500 data set about 90% of the data in total can be pruned, which means only a small part of the data needs naive verification; for the image histogram data set, in the worst case 50% of the data needs naive verification. Thirdly, the pruning power of these four pruners is influenced significantly by the threshold r. When a smaller r is specified (i.e., the user wants a relatively small number of query results), more data are pruned by the two lower-bound-based pruners, pruner1 and pruner3. In contrast, a larger distance threshold produces larger pruning power for pruner2 and pruner4.

[Figure 9. Pruning power of the four pruners: (a) S&P500 data set (r = 40), (b) IMAGE data set]

Since the time complexity of naive verification is poor, we study whether naive verification is necessary. We try two simplified verification strategies: for data objects that the former four pruners cannot judge, strategy Pos simply outputs them as query results, while strategy Neg, by contrast, judges them as dismissals. Obviously, Pos may produce more false positives, while Neg may produce more false negatives. The query quality of these two strategies and of doing naive verification (denoted by DoN) is shown in Table 1 (c = 0.1).

[Table 1. Comparison of query quality: Neg, Pos, and DoN at missing ratios 5%, 10%, and 15%]

From the results, we find that for the S&P500 data set the precision and recall without naive verification are very close to those with naive verification. For the image histogram data, however, the query quality depends more heavily on naive verification. It can be concluded that, for some data sets, high query quality can be achieved without the slow naive verification process.

5.2.4 Performance analysis

There are mainly three steps in our approach: (1) pruning with the probability triangle inequality; (2) pruning with the confidence lower and upper bounds; (3) naive confidence verification. We test the time costs of these steps using the S&P500 data set on a computer with a 3.0 GHz CPU and 1.0 GB RAM, and average them over all queries. The results are shown in Figure 10.
[Figure 10. Time cost (microseconds) of the three steps (confidence bounds pruning, probability triangle inequality, naive confidence verification) vs. number of missing elements]

In particular, naive confidence verification takes much longer than the other two steps. However, Table 1 in Section 5.2.3 indicates that the naive confidence verification process is not really necessary for the S&P500 data set, which means the efficiency of the overall query processing system can be improved significantly without losing much performance.

6 Conclusions

This paper addresses the similarity query problem on dimension incomplete data, which is of both practical importance and technical challenge. We adopt a probabilistic framework to model this problem. In order to solve it efficiently, an approach is introduced based on the proposed lower/upper confidence bounds and the probability triangle inequality. The proposed methods are proved to be theoretically correct. Given a query Q and a database containing dimension incomplete data objects X_obs, compared with the brute-force method whose time complexity is O(C_{|Q|}^{|X_obs|}), our method achieves a significant improvement: most data objects can be handled in O(|X_obs|(|Q| − |X_obs|)^2) or even O(1) time. Experiments are conducted on two real data sets. The results indicate that (1) our approach achieves satisfactory performance in querying dimension incomplete data, and (2) both the probability triangle inequality and the confidence bounds have good pruning power and improve query efficiency significantly. This verifies that our method is promising for handling dimension incomplete data. Our future work will focus on the following aspects: since a probability triangle inequality holds, we plan to develop an index structure to make the query process faster; besides, it will be interesting and useful to extend our query strategy to other similarity measurements.

Acknowledgment

The work was supported by NSFC and 863 funding 2007AA01Z156.

References

[1] C. C. Aggarwal and S. Parthasarathy. Mining massively incomplete data sets by conceptual reconstruction. In Proceedings of ACM SIGKDD '01, 2001.
[2] O. Benjelloun, A. Das Sarma, A. Halevy, and J. Widom. ULDBs: Databases with uncertainty and lineage. In Proceedings of VLDB '06, 2006.
[3] B. Bollobas, G. Das, D. Gunopulos, and H. Mannila. Time-series similarity problems and well-separated geometric sets. In Proceedings of SCG '97, 1997.
[4] D. Burdick, P. M. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. In Proceedings of VLDB '05, 2005.
[5] G. Canahuate, M. Gibas, and H. Ferhatosmanoglu. Indexing incomplete databases. In Proceedings of EDBT '06, 2006.
[6] R. Cheng, D. V. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In Proceedings of ACM SIGMOD '03, 2003.
[7] R. Cheng and S. Prabhakar. Managing uncertainty in sensor databases. ACM SIGMOD Record, 32(4):41-46, 2003.
[8] D. Gu and Y. Gao. Incremental gradient descent imputation method for missing data in learning classifier systems. In Proceedings of GECCO '05, pages 72-73, 2005.
[9] J. Gu and X. Jin. Similarity search over incomplete symbolic sequences. In Proceedings of DEXA '07, 2007.
[10] M. Hua, J. Pei, W. Zhang, and X. Lin. Ranking queries on uncertain data: a probabilistic threshold approach. In Proceedings of ACM SIGMOD '08, 2008.
[11] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. MCDB: a Monte Carlo approach to managing uncertain data. In Proceedings of ACM SIGMOD '08, 2008.
[12] E. Keogh. Exact indexing of dynamic time warping. In Proceedings of VLDB '02, 2002.
[13] E. Keogh and M. Pazzani. Scaling up dynamic time warping to massive datasets.
In Proceedings of ECML/PKDD '99, 1999.
[14] S. Khanna and W.-C. Tan. On computing functions with uncertainty. In Proceedings of ACM PODS '01, 2001.
[15] K. Lakshminarayan, S. A. Harp, and T. Samad. Imputation of missing data in industrial databases. Applied Intelligence, 1999.
[16] R. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley Series in Probability and Statistics, 1st edition, 1987.
[17] B. C. Ooi, C. H. Goh, and K.-L. Tan. Fast high-dimensional data search in incomplete databases. In Proceedings of VLDB '98, 1998.
[18] R. K. Pearson. The problem of disguised missing data. ACM SIGKDD Explorations Newsletter, 8(1):83-92, 2006.
[19] J. Pei, M. Hua, Y. Tao, and X. Lin. Query answering techniques on uncertain and probabilistic data: tutorial summary. In Proceedings of ACM SIGMOD '08, 2008.
[20] J. Pei, B. Jiang, X. Lin, and Y. Yuan. Probabilistic skylines on uncertain data. In Proceedings of VLDB '07, pages 15-26, 2007.
[21] A. D. Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In Proceedings of ICDE '06, page 7, 2006.
[22] I. Wasito and B. Mirkin. Nearest neighbour approach in the least-squares data imputation algorithms. Information Sciences, 169:1-25, 2005.
[23] D. Williams, X. Liao, Y. Xue, and L. Carin. Incomplete-data classification using logistic regression. In Proceedings of ICML '05, 2005.


More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Recent advances in Time Series Classification

Recent advances in Time Series Classification Distance Shapelet BoW Kernels CCL Recent advances in Time Series Classification Simon Malinowski, LinkMedia Research Team Classification day #3 S. Malinowski Time Series Classification 21/06/17 1 / 55

More information

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT)

Metric Embedding of Task-Specific Similarity. joint work with Trevor Darrell (MIT) Metric Embedding of Task-Specific Similarity Greg Shakhnarovich Brown University joint work with Trevor Darrell (MIT) August 9, 2006 Task-specific similarity A toy example: Task-specific similarity A toy

More information

High-Dimensional Indexing by Distributed Aggregation

High-Dimensional Indexing by Distributed Aggregation High-Dimensional Indexing by Distributed Aggregation Yufei Tao ITEE University of Queensland In this lecture, we will learn a new approach for indexing high-dimensional points. The approach borrows ideas

More information

Fast Nonnegative Matrix Factorization with Rank-one ADMM

Fast Nonnegative Matrix Factorization with Rank-one ADMM Fast Nonnegative Matrix Factorization with Rank-one Dongjin Song, David A. Meyer, Martin Renqiang Min, Department of ECE, UCSD, La Jolla, CA, 9093-0409 dosong@ucsd.edu Department of Mathematics, UCSD,

More information

Gaussian Mixture Distance for Information Retrieval

Gaussian Mixture Distance for Information Retrieval Gaussian Mixture Distance for Information Retrieval X.Q. Li and I. King fxqli, ingg@cse.cuh.edu.h Department of omputer Science & Engineering The hinese University of Hong Kong Shatin, New Territories,

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Tighter Low-rank Approximation via Sampling the Leveraged Element

Tighter Low-rank Approximation via Sampling the Leveraged Element Tighter Low-rank Approximation via Sampling the Leveraged Element Srinadh Bhojanapalli The University of Texas at Austin bsrinadh@utexas.edu Prateek Jain Microsoft Research, India prajain@microsoft.com

More information

Leverage Sparse Information in Predictive Modeling

Leverage Sparse Information in Predictive Modeling Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

Nearest Neighbor Search for Relevance Feedback

Nearest Neighbor Search for Relevance Feedback Nearest Neighbor earch for Relevance Feedbac Jelena Tešić and B.. Manjunath Electrical and Computer Engineering Department University of California, anta Barbara, CA 93-9 {jelena, manj}@ece.ucsb.edu Abstract

More information

Probabilistic Frequent Itemset Mining in Uncertain Databases

Probabilistic Frequent Itemset Mining in Uncertain Databases Proc. 5th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD'9), Paris, France, 29. Probabilistic Frequent Itemset Mining in Uncertain Databases Thomas Bernecker, Hans-Peter Kriegel, Matthias

More information

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018 From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction

More information

The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization

The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization Jia Wang, Shiyan Hu Department of Electrical and Computer Engineering Michigan Technological University Houghton, Michigan

More information

Hybrid particle swarm algorithm for solving nonlinear constraint. optimization problem [5].

Hybrid particle swarm algorithm for solving nonlinear constraint. optimization problem [5]. Hybrid particle swarm algorithm for solving nonlinear constraint optimization problems BINGQIN QIAO, XIAOMING CHANG Computers and Software College Taiyuan University of Technology Department of Economic

More information

Collaborative Filtering. Radek Pelánek

Collaborative Filtering. Radek Pelánek Collaborative Filtering Radek Pelánek 2017 Notes on Lecture the most technical lecture of the course includes some scary looking math, but typically with intuitive interpretation use of standard machine

More information

Nearest Neighbor Search with Keywords in Spatial Databases

Nearest Neighbor Search with Keywords in Spatial Databases 776 Nearest Neighbor Search with Keywords in Spatial Databases 1 Sphurti S. Sao, 2 Dr. Rahila Sheikh 1 M. Tech Student IV Sem, Dept of CSE, RCERT Chandrapur, MH, India 2 Head of Department, Dept of CSE,

More information

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression

Behavioral Data Mining. Lecture 7 Linear and Logistic Regression Behavioral Data Mining Lecture 7 Linear and Logistic Regression Outline Linear Regression Regularization Logistic Regression Stochastic Gradient Fast Stochastic Methods Performance tips Linear Regression

More information

Estimating the Selectivity of tf-idf based Cosine Similarity Predicates

Estimating the Selectivity of tf-idf based Cosine Similarity Predicates Estimating the Selectivity of tf-idf based Cosine Similarity Predicates Sandeep Tata Jignesh M. Patel Department of Electrical Engineering and Computer Science University of Michigan 22 Hayward Street,

More information

An Approach to Classification Based on Fuzzy Association Rules

An Approach to Classification Based on Fuzzy Association Rules An Approach to Classification Based on Fuzzy Association Rules Zuoliang Chen, Guoqing Chen School of Economics and Management, Tsinghua University, Beijing 100084, P. R. China Abstract Classification based

More information

Factor Analysis (FA) Non-negative Matrix Factorization (NMF) CSE Artificial Intelligence Grad Project Dr. Debasis Mitra

Factor Analysis (FA) Non-negative Matrix Factorization (NMF) CSE Artificial Intelligence Grad Project Dr. Debasis Mitra Factor Analysis (FA) Non-negative Matrix Factorization (NMF) CSE 5290 - Artificial Intelligence Grad Project Dr. Debasis Mitra Group 6 Taher Patanwala Zubin Kadva Factor Analysis (FA) 1. Introduction Factor

More information

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Data Analytics Beyond OLAP. Prof. Yanlei Diao Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of

More information

Prediction of Citations for Academic Papers

Prediction of Citations for Academic Papers 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Mahsa Orang Nematollaah Shiri 27th International Conference on Scientific and Statistical Database

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

M chi h n i e n L e L arni n n i g Decision Trees Mac a h c i h n i e n e L e L a e r a ni n ng

M chi h n i e n L e L arni n n i g Decision Trees Mac a h c i h n i e n e L e L a e r a ni n ng 1 Decision Trees 2 Instances Describable by Attribute-Value Pairs Target Function Is Discrete Valued Disjunctive Hypothesis May Be Required Possibly Noisy Training Data Examples Equipment or medical diagnosis

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in

More information

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

Confidence Intervals for the Sample Mean

Confidence Intervals for the Sample Mean Confidence Intervals for the Sample Mean As we saw before, parameter estimators are themselves random variables. If we are going to make decisions based on these uncertain estimators, we would benefit

More information

Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval

Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval Scribes: Ellis Weng, Andrew Owens February 11, 2010 1 Introduction In this lecture, we will introduce our second paradigm for

More information

cient and E ective Similarity Search based on Earth Mover s Distance

cient and E ective Similarity Search based on Earth Mover s Distance 36th International Conference on Very Large Data Bases E cient and E ective Similarity Search over Probabilistic Data based on Earth Mover s Distance Jia Xu 1, Zhenjie Zhang 2, Anthony K.H. Tung 2, Ge

More information

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques

Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Improving Performance of Similarity Measures for Uncertain Time Series using Preprocessing Techniques Mahsa Orang Nematollaah Shiri 27th International Conference on Scientific and Statistical Database

More information

IBM Research Report. A Lower Bound on the Euclidean Distance for Fast Nearest Neighbor Retrieval in High-dimensional Spaces

IBM Research Report. A Lower Bound on the Euclidean Distance for Fast Nearest Neighbor Retrieval in High-dimensional Spaces RC24859 (W0909-03) September 0, 2009 Computer Science IBM Research Report A Lower Bound on the Euclidean Distance for Fast Nearest Neighbor Retrieval in High-dimensional Spaces George Saon, Peder Olsen

More information

Group Pattern Mining Algorithm of Moving Objects Uncertain Trajectories

Group Pattern Mining Algorithm of Moving Objects Uncertain Trajectories INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL ISSN 1841-9836, 10(3):428-440, June, 2015. Group Pattern Mining Algorithm of Moving Objects Uncertain Trajectories S. Wang, L. Wu, F. Zhou, C.

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

DECISION TREE BASED QUALITATIVE ANALYSIS OF OPERATING REGIMES IN INDUSTRIAL PRODUCTION PROCESSES *

DECISION TREE BASED QUALITATIVE ANALYSIS OF OPERATING REGIMES IN INDUSTRIAL PRODUCTION PROCESSES * HUNGARIAN JOURNAL OF INDUSTRIAL CHEMISTRY VESZPRÉM Vol. 35., pp. 95-99 (27) DECISION TREE BASED QUALITATIVE ANALYSIS OF OPERATING REGIMES IN INDUSTRIAL PRODUCTION PROCESSES * T. VARGA, F. SZEIFERT, J.

More information

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig

Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Silviu Homoceanu Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de 14 Indexes for Multimedia Data 14 Indexes for Multimedia

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

Guaranteeing the Accuracy of Association Rules by Statistical Significance

Guaranteeing the Accuracy of Association Rules by Statistical Significance Guaranteeing the Accuracy of Association Rules by Statistical Significance W. Hämäläinen Department of Computer Science, University of Helsinki, Finland Abstract. Association rules are a popular knowledge

More information

Finding Top-k Preferable Products

Finding Top-k Preferable Products JOURNAL OF L A T E X CLASS FILES, VOL. 6, NO., JANUARY 7 Finding Top-k Preferable Products Yu Peng, Raymond Chi-Wing Wong and Qian Wan Abstract The importance of dominance and skyline analysis has been

More information

On Improving the k-means Algorithm to Classify Unclassified Patterns

On Improving the k-means Algorithm to Classify Unclassified Patterns On Improving the k-means Algorithm to Classify Unclassified Patterns Mohamed M. Rizk 1, Safar Mohamed Safar Alghamdi 2 1 Mathematics & Statistics Department, Faculty of Science, Taif University, Taif,

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Advances in Locally Varying Anisotropy With MDS

Advances in Locally Varying Anisotropy With MDS Paper 102, CCG Annual Report 11, 2009 ( 2009) Advances in Locally Varying Anisotropy With MDS J.B. Boisvert and C. V. Deutsch Often, geology displays non-linear features such as veins, channels or folds/faults

More information

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

BAYESIAN DECISION THEORY

BAYESIAN DECISION THEORY Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will

More information

P leiades: Subspace Clustering and Evaluation

P leiades: Subspace Clustering and Evaluation P leiades: Subspace Clustering and Evaluation Ira Assent, Emmanuel Müller, Ralph Krieger, Timm Jansen, and Thomas Seidl Data management and exploration group, RWTH Aachen University, Germany {assent,mueller,krieger,jansen,seidl}@cs.rwth-aachen.de

More information

UAPD: Predicting Urban Anomalies from Spatial-Temporal Data

UAPD: Predicting Urban Anomalies from Spatial-Temporal Data UAPD: Predicting Urban Anomalies from Spatial-Temporal Data Xian Wu, Yuxiao Dong, Chao Huang, Jian Xu, Dong Wang and Nitesh V. Chawla* Department of Computer Science and Engineering University of Notre

More information

Day 5: Generative models, structured classification

Day 5: Generative models, structured classification Day 5: Generative models, structured classification Introduction to Machine Learning Summer School June 18, 2018 - June 29, 2018, Chicago Instructor: Suriya Gunasekar, TTI Chicago 22 June 2018 Linear regression

More information

PROBABILISTIC SKYLINE QUERIES OVER UNCERTAIN MOVING OBJECTS. Xiaofeng Ding, Hai Jin, Hui Xu. Wei Song

PROBABILISTIC SKYLINE QUERIES OVER UNCERTAIN MOVING OBJECTS. Xiaofeng Ding, Hai Jin, Hui Xu. Wei Song Computing and Informatics, Vol. 32, 2013, 987 1012 PROBABILISTIC SKYLINE QUERIES OVER UNCERTAIN MOVING OBJECTS Xiaofeng Ding, Hai Jin, Hui Xu Services Computing Technology and System Lab Cluster and Grid

More information

More on Unsupervised Learning

More on Unsupervised Learning More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data

More information

Evaluation of probabilistic queries over imprecise data in. constantly-evolving environments

Evaluation of probabilistic queries over imprecise data in. constantly-evolving environments Information Systems 32 (2007) 104 130 www.elsevier.com/locate/infosys Evaluation of probabilistic queries over imprecise data in $, $$ constantly-evolving environments Reynold Cheng a,b,, Dmitri V. Kalashnikov

More information