[11] [3] [1], [12] HITS [4] [2] k-means[9] 2 [11] [3] [6], [13] 1 ( ) k n O(n k ) 1 2

Size: px

Start display at page:

Download "[11] [3] [1], [12] HITS [4] [2] k-means[9] 2 [11] [3] [6], [13] 1 ( ) k n O(n k ) 1 2"

Harriet Young
5 years ago
Views:

1 情報処理学会研究報告 1,a) IT Clustering by Clique Enumeration and Data Cleaning like Method Abstract: Recent development on information technology has made bigdata analysis more familiar in research and industrial areas. However, bigdata has big difficulties on diversity, other than its huge size. Data with much diversity is usually composed of many groups each of those has its original feature, thus data analysis from the global structures usually fails to capture the details of the data. To analyze the data correctly, capturing the group structure is important. Existing clustring algorithms and pattern mining algorithms aim to extract the group structures from the data. However, they usually find huge number of solutions, too few groups, many similar groups, or groups with large biased sizes, and often take long computation time. In this paper, we address the graph clustering problem, and discuss what are good graphs in that we can easily capture the group structures. From the discussion, we define a graph class that is a model of noiseless clarified graph. We then propose a data cleaning like method that modifies the given data graph to a clarified graph without breaking the group structures. We also show some mathematical and algorithmic properties for the graph class and the modifying algorithm. Keywords: data cleaning, data analysis, clique enumeration, pattern mining, clustering 1. Introduction Web , Hitotsubashi, Chiyoda, Tokyo , Japan , Ichiban-cho, Uegahara, Nishinomiya, Hyogo , Japan a) uno@nii.ac.jp 1

2 [11] [3] [1], [12] HITS [4] [2] k-means[9] 2 [11] [3] [6], [13] 1 ( ) k n O(n k ) 1 2

3 2. 2 E V ( V 1)/2 U G U δ δ- G v v u v u v v N(v) v d(v) v N(v) N[v] N(v) {v} w u v K d K (v) K v K min v K d K (v) 3. v u K K v u u v (k ) u v 2 u v k- : k <6 k- k- v u N[v] Nu k k- k- (u, v) E N[v] N[u] k k- (k) k 1 n k- O(n k ) : k S K K K u K v S N[v],N[u] N[u] N[v] k u v K K S k 3

4 k k n C k k n k n C k + n k = O(n k ) 3 n/3 O(n k ) k 1 n k- O(n k ) : [10] O(n ) 2 n k- k k =Θ(n) k 2 4 n n 2 /4+3n/2 k- 2 n/2 : G V = {1,...,n} i j i j = i +1 (i, j) G {2i, 2i 1} 2 n/2 u v N[u] N[v] = n 2 G V V 1 = {1, 3, 5,...,n 1} V 2 = {2, 4, 6,...,n} U 0 = V 2 U i = V 1 \{2i 1} {2i} 1 i n/2 U j G U j U j n/2 U j U j G =(V,E ) G V = n +(n/2+1) n/2 =n 2 /4+3n/2 U j (u, v) G (u, v) N[u] N[v] (1) (u, v) E n 2+n/2 (2) u, v V (u, v) E n 2 (3) u U j \ V v U j n (u, v) E (3) u U j \ V v U j n/2 (u, v) E n- G G G V G G 2 n/2 4. k- k- G =(V,E) G k- E = (u, v) N[v] N[u] k u v u v k- k- 1 γk K (γ +1)k/2 G k- : K 2 u v 2 (γk (γ +1)k )=γk k 2 4

5 N[u] N[v] k u v k- K K K (γ +1)/2γ all graphs clarified graphs 3 K γk (γ 1)k/3 δ- 3(1 δ) < ( γ 1 γ )2 k- K : M K M k- K M =(1 δ)γk(γk 1)/2 K v P d K (v) (γ +1)k/2 Q (γ +1)k/2 d K (v) (2γ +1)k/3 p = P q = Q M pq + p(p 1)/2 q 2M ((γ 1)k/2)p = 6M 2p (γ 1)k/3 (γ 1)k M 6M p( (γ 1)k 2p)+p(p 1) 2 = 6pM (γ 1)k 2p2 + p2 2 p 2 = 3 2 p2 + 12M (γ 1)k p 2(γ 1)k X = 12M (γ 1)k 2(γ 1)k p = X/3 M 3p + X =0 X 2 /6 M M M /M X2 /6 M (X 2 /6 (1 δ)((γ 1)k) 2 /2 X 2 < 3(1 δ)((γ 1)k) 2 X < 3(1 δ)(γ 1)k M /M < 1 X < 3(1 δ)(γ 1)k 12M (γ 1)k 2(γ 1)k < 3(1 δ)(γ 1)k 3(1 δ)γ(γk 1) 1 2 < 3(1 δ)(γ 1)k(γ 1) 3(1 δ)γ 2 k< 3(1 δ)(γ 1) 2 k γ 1 3(1 δ) < ( ) 2 γ 3 K γk (γ 1)k/3 3 5/6-γ >6 1/4 /(6 1/4 1) k- K : δ =5/6 1/6 1/4 < (γ 1)/γ γ >6 1/4 /(6 1/4 1) k- k- k- k- N[u] N[v] 1 0 sim(u, v) k- sim sim- sim- sim Jaccard N[u] N[v] θ v u N[u] N[v] 5

6 θ Algorithm DataPolishing SimilarityGraph (G =(V,E): graph, sim:similarity measure, τ:repeat number) 1. for i := 1 to τ 2. E := {(u, v) sim(u, v) =1} 3. E := E 4. end for 5. output G 5. Algorithm for Fast Data Polishing sim(u, v) O( V 2 ) V V : sim(u, v) =1 N[u] N[v] > 0 N[u] N[v] =0 N[u] N[v] N[u] N[v] = N[u] + N[v] N[u] N[v] N[u] \ N[v] = N[u] N[u] N[v] 2 N[u] N[v] u v w N[w] 3 N[u] N[v] N[u] w v N[w] u v N[u] N[v] w N[u] N[w] w N[w] v N[w] v 1 w v N[u] N[v] for each w N[u] do for each v N[w] s.t.v<u, intersection[v] :=intersection[v]+1 end for N[u] N[v] v v L v 0 1 v L L 0 O( V ) v L L O( L ) Algorithm NeighborIntersection (G =(V,E)) 1. intersection[u] :=0foreachu V 2. for each u V do 3. L := 4. for each w N[u] do 5. for each v N[w] s.t.v<udo 6. if intersection[v] =0then insert v to L 7. intersection[v] :=intersection[v]+1 8. end for 9. end for 6

7 10. for each v L output {v,u}; intersection[v] :=0 11. end for O( u V w N[u] {v N[w] v<u} ) u V w N[u] u u w u N[w] u N[w] {v N[w] v>u} = O( N[w] 2 ) Δ O( Δ2 )=O(Δ 2 V ) Web G i α α/i k c β w N[w] cα/w k + β N[w] 2 (cα/w k + β) 2 3 (cα/w k ) 2 + β 2 O( E )+ 3 (cα/w k ) 2 = O( E + α 2 (1/w 2k ) k = 1 O( E + α 2 log E ) k >1 O( E + α 2 ) α O( V 1/2 ) k =1 O( E log E ) O( E ) 4 G =(V,E) i c β cα/i k + β NeighborIntersection k =1 O( E + α 2 log E ) k >1 O( E + α 2 ) cα/i k + β 6. CREST [1] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo, Fast discovery of association rules, Advances in Knowledge Discovery and Data Mining, Chapter 12, AAAI Press / The MIT Press (1996). [2] C. Cortes and V. N. Vapnik, Support-Vector Networks, Machine Learning 20 (1995). [3] M. Girvan and M. E. J. Newman, Community Structure in Social and Biological Networks, Proc. Natl. Acad. Sci. USA 99, pp (2002) [4] P. Judea, Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach, Proceedings of the 7

8 Second National Conference on Artificial Intelligence, pp (1982). [5] C. D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, (2008). [6] B. Goethals, the FIMI repository, (2003). [7] S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Trawling the Web for Emerging Cyber- Communities, Proceedings of the Eighth International World Wide Web Conference, Toronto, Canada, [8] UCI Machine Learning Repository, [9] J. B. MacQueen, Some Methods for Classification and Analysis of Multivariate Observations, Berkeley Symposium on Mathematical Statistics and Probability, pp (1967). [10] K. Makino and T. Uno, New Algorithms for Enumerating All Maximal Cliques, LNCS 3111 (Proc. SWAT 2004), pp (2004). [11] G. Palla, I. Derènyi, I. Farkas, and T. Vicsek, Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society, Nature 435, pp (2005) [12] T. Uno, T. Asai, Y. Uchida, H. Arimura, An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases, LNAI 3245, pp (2004). [13] T.Uno,M.Kiyomi,H.Arimura, LCMver.2:Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets, Proc. IEEE ICDM 04 Workshop FIMI 04 (2004). [14] T. Uno, An Efficient Algorithm for Solving Pseudo Clique Enumeration Problem Algorithmica 56, pp (2010). 8

An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases

An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno, Tatsuya Asai 2 3, Yuzo Uchida 2, and Hiroki Arimura 2 National Institute of Informatics, 2--2, Hitotsubashi,