Using Rank Propagation and Probabilistic Counting for Link-based Spam Detection


1 Using Rank Propagation and Probabilistic Counting for Link-based Spam Detection
Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi and Ricardo Baeza-Yates
1. Università di Roma "La Sapienza", Rome, Italy
2. Yahoo! Research, Barcelona, Spain and Santiago, Chile
May 9th

2 What is on the Web?
Information + Porn + On-line casinos + Free movies + Cheap software + Buy a MBA diploma + Prescription-free drugs + V!-4-gra + Get rich now now now!!!

3 Web spam (keywords + links) Web spam (mostly keywords)

4 Search engine? Fake search engine

5 Problem: normal pages that are spam
Link farms
Single-level farms can be detected by searching for groups of nodes that share their out-links [Gibson et al., 2005]
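The idea of grouping nodes by shared out-links can be illustrated with a minimal sketch. It is a simplification that only groups nodes whose out-link sets are identical; it is not the dense-subgraph method of [Gibson et al., 2005]. The function name, the thresholds and the toy graph are hypothetical.

```python
from collections import defaultdict


def groups_sharing_out_links(out_links, min_group_size=3, min_links=2):
    """Group nodes whose out-link sets are identical.

    out_links: dict mapping node -> iterable of destination nodes.
    Returns a list of sorted node groups, largest group first.
    Both thresholds are illustrative assumptions.
    """
    by_signature = defaultdict(list)
    for node, dests in out_links.items():
        signature = frozenset(dests)
        # Ignore nodes with very few links: sharing one link is not suspicious.
        if len(signature) >= min_links:
            by_signature[signature].append(node)
    groups = [sorted(g) for g in by_signature.values() if len(g) >= min_group_size]
    return sorted(groups, key=len, reverse=True)


# Hypothetical toy graph: three farm pages point to the same two targets.
toy_graph = {
    "farm1": ["target", "hub"],
    "farm2": ["hub", "target"],
    "farm3": ["target", "hub"],
    "blog": ["news", "wiki", "target"],
    "news": ["wiki"],
}
suspicious = groups_sharing_out_links(toy_graph)
```

Exact matching misses farms whose pages differ in a few links, which is why approximate set-similarity techniques are used in practice.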

6 [Fetterly et al., 2004] hypothesized that studying the distribution of statistics about pages could be a good way of detecting spam pages: in a number of these distributions, outlier values are associated with web spam.
Research goal: statistical analysis of link-based spam
Metrics
Graph algorithms: all shortest paths, centrality, betweenness, clustering coefficient...
Streamed algorithms: breadth-first and depth-first search, count of neighbors
Symmetric algorithms: (strongly) connected components, approximate count of neighbors, ..., Linear Rank, HITS, Salsa, ...

7 Test collection
U.K. collection
8.5 million pages downloaded from the .uk domain
5,344 hosts manually classified (6% of the hosts)
Classified entire hosts:
✓ More coverage: sample covers 32% of the pages
✗ A few hosts are mixed: spam and non-spam pages

8 Degree
[Figure: In degree, Normal vs. Spam, δ = .35; x-axis: Number of in links]
[Figure: Out degree, Normal vs. Spam, δ = .28; x-axis: Number of out links]
(δ = max. difference in C.D.F. plot)
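The δ reported on each plot is the maximum difference between the two cumulative distribution functions. A minimal sketch of that computation on two samples follows; the function name and the sample values are hypothetical, and empirical step-function CDFs are assumed.

```python
from bisect import bisect_right


def max_cdf_difference(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    best = 0.0
    # The gap can only change at an observed value, so checking those suffices.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        best = max(best, abs(cdf_a - cdf_b))
    return best


# Hypothetical in-degree samples for the two classes.
normal_in_degree = [1, 2, 3, 4]
spam_in_degree = [3, 4, 5, 6]
delta = max_cdf_difference(normal_in_degree, spam_in_degree)
```

A δ close to 0 means the feature has nearly the same distribution in both classes; a δ close to 1 means the classes are almost perfectly separated by it.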

9 Edge reciprocity
[Figure: Reciprocity of max. PR page, Normal vs. Spam, δ = .35; x-axis: Fraction of reciprocal links]
Assortativity
[Figure: Degree / Degree of neighbors, Normal vs. Spam, δ = .3; x-axis: Degree/Degree ratio of home page]
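Both features on this slide can be computed from an adjacency list. The sketch below is one plausible reading, with two assumptions that the slide does not state: reciprocity is taken over a page's out-links, and "degree" is taken on the undirected version of the graph. All names are hypothetical.

```python
from collections import defaultdict


def edge_reciprocity(node, out_links):
    """Fraction of the node's out-links that are answered by a link back."""
    dests = set(out_links.get(node, ()))
    dests.discard(node)  # ignore self-links
    if not dests:
        return 0.0
    reciprocal = sum(1 for dest in dests if node in out_links.get(dest, ()))
    return reciprocal / len(dests)


def degree_ratio(node, out_links):
    """Degree of the node divided by the average degree of its neighbors.

    Assumption: degree is counted on the undirected version of the graph.
    """
    neighbors = defaultdict(set)
    for src, dests in out_links.items():
        for dest in dests:
            if src != dest:
                neighbors[src].add(dest)
                neighbors[dest].add(src)
    own = neighbors[node]
    if not own:
        return 0.0
    average = sum(len(neighbors[n]) for n in own) / len(own)
    return len(own) / average


# Hypothetical toy graph.
toy_graph = {"a": ["b", "c"], "b": ["a"], "c": ["d"], "d": []}
```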

10 Automatic classifier
All of the following attributes in the home page and the page with the maximum PageRank, plus a binary variable indicating if they are the same page:
- In-degree, out-degree
- Fraction of reciprocal edges
- Degree divided by degree of direct neighbors
- Average and sum of in-degree of out-neighbors
- Average and sum of out-degree of in-neighbors
Decision tree: 72.6% detection rate, with 3.% false positives
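The slides report each classifier by a detection rate and a false-positive rate without defining them. The sketch below assumes the usual reading: detection rate is the fraction of spam hosts flagged as spam, and the false-positive rate is the fraction of normal hosts flagged as spam. The function name and labels are hypothetical.

```python
def detection_and_false_positive_rates(true_labels, predicted_labels, spam="spam"):
    """Return (detection rate, false-positive rate) for a spam classifier.

    Assumed definitions: detection rate = flagged spam / all spam,
    false-positive rate = flagged non-spam / all non-spam.
    """
    pairs = list(zip(true_labels, predicted_labels))
    spam_total = sum(1 for truth, _ in pairs if truth == spam)
    normal_total = len(pairs) - spam_total
    detected = sum(1 for truth, pred in pairs if truth == spam and pred == spam)
    false_pos = sum(1 for truth, pred in pairs if truth != spam and pred == spam)
    return detected / spam_total, false_pos / normal_total


# Hypothetical labels for ten hosts.
truth = ["spam"] * 4 + ["normal"] * 6
predicted = ["spam", "spam", "spam", "normal"] + ["spam"] + ["normal"] * 5
detection, false_positives = detection_and_false_positive_rates(truth, predicted)
```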

11 Let $P_{N \times N}$ be the normalized link matrix of a graph
- Row-normalized
- No sinks
Definition (PageRank): stationary state of $\alpha P + \frac{1-\alpha}{N} \mathbf{1}_{N \times N}$
- Follow links with probability $\alpha$
- Random jump with probability $1-\alpha$
[Figure: Maximum PageRank in the host, Normal vs. Spam, δ = .23; x-axis: Maximum PageRank of the site]
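The stationary state in the definition above can be approximated by power iteration. This is a minimal sketch under the slide's own assumptions (row-normalized matrix, no sinks); the value α = 0.85, the iteration count and the toy graph are assumptions, not values from the slides.

```python
def pagerank(out_links, alpha=0.85, iterations=200):
    """Power iteration for the stationary state of alpha*P + (1-alpha)/N * 1.

    out_links: dict mapping node -> list of destination nodes.
    Every node must have at least one out-link (no sinks).
    """
    nodes = sorted(set(out_links) | {d for ds in out_links.values() for d in ds})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Random jump: every node receives (1 - alpha) / N.
        new_rank = {v: (1.0 - alpha) / n for v in nodes}
        for src in nodes:
            dests = out_links.get(src, [])
            if not dests:
                raise ValueError(f"node {src!r} is a sink; the definition assumes none")
            # Follow links: split alpha * rank[src] evenly among the out-links.
            share = alpha * rank[src] / len(dests)
            for dest in dests:
                new_rank[dest] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: c has no in-links, so it only receives the jump mass.
toy_graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
toy_rank = pagerank(toy_graph)
```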

12 Variance of PageRank
Suggested in [Benczúr et al., 2005]
Variance of PageRank of in-neighbors
[Figure: Stdev. of PR of Neighbors (Home), Normal vs. Spam, δ = .4; x-axis: σ² of the logarithm of PageRank]

13 Automatic classifier
Features: degree-based plus the following in the home page and the page with maximum PageRank:
- PageRank
- In-degree/PageRank
- Out-degree/PageRank
- Standard deviation of PageRank of in-neighbors = σ²
- σ²/PageRank
Plus the PageRank of the home page divided by the PageRank of the page with the maximum PageRank.
Decision tree: 74.4% detection rate, with 2.6% false positives
(Degree-based: 72.6% detection, 3.% false positives)

14 TrustRank [Gyöngyi et al., 2004]
A node with high PageRank, but far away from a core set of trusted nodes, is suspicious
Start from a set of trusted nodes, then do a random walk, returning to the set of trusted nodes with probability $1-\alpha$ at each step
Trusted nodes: data from
[Figure: TrustRank score of home page, Normal vs. Spam, δ = .59]
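The random walk described above can be sketched as a PageRank computation whose jump always lands on the trusted set. This is an illustrative sketch, not the implementation used for the experiments; α = 0.85, the uniform jump over trusted nodes, the handling of dead ends and the toy graph are all assumptions.

```python
def trust_scores(out_links, trusted, alpha=0.85, iterations=200):
    """Random walk that returns to the trusted set with probability 1 - alpha."""
    nodes = sorted(set(out_links) | {d for ds in out_links.values() for d in ds})
    trusted_set = set(trusted) & set(nodes)
    if not trusted_set:
        raise ValueError("at least one trusted node must be in the graph")
    # Assumption: the jump is uniform over the trusted nodes.
    jump = {v: (1.0 / len(trusted_set) if v in trusted_set else 0.0) for v in nodes}
    score = dict(jump)
    for _ in range(iterations):
        new_score = {v: (1.0 - alpha) * jump[v] for v in nodes}
        for src in nodes:
            dests = out_links.get(src, [])
            if dests:
                share = alpha * score[src] / len(dests)
                for dest in dests:
                    new_score[dest] += share
            else:
                # Assumption: a walker at a dead end also returns to the trusted set.
                for v in trusted_set:
                    new_score[v] += alpha * score[src] / len(trusted_set)
        score = new_score
    return score


# Hypothetical toy graph: the spam pair is unreachable from the trusted node.
toy_graph = {
    "good": ["site"],
    "site": ["good"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
toy_trust = trust_scores(toy_graph, trusted={"good"})
```

Pages that cannot be reached from the trusted set keep a score of zero, however many links they exchange among themselves.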

15 TrustRank / PageRank
[Figure: Estimated relative non-spam mass, Normal vs. Spam; x-axis: TrustRank score/PageRank]
Automatic classifier
PageRank attributes, plus the following in the home page and the page with maximum PageRank:
- TrustScore
- TrustScore/PageRank (estimated relative non-spam mass)
- TrustScore/In-degree
Plus the TrustScore in the home page divided by the TrustScore in the page with the maximum PageRank.
Decision tree: 77.3% detection rate, with 3.% false positives
(PageRank-based: 74.4% detection, 2.6% false positives)

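16 Path-based formula for PageRank
Given a path $p = \langle x_1, x_2, \ldots, x_t \rangle$ of length $t = |p|$:
$\mathrm{branching}(p) = \frac{1}{d_1 d_2 \cdots d_{t-1}}$
where $d_i$ are the out-degrees of the members of the path
Explicit formula for PageRank [Newman et al., 2001]:
$r_i(\alpha) = \sum_{p \in \mathrm{Path}(-,i)} \frac{(1-\alpha)\,\alpha^{|p|}}{N}\,\mathrm{branching}(p)$
$\mathrm{Path}(-, i)$ are incoming paths in node $i$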

17 General functional ranking
In general:
$r_i(\alpha) = \sum_{p \in \mathrm{Path}(-,i)} \frac{\mathrm{damping}(|p|)}{N}\,\mathrm{branching}(p)$
There are many choices for $\mathrm{damping}(|p|)$, including simply a linear function that is as good as PageRank in practice [Baeza-Yates et al., 2006]
Reduce the direct contribution of the first levels of links:
$\mathrm{damping}(t) = \begin{cases} 0 & t \le T \\ C\alpha^{t} & t > T \end{cases}$
✓ No extra reading of the graph after PageRank
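The general formula can be evaluated without enumerating paths, by propagating the total branching weight one link at a time and adding damping(t) times that weight at each step. The sketch below makes several assumptions that the slide does not state: path length is counted in links (so the empty path has length 0), the sum is cut off at `max_len`, the graph has no sinks, and C is chosen so that the damping values sum to 1. All names are hypothetical.

```python
def functional_rank(out_links, damping, max_len=200):
    """Approximate r_i = sum over incoming paths of damping(|p|)/N * branching(p)."""
    nodes = sorted(set(out_links) | {d for ds in out_links.values() for d in ds})
    n = len(nodes)
    # mass[v]: sum of branching(p) over all paths of the current length ending at v.
    mass = {v: 1.0 for v in nodes}
    rank = {v: damping(0) / n for v in nodes}
    for t in range(1, max_len + 1):
        new_mass = dict.fromkeys(nodes, 0.0)
        for src in nodes:
            dests = out_links.get(src, [])
            if not dests:
                raise ValueError(f"node {src!r} is a sink; this sketch assumes none")
            share = mass[src] / len(dests)
            for dest in dests:
                new_mass[dest] += share
        mass = new_mass
        for v in nodes:
            rank[v] += damping(t) / n * mass[v]
    return rank


def pagerank_damping(alpha):
    """damping(t) = (1 - alpha) * alpha^t, which gives back PageRank."""
    return lambda t: (1.0 - alpha) * alpha ** t


def truncated_damping(alpha, T):
    """damping(t) = 0 for t <= T, C * alpha^t for t > T."""
    c = (1.0 - alpha) / alpha ** (T + 1)  # assumption: normalizes the sum to 1
    return lambda t: 0.0 if t <= T else c * alpha ** t


toy_graph = {"a": ["b"], "b": ["a"], "c": ["a"]}  # hypothetical
plain = functional_rank(toy_graph, pagerank_damping(0.85))
truncated = functional_rank(toy_graph, truncated_damping(0.85, T=2))
```

With the truncated damping, a page supported only by very short paths loses most of its score, which is what makes the ratio to PageRank informative.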

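18 Truncated PageRank (T=2) / PageRank
[Figure: Truncated PageRank T=2 / PageRank, Normal vs. Spam; x-axis: Truncated PageRank (T=2) / PageRank]
Max. change of Truncated PageRank
[Figure: Maximum change of Truncated PageRank, Normal vs. Spam, δ = .29; x-axis: $\max(\mathrm{TrPR}_{i+1}/\mathrm{TrPR}_{i})$]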

19 Automatic classifier
PageRank attributes, plus the following in the home page and the page with maximum PageRank:
- TruncPR(T = 1...4)
- TruncPR(T = 4) / TruncPR(T = 3)
- TruncPR(T = 3) / TruncPR(T = 2)
- TruncPR(T = 2) / TruncPR(T = 1)
- TruncPR(T = 1...4) / PageRank
- Minimum, maximum and average of: TruncPR(T = i) / TruncPR(T = i - 1)
Plus the TruncPR(T = 1...4) of the home page divided by the same value in the page with the maximum PageRank.
Decision tree: 76.9% detection rate, with 2.5% false positives
(TrustRank-based: 77.3% detection, 3.% false positives)

20 Idea: count supporters at different distances
Number of different nodes at a given distance:
[Figure: Frequency vs. Distance for the .uk and .eu.int collections, with the average distance in clicks marked for each]
High and low-ranked pages are different
[Figure: Number of Nodes vs. Distance for groups of pages by rank percentile]
Areas below the curves are equal if we are in the same strongly-connected component
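For a small graph, the number of different nodes at each distance from a page can be counted exactly with a breadth-first search over reversed links. The sketch below is that exact baseline, useful for checking an approximate counter; the function name and toy graph are hypothetical.

```python
from collections import defaultdict, deque


def supporters_by_distance(out_links, target, max_distance):
    """Count distinct nodes whose shortest path to `target` has length 1, 2, ...

    Returns a list whose entry d - 1 is the count at distance exactly d.
    """
    in_links = defaultdict(list)
    for src, dests in out_links.items():
        for dest in dests:
            in_links[dest].append(src)
    distance = {target: 0}
    counts = [0] * (max_distance + 1)
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if distance[node] == max_distance:
            continue  # do not expand beyond the requested radius
        for src in in_links[node]:
            if src not in distance:
                distance[src] = distance[node] + 1
                counts[distance[src]] += 1
                queue.append(src)
    return counts[1:]


# Hypothetical toy graph: a and b link to t, c links to a, d links to c.
toy_graph = {"a": ["t", "b"], "b": ["t"], "c": ["a"], "d": ["c"]}
toy_counts = supporters_by_distance(toy_graph, "t", max_distance=3)
```

Running one search per page is too expensive on a Web-scale graph, which motivates counting approximately.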

21 Probabilistic counting
[Figure: propagation of bits towards a target page using the OR operation; count bits set to estimate supporters]
Improvement of ANF algorithm [Palmer et al., 2002] based on probabilistic counting [Flajolet and Martin, 1985]
General algorithm
Require: N: number of nodes, d: distance, k: bits
for node: 1...N, bit: 1...k do
    INIT(node, bit)
end for
for distance: 1...d do {Iteration step}
    Aux ← 0_k
    for src: 1...N do {Follow links in the graph}
        for all links from src to dest do
            Aux[dest] ← Aux[dest] OR V[src, ·]
        end for
    end for
    V ← Aux
end for
for node: 1...N do {Estimate supporters}
    Supporters[node] ← ESTIMATE(V[node, ·])
end for
return Supporters

22 Our estimator
Initialize all bits to one with probability $\epsilon$
Estimator: $\mathrm{neighbors}(node) = \log_{(1-\epsilon)}\left(1 - \frac{\mathrm{ones}(node)}{k}\right)$
Adaptive estimation
Repeat the above process for $\epsilon = 1/2, 1/4, 1/8, \ldots$, and look for the transitions from more than $(1 - 1/e)k$ ones to less than $(1 - 1/e)k$ ones.
Convergence
[Figure: Fraction of nodes with estimates (%) vs. Iteration, for d=1 to d=8]
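The bit-propagation scheme and the estimator can be sketched together for a single fixed ε. This is an illustration under stated assumptions, not the authors' implementation: each node keeps its own bits between iterations (so the estimate includes the node itself and everything within the given distance), bit vectors are stored as Python integers, the adaptive search over ε is omitted, and a saturated vector is reported as infinity. All names and parameter values are hypothetical.

```python
import math
import random


def estimate_supporters(out_links, distance, k=64, epsilon=0.05, seed=0):
    """Estimate, for every node, how many nodes reach it within `distance` links."""
    rng = random.Random(seed)
    nodes = sorted(set(out_links) | {d for ds in out_links.values() for d in ds})
    # INIT: each of the k bits of each node is set to one with probability epsilon.
    vectors = {
        v: sum(1 << bit for bit in range(k) if rng.random() < epsilon)
        for v in nodes
    }
    for _ in range(distance):
        # Assumption: a node keeps its own bits, so aux starts as a copy.
        aux = dict(vectors)
        for src, dests in out_links.items():
            for dest in dests:
                aux[dest] |= vectors[src]  # propagate bits along the link with OR
        vectors = aux
    estimates = {}
    for v in nodes:
        ones = bin(vectors[v]).count("1")
        if ones == k:
            estimates[v] = math.inf  # saturated: epsilon is too large for this node
        else:
            # log base (1 - epsilon) of the fraction of bits that are still zero
            estimates[v] = math.log(1.0 - ones / k) / math.log(1.0 - epsilon)
    return estimates


# Hypothetical star graph: 200 leaves all link to one hub.
star = {f"leaf{i}": ["hub"] for i in range(200)}
star_estimates = estimate_supporters(star, distance=1, k=1024, epsilon=1 / 256, seed=1)
```

The estimate is most accurate when roughly a fraction 1 - 1/e of the bits are set, which happens when ε is close to one over the true count; this is the reason for repeating the process with several values of ε.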

23 Error rate
[Figure: Average Relative Error vs. Distance, comparing: Ours 64 bits, epsilon-only estimator; Ours 64 bits, combined estimator; ANF 24 bits × 24 iterations (576 b×i); ANF 24 bits × 48 iterations (1152 b×i)]
Hosts at distance 4
[Figure: Hosts at Distance Exactly 4, Normal vs. Spam, δ = .39; x-axis: $S_4 - S_3$]

24 Minimum change of supporters
[Figure: Minimum change of supporters, Normal vs. Spam, δ = .39; x-axis: $\min(S_2/S_1, S_3/S_2, S_4/S_3)$]
Automatic classifier
PageRank attributes, plus the following in the home page and the page with maximum PageRank:
- Supporters at distance d
- Supporters at distance d / PageRank
- Supporters at i / Supporters at i - 1 (for i = 1..4)
- Minimum, maximum and average of: Supporters at i / Supporters at i - 1 (for i = 1..4)
- (Supporters at i - Supporters at i - 1) / PageRank
Plus the number of supporters at distance d in the home page divided by the same feature in the page with the maximum PageRank.
Decision tree: 78.9% detection rate, with 2.5% false positives
(TruncPR: 76.9% detection, 2.5% false positives)
(TrustRank: 77.7% detection, 3.% false positives)

25 Summary of classifiers
Metrics                       Detection rate   False positives
Degree (D)                    72.6%            3.%
D + PageRank (P)              74.4%            2.6%
D + P + TrustRank             %                3.%
D + P + Trunc.                76.9%            2.5%
D + P + Est. of Supporters    78.9%            2.5%
All attributes                8.4%             2.8%
All attributes (more rules)   8.8%             .%
Comparison
Content-based analysis [Ntoulas et al., 2006] has shown 86.2% detection rate with 2.2% false positives
Single-attribute classifier with TrustRank [Gyöngyi et al., 2004]: 5.% detection rate, 3.4% error in our sample
SpamRank [Benczúr et al., 2005] reports similar detection rates

26 Top 10 metrics
1. Binary variable indicating if homepage is the page with maximum PageRank of the site
2. Edge reciprocity
3. Different hosts at distance 4
4. Different hosts at distance 3
5. Minimum change of supporters (different hosts)
6. Different hosts at distance 2
7. Truncated PageRank (T=1) / PageRank
8. TrustRank score divided by PageRank
9. Different hosts at distance 1
10. Truncated PageRank (T=2) / PageRank
✓ Link-based statistics to detect 8% of spam
✗ No magic bullet in link analysis
✗ Precision still low compared to spam filters
✓ Measure both home page and max. PageRank page
✓ Host-based counts are important
Next step: combine link analysis and content analysis

27 Thank you!
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006). Generalizing PageRank: Damping functions for link-based ranking algorithms. In Proceedings of SIGIR, Seattle, Washington, USA. ACM Press.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. (2006). Using rank propagation and probabilistic counting for link-based spam detection. Technical report, DELIS (Dynamically Evolving, Large-Scale Information Systems).
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. (2005). SpamRank: fully automatic link spam detection. In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web, Chiba, Japan.

28 Fetterly, D., Manasse, M., and Najork, M. (2004). Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the Seventh Workshop on the Web and Databases (WebDB), pages 1-6, Paris, France.
Flajolet, P. and Martin, N. G. (1985). Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2).
Gibson, D., Kumar, R., and Tomkins, A. (2005). Discovering large dense subgraphs in massive graphs. In VLDB '05: Proceedings of the 31st International Conference on Very Large Data Bases. VLDB Endowment.
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004). Combating web spam with TrustRank. In Proceedings of the Thirtieth International Conference on Very Large Data Bases (VLDB), Toronto, Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001). Random graphs with arbitrary degree distributions and their applications. Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006). Detecting spam web pages through content analysis. To appear in Proceedings of the World Wide Web Conference, Edinburgh, Scotland.

29 Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002). ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 81-90, New York, NY, USA. ACM Press.
