Latent Dirichlet Allocation in Web Spam Filtering


István Bíró    Jácint Szabó    András A. Benczúr
Data Mining and Web Search Research Group, Informatics Laboratory
Computer and Automation Research Institute of the Hungarian Academy of Sciences
{ibiro, jacint, ...}

ABSTRACT

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a method in information retrieval to model the content and topics of a collection of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique, to supervised web spam classification. We treat the web corpus at site level, creating a bag-of-words document for every site, and run LDA both on the collection of sites labeled as spam and on the collection labeled as non-spam. In this way spam and non-spam topics are created in the training phase. In the test phase we take the union of these topics, and an unseen site is deemed spam if its total spam topic distribution is above a threshold. As far as we know, this is the first web retrieval application of LDA. Compared to other text classification methods, our LDA-based implementation works surprisingly well. We test it on the UK2007-WEBSPAM corpus and reach an absolute improvement of 55% in F-measure over the publicly available feature set of Web Spam Challenge 2008.

Categories and Subject Descriptors

H.3 [Information Systems]: Information Storage and Retrieval; I.2.7 [Computing Methodologies]: Artificial Intelligence - Natural Language Processing

General Terms

Text Analysis, Feature Selection, Document Classification, Document Clustering, Information Retrieval

Keywords

Web content spam, latent Dirichlet allocation, topic distribution

1. INTRODUCTION

This work was supported by the EU FP7 project LiWA - Living Web Archives and by grants OTKA NK 72845, ASTOR NKFP 2/004/05. AIRWeb 2008.

Several topic-based text analysis techniques have been developed in the field of information retrieval. Hofmann [14] introduced probabilistic latent semantic indexing (PLSI), a generative graphical model enhancing latent semantic analysis with a sounder probabilistic model. Though PLSI had promising results, it suffers from two limitations: the number of parameters is linear in the number of documents, and it is not possible to make inference for unseen data. These issues are addressed by latent Dirichlet allocation, developed by Blei, Ng and Jordan [3]. LDA is a fully generative graphical model for analyzing the latent topics of documents. LDA models every topic as a distribution over the terms of the vocabulary, and every document as a distribution over the topics. These distributions are sampled from Dirichlet distributions. Each word of a document is drawn from the term distribution of a topic, which is itself drawn for this word from the topic distribution of the document. Note that by term we mean an element of the vocabulary, and by word we mean a position in the document. Several methods have been developed for making inference in LDA, such as variational expectation maximization [3], expectation propagation [15], and Gibbs sampling [10]. LDA is an intensively studied model, and the experimental results are impressive compared to other known information retrieval techniques. Beyond text retrieval, LDA has several applications, such as entity resolution [2], fraud detection in telecommunication systems [23], image processing [6, 7, 20] and ad-hoc retrieval [21]. To the best of our knowledge, our experiments provide the first application of LDA to web spam filtering, and indeed to web retrieval. In this paper we introduce and apply a slight modification of LDA, called multi-corpus LDA, which works as follows.
Assume we have a semi-supervised document-classification task with m classes, which are some kind of semantic themes. As usual, part of the documents are labeled with the appropriate themes. Run LDA on each of the m sub-corpora consisting of the identically labeled documents, then take the union of the resulting m topic collections, and make inference w.r.t. this aggregate collection of topics for every unseen document d. The fraction of theme-i topics in the topic distribution of d may serve as a measure of the extent to which d belongs to theme i. For a more detailed description, see Subsection 2.1.
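To make the scheme concrete, here is a minimal Python sketch written for this note. The helper callables train_lda and infer_theta stand in for any LDA trainer and inference routine (the paper uses collapsed Gibbs sampling); these names, like everything else in the sketch, are our own assumptions and not part of the paper.

```python
# A minimal sketch of the multi-corpus scheme; train_lda and infer_theta
# are assumed placeholders for an LDA trainer and inference routine.
from typing import Callable, Dict, List, Sequence

def multi_corpus_classify(
    labeled: Dict[str, List[List[str]]],      # theme -> list of bag-of-words documents
    unseen: Sequence[List[str]],              # unseen bag-of-words documents
    topics_per_theme: Dict[str, int],         # theme -> number of topics k
    train_lda: Callable[[List[List[str]], int], List[Dict[str, float]]],
    infer_theta: Callable[[List[str], List[Dict[str, float]]], List[float]],
) -> List[Dict[str, float]]:
    """Train one LDA per theme, pool the topics, and score unseen documents."""
    topic_owner: List[str] = []               # theme label of each pooled topic
    pooled_phi: List[Dict[str, float]] = []   # pooled term distributions
    for theme, docs in labeled.items():
        phi = train_lda(docs, topics_per_theme[theme])  # one LDA per sub-corpus
        pooled_phi.extend(phi)
        topic_owner.extend([theme] * len(phi))
    scores = []
    for doc in unseen:
        theta = infer_theta(doc, pooled_phi)  # topic distribution w.r.t. pooled topics
        per_theme = {t: 0.0 for t in topics_per_theme}
        for z, weight in enumerate(theta):    # fraction of theme-i topics
            per_theme[topic_owner[z]] += weight
        scores.append(per_theme)
    return scores
```

With m = 2 themes (spam and non-spam), the spam component of each returned score is exactly the spam-measure used in the rest of the paper.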

In our experiments we run multi-corpus LDA with 2 classes and thus 2 corpora: spam and non-spam. The inference is performed using Gibbs sampling. The fraction of spam topics in the topic distribution of an unseen document serves as a so-called spam-measure in the decision of being spam or not.

Identifying and preventing spam is cited as one of the top challenges in web search engines in [13, 18]. As all major search engines incorporate anchor text and link analysis algorithms into their ranking schemes, web spam appears in sophisticated forms that manipulate content as well as linkage [11]. In addition to link-based features, spam hunters use a variety of content-based features [8, 16, 9] to detect web spam; a recent measurement of their combination appears in [4]. Content-based spam detection methods already worked well for web spam at the Web Spam Challenge 2007 [5].

1.1 Data set, evaluation and experimental setup

We test the multi-corpus LDA method in combination with the Web Spam Challenge 2008 public features [1] and with the pivoted tf.idf scheme features introduced in [19]. The improvement in F-measure over the public features is 55% with C4.5, and over the pivoted text features it is 18% with Bayes-net. The calculations were performed with the machine learning toolkit Weka [22]. For a detailed explanation, see Section 3.

2. METHOD

First we describe latent Dirichlet allocation [3]. For a detailed elaboration, we refer to Heinrich [12]. We have a vocabulary V consisting of terms, a set T of k topics and n documents of arbitrary length. For every topic z a distribution ϕ_z on V is sampled from Dir(β), where β ∈ R_+^{|V|} is a smoothing parameter. Similarly, for every document d a distribution ϑ_d on T is sampled from Dir(α), where α ∈ R_+^{|T|} is a smoothing parameter. The words of the documents are drawn as follows: for every word position of document d a topic z is drawn from ϑ_d, and then a term is drawn from ϕ_z and filled into the position. LDA can be thought of as a Bayesian network, see Figure 1.

[Figure 1: LDA as a Bayesian network: ϕ ~ Dir(β) for each of the k topics; for each of the n documents, ϑ ~ Dir(α), and for each word position, z ~ ϑ and w ~ ϕ_z.]

One method for making inference in LDA is Gibbs sampling [10]. Gibbs sampling is a Markov chain Monte Carlo algorithm for sampling from a joint distribution p(x), x ∈ R^n, if all conditional distributions p(x_i | x_{-i}) are known, where x_{-i} = (x_1, ..., x_{i-1}, x_{i+1}, ..., x_n). The k-th transition x^{(k)} → x^{(k+1)} of the Markov chain is generated as follows: choose an index 1 ≤ i ≤ n (usually i = k mod n), and let x^{(k+1)} = x^{(k)} everywhere except at index i, where x^{(k+1)}_i is sampled from p(x_i | x^{(k)}_{-i}).

In LDA the goal is to estimate the distribution p(z | w) for z ∈ T^P, w ∈ V^P, where P denotes the set of word positions in the documents. Thus in the Gibbs sampling one has to calculate p(z_i = z | z_{-i}, w) for i ∈ P. This has an efficiently computable closed form (for a deduction, see [12])

    p(z_i = z | z_{-i}, w) ∝ (n_{z,-i}^{t_i} + β_{t_i}) / (n_{z,-i} + Σ_t β_t) · (n_{d,-i}^z + α_z) / (n_{d,-i} + Σ_z α_z).    (1)

Here d is the document of position i, t_i is the actual term in position i, n_z^t is the number of positions with topic z and term t, n_z is the number of positions with topic z, n_d^z is the number of positions with topic z in document d, n_d is the length of document d, and the subscript -i means that position i is excluded from the counts. After a sufficient number of iterations we arrive at a topic assignment sample z. Knowing z, we can estimate ϕ and ϑ as

    ϕ_{z,t} = (n_z^t + β_t) / (n_z + Σ_t β_t)    (2)

and

    ϑ_{d,z} = (n_d^z + α_z) / (n_d + Σ_z α_z).    (3)

For an unseen document d the topic distribution ϑ_d can be estimated exactly as in (3) once we have a sample from its word-topic assignment z.
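Equation (1) translates directly into a collapsed Gibbs sampler over the training corpus. The following NumPy sketch, with symmetric priors and variable names of our choosing, illustrates the update; it is a minimal illustration written for this note, not the GibbsLDA++ code actually used in the experiments.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha, beta, iters=2000, seed=0):
    """Collapsed Gibbs sampling for LDA, per equation (1).
    docs: list of lists of term ids in [0, V); returns (phi, theta) as in (2), (3)."""
    rng = np.random.default_rng(seed)
    n_zt = np.zeros((K, V))            # n_z^t: positions with topic z and term t
    n_z = np.zeros(K)                  # n_z: positions with topic z
    n_dz = np.zeros((len(docs), K))    # n_d^z: topic-z positions in document d
    z_assign = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):     # initialize counts from a random assignment
        for i, t in enumerate(doc):
            z = z_assign[d][i]
            n_zt[z, t] += 1; n_z[z] += 1; n_dz[d, z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, t in enumerate(doc):
                z = z_assign[d][i]     # remove position i from the counts ("-i" in (1))
                n_zt[z, t] -= 1; n_z[z] -= 1; n_dz[d, z] -= 1
                p = ((n_zt[:, t] + beta) / (n_z + V * beta)
                     * (n_dz[d] + alpha))  # 2nd-factor denominator is constant in z
                z = rng.choice(K, p=p / p.sum())
                z_assign[d][i] = z
                n_zt[z, t] += 1; n_z[z] += 1; n_dz[d, z] += 1
    phi = (n_zt + beta) / (n_z[:, None] + V * beta)                     # equation (2)
    theta = (n_dz + alpha) / (n_dz.sum(1, keepdims=True) + K * alpha)   # equation (3)
    return phi, theta
```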
Sampling z for an unseen document d can be performed with a similar method as before, but now only for the positions i in d:

    p(z_i = z | z_{-i}, w) ∝ (ñ_{z,-i}^{t_i} + β_{t_i}) / (ñ_{z,-i} + Σ_t β_t) · (n_{d,-i}^z + α_z) / (n_{d,-i} + Σ_z α_z).    (4)

The notation ñ refers to counts taken over the union of the whole corpus and the document d. We will make use of the observation that the first factor in the product (4) is approximately equal to ϕ_{z,t_i} by (2).

2.1 Multi-corpus LDA

As outlined in the introduction, in the multi-corpus setting we run two distinct LDAs: one on the collection of labeled spam sites with k^(s) topics, called spam topics, and one on the collection of labeled non-spam sites with k^(n) topics, called non-spam topics. The vocabulary is the same for both LDAs. After both inferences have been done, we have term distributions for all k^(s) + k^(n) topics. From now on we think of the obtained term distributions of the unified collection of spam and non-spam topics as if they were estimated from only one presumed corpus. To make inference for an unseen document, we perform Gibbs sampling on this presumed unique distribution using (4). Observe that the ñ terms in the first factor of the product are not known, as the topic assignments of the presumed corpus are also not known. Thus we approximate this first factor

by ϕ_{z,t_i}, and approximate p(z_i = z | z_{-i}, w) by

    p(z_i = z | z_{-i}, w) ∝ ϕ_{z,t_i} · (n_{d,-i}^z + α_z) / (n_{d,-i} + Σ_z α_z),    (5)

which is a computable expression. To distinguish this method from the original Gibbs sampling inference developed in [10], we call it the multi-corpus inference. It is applied only to unseen documents. After a sufficient number of iterations we calculate ϑ_d as in (3), and define the spam-measure of d to be Σ {ϑ_{d,z} : z is a spam topic}. A trivial classification is to classify d as spam if its spam-measure is above a certain threshold.

An appealing property of multi-corpus LDA is that after the inferences have been made for all m corpora (m = 2 in our case), we can calculate the theme distribution for every unseen document without using any advanced classification methods. This is in contrast to, say, using term frequency features in classification tasks, like the pivoted tf.idf scheme features considered in Subsection 2.2. Another advantage is that the spam-measures are really expressive.

The Dirichlet parameter β was chosen to be the constant 0.1 throughout, while α^(s) = 50/k^(s), α^(n) = 50/k^(n), and during multi-corpus inference α was the constant 50/(k^(s) + k^(n)) (these are the default values in GibbsLDA++). We stopped Gibbs sampling after 2000 steps for inference on the training data, and after 1000 steps for the multi-corpus inference for an unseen document. The numbers of topics were chosen to be k^(s) = 2, 5, 10, 20 and k^(n) = 10, 20, 50. Consequently, we performed altogether 12 runs, and thus obtained 12 one-dimensional spam-measures as features. Based on the trivial classification using these one-dimensional features, we selected the 5 best choices, whose F-measure curves are shown in Figure 2, and whose F-measures and ROC values are shown in Table 1. The best result belongs to the choice k^(s) = 10 and k^(n) = 50, a notable improvement of 97% over the baseline F-measure of the public features with the C4.5 classifier (see Table 2). In the rest of the experiments we use only these 5 pairs of topic numbers, called the chosen spam-measures.

2.2 Text features

We also use the pivoted tf.idf scheme features introduced in [19]. For document d and word w this feature is defined as

    (1 + ln(tf_d(w))) / ((1 - s) + s · l(d)/avgl) · ln((N + 1) / df(w)),

where tf_d(w) is the frequency of w in d, s is the slope, chosen to be 0.2 in our case, avgl is the average length of a document, l(d) is the length of d, N is the number of documents and, finally, df(w) is the document frequency of w in the whole corpus. After dropping those words which appear in less than 10% of the documents, we are left with 3500 text features.

3. EXPERIMENTS

The data set we use is the UK2007-WEBSPAM corpus. We kept only the labeled sites, which amount to 203 sites labeled as spam and 3589 labeled as non-spam. We aggregated the words and meta keywords appearing in all pages of a site to form one document per site in a bag-of-words model, that is, only the counts of the words appearing in the document matter and not their order. We kept only the alphanumeric characters and the hyphen, and then removed all words containing a hyphen where this hyphen does not connect two alphabetical words. Next we deleted all stopwords enumerated in the list at manuals/onix/stopwords1.html, and then we used the TreeTagger software from projekte/corplex/treetagger/. After this procedure, the remaining terms formed the vocabulary. We used Phan's GibbsLDA++ C++ code [17] to run LDA, and a modified version of it to run the multi-corpus inference for unseen documents in multi-corpus LDA. We applied 5-fold cross-validation.
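Before turning to the results, here is a sketch of the multi-corpus inference step applied to the test documents: topics are sampled for an unseen document with the pooled ϕ held fixed, as in equation (5), and the spam-measure is then read off ϑ_d. The variable names and the convention that spam topics come first in the pooled matrix are our assumptions.

```python
import numpy as np

def infer_unseen(doc, phi, alpha, iters=1000, seed=0):
    """Approximate Gibbs sampling for an unseen document, per equation (5);
    phi is the (K x V) pooled term-distribution matrix, held fixed."""
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    z_assign = rng.integers(K, size=len(doc))
    n_z = np.bincount(z_assign, minlength=K).astype(float)  # topic counts in the document
    for _ in range(iters):
        for i, t in enumerate(doc):
            n_z[z_assign[i]] -= 1                  # drop position i, as in (5)
            p = phi[:, t] * (n_z + alpha)          # first factor approximated by phi_{z,t_i}
            z_assign[i] = z = rng.choice(K, p=p / p.sum())
            n_z[z] += 1
    return (n_z + alpha) / (len(doc) + K * alpha)  # theta_d as in (3)

def spam_measure(theta, k_spam):
    """Sum of theta over the spam topics; assumes the spam topics come first."""
    return float(theta[:k_spam].sum())
```

A site is then deemed spam when spam_measure(ϑ_d, k^(s)) exceeds the chosen threshold.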
Two LDAs were run on the training spam and non-spam corpora, and then multi-corpus inference was made for the test documents by Gibbs sampling as in (5). Then we calculated the spam-measure for every test site.

[Figure 2: F-measure curves of the 5 spam-measures with the trivial classification.]

Figure 2 indicates that the multi-corpus method is robust to the choice of topic numbers, as the performance does not really change when the topic numbers change. As one can expect, the maximum of such an F-measure curve is approximately k^(s)/k^(n).

[Table 1: F-measures (FMR) and ROC values of the 5 spam-measures with the trivial classification; columns: pair of topic numbers, FMR, ROC.]

We put the 5 chosen spam-measures, the public features and the text features as described in Subsection 2.2 into Weka [22]. The classifiers were C4.5 and Bayes-network. The F-measures and ROC values are shown in Table 2 in the form FMR/ROC.

As the combined text features performed surprisingly well, we tested them with a plain SVM and with a cost-sensitive SVM classifier with cost-ratio 7 (7-SVM), too. In all cases, the addition of the 5 spam-features resulted in a remarkable improvement in F-measure: 18% over the text features with Bayes-net, and 55% over the public feature set with C4.5. Note that though the text features performed very well with SVM, multi-corpus LDA raised the F-measure further with 7-SVM.

[Table 2: FMR/ROC values for different classifiers; rows: lda, public, public & lda, text, text & lda, text & public & lda under C4.5 and Bayes-net, and text, text & lda, text & public & lda under SVM and 7-SVM.]

We also performed the trivial classification for the 5 chosen spam-measures on the UK2006-WEBSPAM corpus. Here we have 2125 sites labeled as spam and 8082 labeled as non-spam. The parameters and the setup were the same as above. The results can be seen in Table 3. Recall that a spam-measure is easy to calculate with multi-corpus inference once the term distributions of the spam and non-spam topics are available; thus the results are quite good for a trivial classification. It is also apparent that multi-corpus LDA is robust to the choice of topic numbers.

[Table 3: F-measures (FMR) and ROC values for UK2006-WEBSPAM; columns: pair of topic numbers, FMR, ROC.]

4. REFERENCES

[1] Web Spam Challenge 2008.
[2] I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. In SIAM International Conference on Data Mining, 2006.
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. Technical report, DELIS - Dynamically Evolving, Large-Scale Information Systems.
[5] G. Cormack. Content-based web spam detection. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2007), Banff, Alberta, Canada, 2007.
[6] P. Elango and K. Jayaraman. Clustering images using the latent Dirichlet allocation model.
[7] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proc. CVPR, 2005.
[8] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics - using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1-6, Paris, France, 2004.
[9] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.
[10] T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228-5235, 2004.
[11] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
[12] G. Heinrich. Parameter estimation for text analysis. Technical report, 2005.
[13] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11-22, 2002.
[14] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50-57, 1999.
[15] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Uncertainty in Artificial Intelligence (UAI), 2002.
[16] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly.
Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83-92, Edinburgh, Scotland, 2006.
[17] X.-H. Phan. GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation. http://gibbslda.sourceforge.net/.
[18] A. Singhal. Challenges in running a commercial search engine. In IBM Search and Collaboration Seminar, IBM Haifa Labs.
[19] A. Singhal, G. Salton, M. Mitra, and C. Buckley. Document length normalization. Information Processing and Management, 32(5):619-633, 1996.
[20] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering objects and their localization in images. In Tenth IEEE International Conference on Computer Vision (ICCV 2005), volume 1, 2005.
[21] X. Wei and W. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.
[22] I. H. Witten and E. Frank. Data Mining: Practical

Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
[23] D. Xing and M. Girolami. Employing latent Dirichlet allocation for fraud detection in telecommunications. Pattern Recognition Letters, 28(13), 2007.
