Latent Dirichlet Allocation in Web Spam Filtering


István Bíró    Jácint Szabó    András A. Benczúr
Data Mining and Web Search Research Group, Informatics Laboratory
Computer and Automation Research Institute of the Hungarian Academy of Sciences
{ibiro, jacint, ...}

ABSTRACT

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a method in information retrieval to model the content and topics of a collection of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique, to supervised web spam classification. We treat the web corpus at site level, creating a bag-of-words document for every site, and run LDA both on the collection of sites labeled as spam and on the collection labeled as non-spam. In this way spam and non-spam topics are created in the training phase. In the test phase we take the union of these topics, and an unseen site is deemed spam if its total spam topic distribution is above a threshold. As far as we know, this is the first web retrieval application of LDA. Compared to other text classification methods, our LDA-based implementation works surprisingly well. We test it on the UK2007-WEBSPAM corpus and reach an absolute improvement of 55% in F-measure over the publicly available feature set of Web Spam Challenge 2008.

Categories and Subject Descriptors

H.3 [Information Systems]: Information Storage and Retrieval; I.2.7 [Computing Methodologies]: Artificial Intelligence - Natural Language Processing

General Terms

Text Analysis, Feature Selection, Document Classification, Document Clustering, Information Retrieval

Keywords

Web content spam, latent Dirichlet allocation, topic distribution

1. INTRODUCTION

This work was supported by the EU FP7 project LiWA - Living Web Archives and by grants OTKA NK 72845, ASTOR NKFP 2/004/05. AIRWeb 2008.

Several topic-based text analysis techniques have been developed in the field of information retrieval. Hofmann [14] introduced probabilistic latent semantic indexing (PLSI), a generative graphical model enhancing latent semantic analysis with a sounder probabilistic model. Though PLSI had promising results, it suffers from two limitations: the number of parameters is linear in the number of documents, and it is not possible to make inference for unseen data. These issues are addressed by latent Dirichlet allocation, developed by Blei, Ng and Jordan [3]. LDA is a fully generative graphical model for analyzing the latent topics of documents. LDA models every topic as a distribution over the terms of the vocabulary, and every document as a distribution over the topics. These distributions are sampled from Dirichlet distributions. Each word of a document is drawn from the term distribution of a topic, which is itself drawn for this word from the topic distribution of the document. Note that by term we mean an element of the vocabulary, and by word we mean a position in the document. Several methods have been developed for making inference in LDA, such as variational expectation maximization [3], expectation propagation [15], and Gibbs sampling [10]. LDA is an intensively studied model, and the experimental results are impressive compared to other known information retrieval techniques. Beyond text retrieval, LDA has several applications, such as entity resolution [2], fraud detection in telecommunication systems [23], image processing [6, 7, 20] and ad-hoc retrieval [21]. To the best of our knowledge, our experiments provide the first application of LDA to web spam filtering, and indeed to web retrieval. In this paper we introduce and apply a slight modification of LDA, called multi-corpus LDA, which works as follows.
Assume we have a semi-supervised document-classification task with m classes, which are some kind of semantic themes. As usual, part of the documents are labeled with the appropriate themes. Run LDA on each of the m sub-corpora consisting of the identically labeled documents, then take the union of the resulting m topic collections, and make inference w.r.t. this aggregate collection of topics for every unseen document d. The fraction of theme-i topics in the topic distribution of d may serve as a measure of the extent to which d belongs to theme i. For a more detailed description, see Subsection 2.1.
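To make the scheme concrete, here is a minimal Python sketch written for this note. The helper callables train_lda and infer_theta stand in for any LDA trainer and inference routine (the paper uses collapsed Gibbs sampling); these names, like everything else in the sketch, are our own assumptions and not part of the paper.

```python
# A minimal sketch of the multi-corpus scheme; train_lda and infer_theta
# are assumed placeholders for an LDA trainer and inference routine.
from typing import Callable, Dict, List, Sequence

def multi_corpus_classify(
    labeled: Dict[str, List[List[str]]],      # theme -> list of bag-of-words documents
    unseen: Sequence[List[str]],              # unseen bag-of-words documents
    topics_per_theme: Dict[str, int],         # theme -> number of topics k
    train_lda: Callable[[List[List[str]], int], List[Dict[str, float]]],
    infer_theta: Callable[[List[str], List[Dict[str, float]]], List[float]],
) -> List[Dict[str, float]]:
    """Train one LDA per theme, pool the topics, and score unseen documents."""
    topic_owner: List[str] = []               # theme label of each pooled topic
    pooled_phi: List[Dict[str, float]] = []   # pooled term distributions
    for theme, docs in labeled.items():
        phi = train_lda(docs, topics_per_theme[theme])  # one LDA per sub-corpus
        pooled_phi.extend(phi)
        topic_owner.extend([theme] * len(phi))
    scores = []
    for doc in unseen:
        theta = infer_theta(doc, pooled_phi)  # topic distribution w.r.t. pooled topics
        per_theme = {t: 0.0 for t in topics_per_theme}
        for z, weight in enumerate(theta):    # fraction of theme-i topics
            per_theme[topic_owner[z]] += weight
        scores.append(per_theme)
    return scores
```

With m = 2 themes (spam and non-spam), the spam component of each returned score is exactly the spam-measure used in the rest of the paper.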

In our experiments we run multi-corpus LDA with 2 classes and thus 2 corpora: spam and non-spam. The inference is performed using Gibbs sampling. The fraction of spam topics in the topic distribution of an unseen document serves as a so-called spam-measure in the decision of being spam or not.

Identifying and preventing spam is cited as one of the top challenges in web search engines in [13, 18]. As all major search engines incorporate anchor text and link analysis algorithms into their ranking schemes, web spam appears in sophisticated forms that manipulate content as well as linkage [11]. In addition to link-based features, spam hunters use a variety of content-based features [8, 16, 9] to detect web spam; a recent measurement of their combination appears in [4]. Content-based spam detection methods already worked well for web spam at the Web Spam Challenge 2007 [5].

1.1 Data set, evaluation and experimental setup

We test the multi-corpus LDA method in combination with the Web Spam Challenge 2008 public features [1] and with the pivoted tf.idf scheme features introduced in [19]. The improvement in F-measure over the public features is 55% with C4.5, and over the pivoted text features it is 18% with Bayes-net. The calculations were performed with the machine learning toolkit Weka [22]. For a detailed explanation, see Section 3.

2. METHOD

First we describe latent Dirichlet allocation [3]. For a detailed elaboration, we refer to Heinrich [12]. We have a vocabulary V consisting of terms, a set T of k topics and n documents of arbitrary length. For every topic z a distribution ϕ_z on V is sampled from Dir(β), where β ∈ R_+^{|V|} is a smoothing parameter. Similarly, for every document d a distribution ϑ_d on T is sampled from Dir(α), where α ∈ R_+^{|T|} is a smoothing parameter. The words of the documents are drawn as follows: for every word position of document d a topic z is drawn from ϑ_d, and then a term is drawn from ϕ_z and filled into the position. LDA can be thought of as a Bayesian network, see Figure 1.

[Figure 1: LDA as a Bayesian network: ϕ ~ Dir(β) for each of the k topics; for each of the n documents, ϑ ~ Dir(α), and for each word position, z ~ ϑ and w ~ ϕ_z.]

One method for making inference in LDA is Gibbs sampling [10]. Gibbs sampling is a Markov chain Monte Carlo algorithm for sampling from a joint distribution p(x), x ∈ R^n, if all conditional distributions p(x_i | x_{-i}) are known, where x_{-i} = (x_1, ..., x_{i-1}, x_{i+1}, ..., x_n). The k-th transition x^{(k)} → x^{(k+1)} of the Markov chain is generated as follows: choose an index 1 ≤ i ≤ n (usually i = k mod n), and let x^{(k+1)} = x^{(k)} everywhere except at index i, where x^{(k+1)}_i is sampled from p(x_i | x^{(k)}_{-i}).

In LDA the goal is to estimate the distribution p(z | w) for z ∈ T^P, w ∈ V^P, where P denotes the set of word positions in the documents. Thus in the Gibbs sampling one has to calculate p(z_i = z | z_{-i}, w) for i ∈ P. This has an efficiently computable closed form (for a deduction, see [12])

    p(z_i = z | z_{-i}, w) ∝ (n_{z,-i}^{t_i} + β_{t_i}) / (n_{z,-i} + Σ_t β_t) · (n_{d,-i}^z + α_z) / (n_{d,-i} + Σ_z α_z).    (1)

Here d is the document of position i, t_i is the actual term in position i, n_z^t is the number of positions with topic z and term t, n_z is the number of positions with topic z, n_d^z is the number of positions with topic z in document d, n_d is the length of document d, and the subscript -i means that position i is excluded from the counts. After a sufficient number of iterations we arrive at a topic assignment sample z. Knowing z, we can estimate ϕ and ϑ as

    ϕ_{z,t} = (n_z^t + β_t) / (n_z + Σ_t β_t)    (2)

and

    ϑ_{d,z} = (n_d^z + α_z) / (n_d + Σ_z α_z).    (3)

For an unseen document d the topic distribution ϑ_d can be estimated exactly as in (3) once we have a sample from its word-topic assignment z.
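Equation (1) translates directly into a collapsed Gibbs sampler over the training corpus. The following NumPy sketch, with symmetric priors and variable names of our choosing, illustrates the update; it is a minimal illustration written for this note, not the GibbsLDA++ code actually used in the experiments.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha, beta, iters=2000, seed=0):
    """Collapsed Gibbs sampling for LDA, per equation (1).
    docs: list of lists of term ids in [0, V); returns (phi, theta) as in (2), (3)."""
    rng = np.random.default_rng(seed)
    n_zt = np.zeros((K, V))            # n_z^t: positions with topic z and term t
    n_z = np.zeros(K)                  # n_z: positions with topic z
    n_dz = np.zeros((len(docs), K))    # n_d^z: topic-z positions in document d
    z_assign = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):     # initialize counts from a random assignment
        for i, t in enumerate(doc):
            z = z_assign[d][i]
            n_zt[z, t] += 1; n_z[z] += 1; n_dz[d, z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, t in enumerate(doc):
                z = z_assign[d][i]     # remove position i from the counts ("-i" in (1))
                n_zt[z, t] -= 1; n_z[z] -= 1; n_dz[d, z] -= 1
                p = ((n_zt[:, t] + beta) / (n_z + V * beta)
                     * (n_dz[d] + alpha))  # 2nd-factor denominator is constant in z
                z = rng.choice(K, p=p / p.sum())
                z_assign[d][i] = z
                n_zt[z, t] += 1; n_z[z] += 1; n_dz[d, z] += 1
    phi = (n_zt + beta) / (n_z[:, None] + V * beta)                     # equation (2)
    theta = (n_dz + alpha) / (n_dz.sum(1, keepdims=True) + K * alpha)   # equation (3)
    return phi, theta
```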
Sampling z for an unseen document d can be performed with a similar method as before, but now only for the positions i in d:

    p(z_i = z | z_{-i}, w) ∝ (ñ_{z,-i}^{t_i} + β_{t_i}) / (ñ_{z,-i} + Σ_t β_t) · (n_{d,-i}^z + α_z) / (n_{d,-i} + Σ_z α_z).    (4)

The notation ñ refers to counts taken over the union of the whole corpus and the document d. We will make use of the observation that the first factor in the product (4) is approximately equal to ϕ_{z,t_i} by (2).

2.1 Multi-corpus LDA

As outlined in the introduction, in the multi-corpus setting we run two distinct LDAs: one on the collection of labeled spam sites with k^(s) topics, called spam topics, and one on the collection of labeled non-spam sites with k^(n) topics, called non-spam topics. The vocabulary is the same for both LDAs. After both inferences have been done, we have term distributions for all k^(s) + k^(n) topics. From now on we think of the obtained term distributions of the unified collection of spam and non-spam topics as if they were estimated from only one presumed corpus. To make inference for an unseen document, we perform Gibbs sampling on this presumed unique distribution using (4). Observe that the ñ terms in the first factor of the product are not known, as the topic assignments of the presumed corpus are also not known. Thus we approximate this first factor

by ϕ_{z,t_i}, and approximate p(z_i = z | z_{-i}, w) by

    p(z_i = z | z_{-i}, w) ∝ ϕ_{z,t_i} · (n_{d,-i}^z + α_z) / (n_{d,-i} + Σ_z α_z),    (5)

which is a computable expression. To distinguish this method from the original Gibbs sampling inference developed in [10], we call it the multi-corpus inference. It is applied only to unseen documents. After a sufficient number of iterations we calculate ϑ_d as in (3), and define the spam-measure of d to be Σ {ϑ_{d,z} : z is a spam topic}. A trivial classification is to classify d as spam if its spam-measure is above a certain threshold.

An appealing property of multi-corpus LDA is that after the inferences have been made for all m corpora (m = 2 in our case), we can calculate the theme distribution for every unseen document without using any advanced classification methods. This is in contrast to, say, using term frequency features in classification tasks, like the pivoted tf.idf scheme features considered in Subsection 2.2. Another advantage is that the spam-measures are really expressive.

The Dirichlet parameter β was chosen to be the constant 0.1 throughout, while α^(s) = 50/k^(s), α^(n) = 50/k^(n), and during multi-corpus inference α was the constant 50/(k^(s) + k^(n)) (these are the default values in GibbsLDA++). We stopped Gibbs sampling after 2000 steps for inference on the training data, and after 1000 steps for the multi-corpus inference for an unseen document. The numbers of topics were chosen to be k^(s) = 2, 5, 10, 20 and k^(n) = 10, 20, 50. Consequently, we performed altogether 12 runs, and thus obtained 12 one-dimensional spam-measures as features. Based on the trivial classification using these one-dimensional features, we selected the 5 best choices, whose F-measure curves are shown in Figure 2, and whose F-measures and ROC values are shown in Table 1. The best result belongs to the choice k^(s) = 10 and k^(n) = 50, a notable improvement of 97% over the baseline F-measure of the public features with the C4.5 classifier (see Table 2). In the rest of the experiments we use only these 5 pairs of topic numbers, called the chosen spam-measures.

2.2 Text features

We also use the pivoted tf.idf scheme features introduced in [19]. For document d and word w this feature is defined as

    (1 + ln(tf_d(w))) / ((1 - s) + s · l(d)/avgl) · ln((N + 1) / df(w)),

where tf_d(w) is the frequency of w in d, s is the slope, chosen to be 0.2 in our case, avgl is the average length of a document, l(d) is the length of d, N is the number of documents and, finally, df(w) is the document frequency of w in the whole corpus. After dropping those words which appear in less than 10% of the documents, we are left with 3500 text features.

3. EXPERIMENTS

The data set we use is the UK2007-WEBSPAM corpus. We kept only the labeled sites, which amount to 203 sites labeled as spam and 3589 labeled as non-spam. We aggregated the words and meta keywords appearing in all pages of a site to form one document per site in a bag-of-words model, that is, only the counts of the words appearing in the document matter and not their order. We kept only the alphanumeric characters and the hyphen, and then removed all words containing a hyphen where this hyphen does not connect two alphabetical words. Next we deleted all stopwords enumerated in the list at manuals/onix/stopwords1.html, and then we used the TreeTagger software from projekte/corplex/treetagger/. After this procedure, the remaining terms formed the vocabulary. We used Phan's GibbsLDA++ C++ code [17] to run LDA, and a modified version of it to run the multi-corpus inference for unseen documents in multi-corpus LDA. We applied 5-fold cross-validation.
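Before turning to the results, here is a sketch of the multi-corpus inference step applied to the test documents: topics are sampled for an unseen document with the pooled ϕ held fixed, as in equation (5), and the spam-measure is then read off ϑ_d. The variable names and the convention that spam topics come first in the pooled matrix are our assumptions.

```python
import numpy as np

def infer_unseen(doc, phi, alpha, iters=1000, seed=0):
    """Approximate Gibbs sampling for an unseen document, per equation (5);
    phi is the (K x V) pooled term-distribution matrix, held fixed."""
    rng = np.random.default_rng(seed)
    K = phi.shape[0]
    z_assign = rng.integers(K, size=len(doc))
    n_z = np.bincount(z_assign, minlength=K).astype(float)  # topic counts in the document
    for _ in range(iters):
        for i, t in enumerate(doc):
            n_z[z_assign[i]] -= 1                  # drop position i, as in (5)
            p = phi[:, t] * (n_z + alpha)          # first factor approximated by phi_{z,t_i}
            z_assign[i] = z = rng.choice(K, p=p / p.sum())
            n_z[z] += 1
    return (n_z + alpha) / (len(doc) + K * alpha)  # theta_d as in (3)

def spam_measure(theta, k_spam):
    """Sum of theta over the spam topics; assumes the spam topics come first."""
    return float(theta[:k_spam].sum())
```

A site is then deemed spam when spam_measure(ϑ_d, k^(s)) exceeds the chosen threshold.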
Two LDAs were run on the training spam and non-spam corpora, and then multi-corpus inference was made for the test documents by Gibbs sampling as in (5). Then we calculated the spam-measure for every test site.

[Figure 2: F-measure curves of the 5 spam-measures with the trivial classification.]

Figure 2 indicates that the multi-corpus method is robust to the choice of topic numbers, as the performance does not really change when the topic numbers change. As one can expect, the maximum of such an F-measure curve is approximately k^(s)/k^(n).

[Table 1: F-measures (FMR) and ROC values of the 5 spam-measures with the trivial classification; columns: pair of topic numbers, FMR, ROC.]

We put the 5 chosen spam-measures, the public features and the text features as described in Subsection 2.2 into Weka [22]. The classifiers were C4.5 and Bayes-network. The F-measures and ROC values are shown in Table 2 in the form FMR/ROC.

As the combined text features performed surprisingly well, we tested them with a plain SVM and with a cost-sensitive SVM classifier with cost-ratio 7 (7-SVM), too. In all cases, the addition of the 5 spam-features resulted in a remarkable improvement in F-measure: 18% over the text features with Bayes-net, and 55% over the public feature set with C4.5. Note that though the text features performed very well with SVM, multi-corpus LDA raised the F-measure further with 7-SVM.

[Table 2: FMR/ROC values for different classifiers; rows: lda, public, public & lda, text, text & lda, text & public & lda under C4.5 and Bayes-net, and text, text & lda, text & public & lda under SVM and 7-SVM.]

We also performed the trivial classification for the 5 chosen spam-measures on the UK2006-WEBSPAM corpus. Here we have 2125 sites labeled as spam and 8082 labeled as non-spam. The parameters and the setup were the same as above. The results can be seen in Table 3. Recall that a spam-measure is easy to calculate with multi-corpus inference once the term distributions of the spam and non-spam topics are available; thus the results are quite good for a trivial classification. It is also apparent that multi-corpus LDA is robust to the choice of topic numbers.

[Table 3: F-measures (FMR) and ROC values for UK2006-WEBSPAM; columns: pair of topic numbers, FMR, ROC.]

4. REFERENCES

[1] Web Spam Challenge 2008.
[2] I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. In SIAM International Conference on Data Mining, 2006.
[3] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[4] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. Technical report, DELIS - Dynamically Evolving, Large-Scale Information Systems.
[5] G. Cormack. Content-based web spam detection. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2007), Banff, Alberta, Canada, 2007.
[6] P. Elango and K. Jayaraman. Clustering images using the latent Dirichlet allocation model.
[7] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proc. CVPR, 2005.
[8] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics - using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1-6, Paris, France, 2004.
[9] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.
[10] T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228-5235, 2004.
[11] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
[12] G. Heinrich. Parameter estimation for text analysis. Technical report, 2005.
[13] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11-22, 2002.
[14] T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50-57, 1999.
[15] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. In Uncertainty in Artificial Intelligence (UAI), 2002.
[16] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly.
Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83-92, Edinburgh, Scotland, 2006.
[17] X.-H. Phan. GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation. http://gibbslda.sourceforge.net/.
[18] A. Singhal. Challenges in running a commercial search engine. In IBM Search and Collaboration Seminar, IBM Haifa Labs.
[19] A. Singhal, G. Salton, M. Mitra, and C. Buckley. Document length normalization. Information Processing and Management, 32(5):619-633, 1996.
[20] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering objects and their localization in images. In Tenth IEEE International Conference on Computer Vision (ICCV 2005), volume 1, 2005.
[21] X. Wei and W. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006.
[22] I. H. Witten and E. Frank. Data Mining: Practical

Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
[23] D. Xing and M. Girolami. Employing latent Dirichlet allocation for fraud detection in telecommunications. Pattern Recognition Letters, 28(13), 2007.
