Understanding Comments Submitted to FCC on Net Neutrality

Kevin (Junhui) Mao, Jing Xia, Dennis (Woncheol) Jeong

December 12, 2014


Abstract

We aim to understand and summarize themes in the 1.65 million comments and emails submitted to the FCC regarding its rule-making on net neutrality. Using the text of these 1.65 million comments, we built a Latent Dirichlet Allocation (LDA) model and summarized the top twenty themes of the public comments on net neutrality.

1 Introduction

In seeking to propose new rules on net neutrality (anti-blocking and anti-discrimination), the FCC opened a public comment period and received an unprecedented number of replies, about 1.65 million, from various parties. The purpose of our project is to understand what the public and the various parties have to say about net neutrality regulation.

2 Dataset

The data is publicly available on the FCC website [3]. We included comments from both the initial open comment period (May 15 to July 15) and the reply period (July 16 to September 10), comprising 1.65 million entries after cleaning. The meta fields include the name of the filer, the name of the author or lawyer, the date of filing, city, state, zip code, the text of the comment, and so on. The comments are not labeled.

3 Features and Preprocessing

We focused on the text of the comments in this project. During the preprocessing phase, we split multi-email comments, removed non-alphanumeric characters, removed stopwords, generated unigram, bigram, and uni+bigram feature sets, and finally converted each of the three feature sets into the data format recognized by the LDA library.

4 Models

For this kind of unsupervised thematic analysis, one may use algorithms such as Nonnegative Matrix Factorization (NMF) [2], Probabilistic Latent Semantic Analysis (PLSA) [4], and Latent Dirichlet Allocation (LDA) [1]. For our project, we decided to perform topic modeling with the open-source LDA implementation in Mallet [6]. Mallet uses Gibbs sampling with hyperparameter optimization and can run in parallel across multiple threads. We used 80% of the data for training and reserved 20% for topic inference. We experimented with the unigram, bigram, and uni+bigram feature sets while varying the number of topics from 5 to 50 in increments of 5.

4.1 Latent Dirichlet Allocation

LDA is an unsupervised topic modeling technique. The basic idea of LDA is that each document contains a random mixture of latent topics from which words are drawn. LDA treats each document as a bag of words. The LDA generative process is illustrated by Algorithm 1 [7].
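As a concrete illustration of this generative story before it is stated formally in Algorithm 1, the following is a minimal NumPy sketch that simulates a tiny synthetic corpus. It is not part of our pipeline, and the topic count, vocabulary size, document count, and Dirichlet parameters are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
K, V, M = 3, 1000, 5           # topics, vocabulary size, documents (illustrative)
alpha = np.full(K, 0.1)        # symmetric Dirichlet prior over per-document topic mixtures
beta = np.full(V, 0.01)        # symmetric Dirichlet prior over per-topic word distributions

phi = rng.dirichlet(beta, size=K)                 # K topic-word multinomials, shape (K, V)
documents = []
for i in range(M):
    theta_i = rng.dirichlet(alpha)                # topic mixture for document i, shape (K,)
    n_words = rng.poisson(50)                     # document length (illustrative choice)
    z = rng.choice(K, size=n_words, p=theta_i)    # topic assignment for each word position
    words = [rng.choice(V, p=phi[k]) for k in z]  # each word drawn from its topic's distribution
    documents.append(words)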

Algorithm 1: Latent Dirichlet Allocation generative process. Assume we know the K topic distributions for our dataset, let V be the number of distinct tokens (vocabulary size) in our corpus, and let M be the number of documents.

1. For each document i, a multinomial topic distribution $\theta_i \in \mathbb{R}^K$ is drawn from a Dirichlet prior with parameters $\alpha$.
2. For each word in the document, a topic $z_{ij}$ is drawn from the multinomial distribution $\theta_i$.
3. Finally, the word $w_{ij}$ is drawn from the multinomial distribution $\phi_{z_{ij}} \in \mathbb{R}^V$, where $\phi_{z_{ij}}$ is itself drawn from a Dirichlet with parameters $\beta$.

Usually the Dirichlet priors on topics and words are symmetric. Details about inference can be found in the LDA publications [1] and [7].

4.2 Perplexity evaluation

We evaluate the perplexity on the test set for each of the unigram, bigram, and uni+bigram feature sets. Perplexity is defined as

$$\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left\{-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right\}$$

where M is the number of documents in the test set, and each document d has $N_d$ words with word sequence $\mathbf{w}_d$. The relationship between log-perplexity and the number of topics is shown in Figure 1.

Figure 1: Log-perplexity versus number of topics. For a given number of topics, the unigram feature set generally has the lowest perplexity compared with the bigram and uni+bigram feature sets.

5 Results

The final result we present uses an LDA model with unigram features and 20 topics. After training the model on the training data, we used it to infer topics on the test data. The output is a matrix $X \in \mathbb{R}^{M \times K}$, where M is the number of test documents and K is the number of topics. Each row $X_i^T \in \mathbb{R}^K$ is the probability distribution over topics for document i.
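As a rough sketch of this train/infer split, the snippet below uses scikit-learn's LDA implementation as a stand-in for Mallet (our actual runs used Mallet's parallel Gibbs sampler); the `comments` list and the parameter values are illustrative assumptions.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# `comments` is assumed to be the list of cleaned comment strings from Section 3.
train_docs, test_docs = train_test_split(comments, test_size=0.2, random_state=0)

# Unigram bag-of-words features; ngram_range=(1, 2) would give the uni+bigram set.
vectorizer = CountVectorizer(stop_words="english", ngram_range=(1, 1))
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

lda = LatentDirichletAllocation(n_components=20, learning_method="batch", random_state=0)
lda.fit(X_train)
print("held-out perplexity:", lda.perplexity(X_test))   # Section 4.2 evaluation

# Each row of X_topics is a per-document topic distribution (the matrix X above).
X_topics = lda.transform(X_test)   # shape (M_test, K); rows sum to 1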

5.1 Probability mean of topics based on test data

The mean probability of each topic is calculated across all test documents. The mean topic probabilities and the top-3 topics with the highest probability are given in Figure 2 and Table 1.

Figure 2: Mean probability of each topic, based on the test data.

Table 1: Top-3 topics with highest mean probability

Rank  Topic-Id  Keywords
1     8         isps can use choice title speed destroy slow business able also others experience economic services first high company work bad
2     4         companies don cable like one content many just good comcast get see even big already thing people world idea make
3     15        economic rule opportunity fewer democracy entrepreneurs certainty must access rules protect powerful investors proposal ensure businesses strong remembered current erecting

5.2 Histogram of primary topics

For each document, we chose the topic with the highest probability as the primary topic of that document. We then counted the number of documents assigned to each primary topic to generate the histogram. The histogram of primary topics and the top-3 topics with the highest document frequency are given in Figure 3 and Table 2. Note that the first two topics in Table 2 are the same as the first two topics in Table 1.
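Both summaries in Sections 5.1 and 5.2 are simple reductions of the document-topic matrix; the following is a minimal sketch, assuming the `X_topics` matrix from the earlier snippet.

import numpy as np

mean_topic_prob = X_topics.mean(axis=0)                  # Figure 2: mean probability per topic
top3_by_mean = np.argsort(mean_topic_prob)[::-1][:3]     # topic ids behind Table 1

primary_topic = X_topics.argmax(axis=1)                  # primary topic of each document
doc_counts = np.bincount(primary_topic, minlength=X_topics.shape[1])  # Figure 3 histogram
top3_by_count = np.argsort(doc_counts)[::-1][:3]         # topic ids behind Table 2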

Figure 3: Histogram of primary topics.

Table 2: Top-3 primary topics with highest document frequency

Rank  Topic-Id  Keywords
1     4         companies don cable like one content many just good comcast get see even big already thing people world idea make
2     8         isps can use choice title speed destroy slow business able also others experience economic services first high company work bad
3     11        service common providers internet carriers communications broadband classified reclassify act want title proposed fcc federal stop telecommunications chairman past must

5.3 Word cloud of top topics

We extracted the terms from the top topics in Table 1 and Table 2, then plotted a word cloud based on term frequency. The top-10 most frequent terms are listed in Table 3.

Table 3: Top-10 most frequent terms from the top topics

Rank  Term      Term frequency
1     isps      738,377
2     economic  463,104
3     title     425,251
4     choice    422,214
5     can       421,399
6     use       393,044
7     service   334,203
8     also      318,011
9     speed     271,061
10    business  270,037
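The term frequencies behind Table 3 and the word cloud can be tallied directly from the tokenized test comments; a minimal sketch follows, where `tokenized_docs` and the set `top_topic_terms` (keywords from Tables 1 and 2) are assumed to be available from the earlier steps.

from collections import Counter

term_counts = Counter(
    token
    for doc in tokenized_docs      # tokenized_docs: list of token lists (assumed)
    for token in doc
    if token in top_topic_terms    # restrict to keywords from the top topics
)

# Top-10 most frequent terms (Table 3); the same counts feed the word-cloud plot.
for term, freq in term_counts.most_common(10):
    print(f"{term}\t{freq:,}")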

Figure 4: Word cloud of the terms from the top topics.

6 Conclusion and Discussion

We trained an LDA model to find the top 20 topics in the 1.65 million comments on net neutrality. Our findings in Table 1 and Table 2 suggest that people were most concerned with the economic impact of internet service providers having power over internet-based companies. People also cited Comcast's dominance and its bundling of cable packages with internet service. Finally, the third most frequent topic involved the tension between entrepreneurship in a democratic society and the protection of powerful companies.

7 Future Work

Some of the topic words, such as "internet", "net", "also", and "can", were not particularly helpful and in hindsight should have been added to the stopword list. For future directions, one could contrast our LDA results with NMF or PLSA models. We could also make better use of the meta-information, for example using an SVM to detect expert comments. Our code is available on GitHub [5].

References

[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. 2003.
[2] Chris Ding, Tao Li, and Wei Peng. Nonnegative Matrix Factorization and Probabilistic Latent Semantic Indexing. 2006.
[3] FCC. Electronic Comment Filing System. http://www.fcc.gov/files/ecfs/14-28/ecfs-files.htm.
[4] Thomas Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. 2001.
[5] Kevin Mao. Net Neutrality GitHub. https://github.com/kevinmao/fcc-nnc.
[6] Andrew Kachites McCallum. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu.
[7] Alexander Smola and Shravan Narayanamurthy. An Architecture for Parallel Topic Models. 2010.