Mining Newsgroups Using Networks Arising From Social Behavior by Rakesh Agrawal et al. Presented by Will Lee

Similar documents
DS504/CS586: Big Data Analytics Graph Mining II

Machine Learning & Data Mining

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

STA141C: Big Data & High Performance Statistical Computing

DM-Group Meeting. Subhodip Biswas 10/16/2014

1 Matrix notation and preliminaries from spectral graph theory

CHAPTER-17. Decision Tree Induction

RETRIEVAL MODELS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

1 Searching the World Wide Web

Collaborative topic models: motivations cont

CS145: INTRODUCTION TO DATA MINING

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

Introduction to Data Mining

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Chapter 11. Matrix Algorithms and Graph Partitioning. M. E. J. Newman

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

DS504/CS586: Big Data Analytics Graph Mining II

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

Determining the Diameter of Small World Networks

Link Analysis Ranking

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Undirected Graphical Models

Online Social Networks and Media. Link Analysis and Web Search

CS249: ADVANCED DATA MINING

1 Matrix notation and preliminaries from spectral graph theory

Introduction to Logistic Regression and Support Vector Machine

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

Random Field Models for Applications in Computer Vision

Online Social Networks and Media. Link Analysis and Web Search

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

CS6220: DATA MINING TECHNIQUES

CS 188: Artificial Intelligence. Outline

Data Mining Recitation Notes Week 3

Behavioral Data Mining. Lecture 2

Tutorial 2. Fall /21. CPSC 340: Machine Learning and Data Mining

CS6220: DATA MINING TECHNIQUES

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Graphical Models

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria

CS246 Final Exam, Winter 2011

CS6220: DATA MINING TECHNIQUES

COMP 875 Announcements

MobiHoc 2014 MINIMUM-SIZED INFLUENTIAL NODE SET SELECTION FOR SOCIAL NETWORKS UNDER THE INDEPENDENT CASCADE MODEL

Algebraic Representation of Networks

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES

Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees

Qualifier: CS 6375 Machine Learning Spring 2015

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

CS 484 Data Mining. Classification 7. Some slides are from Professor Padhraic Smyth at UC Irvine

CS6220: DATA MINING TECHNIQUES

Spectral Graph Theory and You: Matrix Tree Theorem and Centrality Metrics

Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9

CS 781 Lecture 9 March 10, 2011 Topics: Local Search and Optimization Metropolis Algorithm Greedy Optimization Hopfield Networks Max Cut Problem Nash

Generative Clustering, Topic Modeling, & Bayesian Inference

Active and Semi-supervised Kernel Classification

Blog Distillation via Sentiment-Sensitive Link Analysis

Click Prediction and Preference Ranking of RSS Feeds

Lecture 2: Network Flows 1

Notes on Machine Learning for and

Kernel Methods. Barnabás Póczos

COMP 551 Applied Machine Learning Lecture 5: Generative models for linear classification

Improving Diversity in Ranking using Absorbing Random Walks

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

Final Exam, Spring 2006

Project in Computational Game Theory: Communities in Social Networks

3 : Representation of Undirected GM

Data Mining Techniques

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley

Overlapping Communities

CS-E4830 Kernel Methods in Machine Learning

The Trouble with Community Detection

DISTINGUISH HARD INSTANCES OF AN NP-HARD PROBLEM USING MACHINE LEARNING

Bayesian Learning. CSL603 - Fall 2017 Narayanan C Krishnan

Problem Set 4. General Instructions

Final Examination CS 540-2: Introduction to Artificial Intelligence

CS6220: DATA MINING TECHNIQUES

CS 343: Artificial Intelligence

Statistical Methods for NLP

A brief introduction to Conditional Random Fields

MultiRank and HAR for Ranking Multi-relational Data, Transition Probability Tensors, and Multi-Stochastic Tensors

Recent Advances in Bayesian Inference Techniques

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

Decision Trees. Lewis Fishgold. (Material in these slides adapted from Ray Mooney's slides on Decision Trees)

Link Analysis and Web Search

PROBABILISTIC LATENT SEMANTIC ANALYSIS

ELEMENTARY PARTICLE CARDS

Algorithm Design and Analysis

Expectation Maximization, and Learning from Partly Unobserved Data (part 2)

Machine learning for pervasive systems Classification in high-dimensional spaces

Supporting Statistical Hypothesis Testing Over Graphs

Hybrid Models for Text and Graphs. 10/23/2012 Analysis of Social Media

ECEN 689 Special Topics in Data Science for Communications Networks

More on NP and Reductions

Midterm II. Introduction to Artificial Intelligence. CS 188 Spring. You have approximately 1 hour and 50 minutes.

LAPLACIAN MATRIX AND APPLICATIONS

Fall 2017 Qualifier Exam: OPTIMIZATION. September 18, 2017

CPSC 540: Machine Learning

Transcription:

Mining Newsgroups Using Networks Arising From Social Behavior by Rakesh Agrawal et al. Presented by Will Lee wwlee1@uiuc.edu September 28, 2004

Motivation
- IR on newsgroups is challenging due to the lack of links among documents
- Unlike the WWW, we cannot use PageRank to improve retrieval performance
- An automatically generated social network within a newsgroup may help IR and text-mining applications

Methods Overview
- Classify authors as for or against a topic
- Uses a graph-theoretic approach to partition the interaction graph into two partitions
  - graph nodes = users
  - interactions (graph edges) = a user replying to another
- Assumptions
  - New posts contain comments opposing their parent posts
  - There are only two groups of users, of roughly the same size

Newsgroup Threads

Graph Partitioning
- Define a graph G(V, E)
  - V = newsgroup participants
  - e ∈ E where e = (v_i, v_j) and v_i, v_j ∈ V, such that v_i has responded to a post by v_j
- Goal is to find sets of vertices F (for) and A (against)
- Maximize the cut function f(F, A) = |E ∩ (F × A)|, i.e., the number of edges crossing between F and A (an NP-complete problem)
- Uses spectral partitioning for efficiency
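The cut objective above can be sketched concretely: given the reply edges and a candidate split into F and A, count the edges that cross the partition. The names and edges below are hypothetical toy data, not from the paper.

```python
# Toy sketch of the max-cut objective on a reply graph (hypothetical data).
# Each directed edge (u, v) means u has replied to a post by v.
edges = [("alice", "bob"), ("cindy", "bob"), ("dan", "alice"), ("bob", "dan")]

def cut_size(edges, F, A):
    """Number of reply edges crossing between the For and Against sets."""
    return sum(1 for u, v in edges
               if (u in F and v in A) or (u in A and v in F))

F = {"alice", "cindy"}
A = {"bob", "dan"}
print(cut_size(edges, F, A))  # prints 3
```

Maximizing this count over all splits is the max-cut problem, which is why the paper falls back on spectral methods rather than exact optimization.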

Turning Social Behavior Into Graph Problem
(Figure: a reply graph over users Alice, Bob, Cindy, Dan, and Elaine; a max cut separates the nodes into a For group and an Against group)

Graph Partitioning Methods
1. EV Algorithm
   (a) Co-citation matrix D = GG^T, with weighted edge w = number of people co-cited by authors u_1 and u_2. Think of D as a similarity matrix for authors u_i and u_j.
   (b) The second eigenvector of D is a good approximation of G's bipartition
2. EV + KL
   (a) Uses the Kernighan-Lin heuristic to improve the partitioning
3. EV (Constrained) and EV + KL (Constrained)
   (a) Identify some for and against authors, and group each set as one node
4. Iterative Classification
   (a) Initialize: label a small number of people in the newsgroup as for or against
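The EV algorithm can be sketched in a few lines of numpy. The reply matrix below is hypothetical toy data (authors 0 and 1 in one camp, 2 and 3 in the other), and this is a minimal sketch of the idea, not the paper's implementation.

```python
import numpy as np

# Toy reply matrix (hypothetical data): G[i, j] = 1 if author i has
# replied to a post by author j.
G = np.array([
    [0, 0, 1, 1],   # author 0 replies to 2 and 3
    [1, 0, 1, 0],   # author 1 replies to 0 and 2
    [1, 1, 0, 0],   # author 2 replies to 0 and 1
    [1, 1, 0, 0],   # author 3 replies to 0 and 1
], dtype=float)

# Co-citation matrix: D[i, j] = number of people cited by both i and j.
D = G @ G.T

# The second eigenvector (second-largest eigenvalue) of the similarity
# matrix D approximates the bipartition; the sign of each component
# assigns that author to one of the two camps.
eigvals, eigvecs = np.linalg.eigh(D)   # eigh returns ascending eigenvalues
v2 = eigvecs[:, -2]                    # second-largest eigenvector
labels = np.sign(v2)
print(labels)
```

The overall sign of an eigenvector is arbitrary, so only the grouping is meaningful: authors 0 and 1 land on one side, 2 and 3 on the other.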

   (b) Iterate m times:
      i. Calculate s(v_i) for each node v_i, where the weight w_ij is the weight between nodes v_j and v_i:
         s(v_i) = Σ_j s(v_j) w_ij / Σ_j w_ij
      ii. Sort the labels (sign of s(v_i)) by confidence (|s(v_i)|)
      iii. Accept the top k = N · i / m labels, where i = iteration, m = total iterations, and N = number of instances in the test data
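The score-update step can be sketched as below. This is a simplified version (hypothetical toy weights, seed labels held fixed, no accept-top-k schedule), not the paper's full procedure.

```python
import numpy as np

# Toy interaction weights (hypothetical data): W[i, j] = interaction
# weight between authors i and j (symmetric here).
W = np.array([
    [0, 2, 0, 1],
    [2, 0, 1, 0],
    [0, 1, 0, 2],
    [1, 0, 2, 0],
], dtype=float)

# Seed scores: +1 = "for", -1 = "against", 0 = unlabeled.
s = np.array([1.0, 0.0, -1.0, 0.0])
seeds = s != 0

for _ in range(10):                     # iterate the score update
    s_new = (W @ s) / W.sum(axis=1)     # s(v_i) = sum_j s(v_j) w_ij / sum_j w_ij
    s_new[seeds] = s[seeds]             # keep seed labels fixed
    s = s_new

labels = np.sign(s)                     # sign = predicted camp, |s| = confidence
print(labels)                           # [ 1.  1. -1. -1.]
```

With antagonistic edges the sign propagates from the seeds: author 1 interacts mostly with the "for" seed and author 3 mostly with the "against" seed, so their scores settle at +1/3 and -1/3 respectively.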

Evaluation
- Uses three newsgroups: Abortion, Gun Control, and Immigration
- Manually tag 50 random people as in the for or against categories
- Compared with classic classification algorithms (Naive Bayes & SVM) that work on message content

                           Abortion    Gun Control   Immigration
  Majority                 57%         72%           54%
  SVM                      55%         42%           55%
  Naive Bayes              50%         72%           54%
  Iterative                67%         80%           83%
  EV / EV+KL               73% / 75%   78% / 74%     50% / 52%
  Constrained EV / EV+KL   73% / 73%   84% / 82%     88% / 88%

- Also, sensitivity experiments show that more posts (and more biased posts) lead to higher accuracy

Contributions / Limitations
- Contributions
  - Apply graph-theoretic algorithms to a new domain
  - Sensitivity analysis on simulated newsgroup data
- Limitations
  - Assume users post against each other, which may not be true in some (e.g., technical) newsgroups
  - Constrained and iterative methods still need training data
  - Should justify why the constrained methods perform much better than the unconstrained ones

Discussion Questions
- How does user partitioning help IR?
- In a complex web of discussions within a newsgroup, users may not belong to the same for or against group for all topics. How can this system be applied to such newsgroups?
- How is this system similar to the PageRank algorithm?
- Is there any other way to draw connections among the newsgroup postings?