Semi-supervised learning for node classification in networks


Semi-supervised learning for node classification in networks Jennifer Neville Departments of Computer Science and Statistics Purdue University (joint work with Paul Bennett, John Moore, and Joel Pfeiffer)

How to exploit relational dependencies to improve predictions about users? http://thenextweb.com/socialmedia/2013/11/24/facebook-grandparents-need-next-gen-social-network/#gref

Data network: G = (V, E), with V := users and E := friendships.

Data network: G = (V, E), with V := users and E := friendships. For each user we may want to infer attributes such as gender, marital status, politics, or religion.

Attributed network: G = (V, E), with V := users and E := friendships. Each node now carries an attribute vector X and a class label Y, e.g., ⟨X, Y_i = 1⟩ and ⟨X, Y_j = 0⟩.

How to predict labels within a single, partially labeled network, given a few observed nodes such as ⟨X, Y_i = 1⟩ and ⟨X, Y_j = 0⟩?

Graph regularization: assume linked nodes exhibit homophily and make predictions via label propagation. (Christakis and Fowler, "The Spread of Obesity in a Large Social Network Over 32 Years," New England Journal of Medicine 2007; 357:370-379; http://humannaturelab.net/resources/images/)
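To make the label-propagation idea concrete, here is a minimal sketch (not the speaker's implementation) of homophily-based propagation on a partially labeled NetworkX graph: unlabeled nodes repeatedly average their neighbors' scores while labeled nodes stay clamped.

```python
import networkx as nx

def propagate_labels(G, seed_labels, n_iters=50):
    """Homophily-based label propagation on a partially labeled graph.

    seed_labels: dict mapping labeled nodes to 0/1 class labels.
    Returns a score in [0, 1] for every node; unlabeled nodes average
    their neighbors' scores, labeled nodes stay clamped.
    """
    scores = {v: seed_labels.get(v, 0.5) for v in G}
    for _ in range(n_iters):
        new_scores = {}
        for v in G:
            if v in seed_labels:                    # clamp labeled nodes
                new_scores[v] = float(seed_labels[v])
            else:
                nbrs = list(G.neighbors(v))
                if nbrs:                            # average over neighbors
                    new_scores[v] = sum(scores[u] for u in nbrs) / len(nbrs)
                else:
                    new_scores[v] = scores[v]       # isolated node: keep prior
        scores = new_scores
    return scores
```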

Apply the model to make predictions collectively. Example setting: small-world graph, 30% of nodes labeled, autocorrelation 0.5.

Graph regularization:
- No learning: GRF (Zhu et al. 03); wvRN (Macskassy et al. 07)
- Add links: wvRN+ (Macskassy 07); GhostEdges (Gallagher et al. 08); SCRN (Wang et al. 13)
- Learn weights: GNetMine (Ji et al. 10); LNP (Shi et al. 11); LGCW (Dhurandhar et al. 13)
Probabilistic modeling:
- Learn from labeled only: RMN (Taskar et al. 01); MLN (Richardson et al. 06); RDN (Neville et al. 06)
- Add graph features: SocialDim (Tang et al. 09); RelationStr (Xiang et al. 10)
- Semi-supervised learning: LBC (Lu et al. 03); PL-EM (Xiang et al. 08); CC-HLR (McDowell et al. 12); RDA (Pfeiffer et al. 14)

Probabilistic modeling: learn a statistical relational model and use joint inference to make predictions. [Figure: graphical model with attribute variables X_i and label variables Y_i for each node.] Estimate the joint distribution P(Y | {X}, G). Note that we often have only a single network for learning.

Graph regularization:
- No learning: GRF (Zhu et al. 03); wvRN (Macskassy et al. 07)
- Add links: wvRN+ (Macskassy 07); GhostEdges (Gallagher et al. 08); SCRN (Wang et al. 13)
- Learn weights: GNetMine (Ji et al. 10); LNP (Shi et al. 11); LGCW (Dhurandhar et al. 13)
Probabilistic modeling:
- Learn from labeled only: RMN (Taskar et al. 01); MLN (Richardson et al. 06); RDN (Neville et al. 06)
- Add graph features: SocialDim (Tang et al. 09); RelationStr (Xiang et al. 10)
- Semi-supervised learning: LBC (Lu et al. 03); PL-EM (Xiang et al. 08); CC-HLR (McDowell et al. 12); RDA (Pfeiffer et al. 14)

How does regularization compare to probabilistic modeling?

Regularization vs. probabilistic modeling (Zeno & Neville, MLG 16). [Figure: AUC vs. network density for a weighted-vote relational neighbor model and a relational Bayes collective classifier, at varying levels of attribute correlation.] Probabilistic modeling can exploit dependencies across a wider range of scenarios; link density impacts performance.

How to learn from one partially labeled network?

Method 1: Ignore unlabeled nodes during learning. Drop the unlabeled nodes; the training data is the labeled part of the network (labels, attributes, edges). The model is defined via a local conditional; the full model gives P(Y | X, E, Θ) with Ŷ = argmax_Y P(Y | X, E, Θ), and the parameters are optimized with the pseudolikelihood:
Θ̂ = argmax_Θ Σ_{v_i ∈ V_L} log P_Θ(y_i | Y_{MB_L}(v_i), x_i)   (summation over local log conditionals)
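As a rough illustration of this setup (a hypothetical feature construction, not the exact model from the talk), one could fit a logistic-regression local conditional on the labeled nodes only, using each node's own attributes plus a simple aggregate of its labeled neighbors as the relational feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def relational_features(G, labels, node, x):
    """Node attributes plus a simple relational aggregate:
    fraction of *labeled* neighbors that are positive."""
    nbr_labels = [labels[u] for u in G.neighbors(node) if u in labels]
    frac_pos = np.mean(nbr_labels) if nbr_labels else 0.5
    return np.append(x[node], frac_pos)

def fit_local_conditional(G, labels, x):
    """Maximize the pseudolikelihood over labeled nodes only
    (each labeled node contributes one local log-conditional)."""
    V_L = list(labels)
    X = np.array([relational_features(G, labels, v, x) for v in V_L])
    y = np.array([labels[v] for v in V_L])
    return LogisticRegression(max_iter=1000).fit(X, y)
```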

Method 1: Apply the learned model to the remainder. Test data = the full network (but evaluate only on unlabeled nodes). Labeled nodes seed the inference. Use approximate inference to collectively classify (e.g., variational or Gibbs), targeting P(Y | X, E, Θ). For unlabeled instances, iteratively estimate P_Θ(y_i | Y_{MB}(v_i), x_i).
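The corresponding inference step can be sketched as soft iterative classification (a simplification of variational or Gibbs inference, using the same hypothetical feature convention as above): hold labeled nodes fixed, initialize the unlabeled predictions, and repeatedly re-apply the learned local conditional.

```python
import numpy as np

def collective_inference(G, labels, x, model, n_iters=10):
    """Iteratively re-estimate P(y_i | MB(v_i), x_i) for unlabeled nodes,
    feeding the current soft predictions back in as neighbor labels."""
    pred = {v: 0.5 for v in G if v not in labels}        # initialize unlabeled
    for _ in range(n_iters):
        current = {**{v: float(y) for v, y in labels.items()}, **pred}
        for v in pred:
            nbr = [current[u] for u in G.neighbors(v)]
            frac_pos = np.mean(nbr) if nbr else 0.5      # relational aggregate
            feats = np.append(x[v], frac_pos).reshape(1, -1)
            pred[v] = model.predict_proba(feats)[0, 1]
    return pred
```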

Method 2: Semi-supervised learning. Use the entire network to jointly learn parameters and make inferences about the class labels of unlabeled nodes. Lu and Getoor (ICML 03) use relational features and ICA; McDowell and Aha (ICML 12) combine two classifiers with label regularization.

Semi-supervised relational learning: relational expectation maximization (EM) (Xiang & Neville 08).
Expectation (E) step: predict labels with collective classification, i.e., estimate P_Θ(y_i | Y_{MB}(v_i), x_i) for the unlabeled nodes.
Maximization (M) step: Θ̂ = argmax_Θ Σ_{Y_U} P(Y_U) · [Σ_{v_i} log P_Θ(y_i | Y_{MB}(v_i), x_i)] (summation over local log conditionals), using the predicted probabilities during optimization (in the local conditionals).
[Figure: example partially labeled network.]
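Relational EM then alternates the two previous sketches: an E-step of collective inference under the current parameters, and an M-step that refits the local conditional while letting unlabeled neighbors contribute their predicted probabilities to the relational aggregate. A rough sketch, reusing the hypothetical helpers above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def relational_em(G, labels, x, n_em_iters=10):
    """Alternate collective inference (E-step) and pseudolikelihood
    refitting (M-step) on a single partially labeled network."""
    model = fit_local_conditional(G, labels, x)          # bootstrap from labeled part
    for _ in range(n_em_iters):
        # E-step: soft predictions for unlabeled nodes under current params
        pred = collective_inference(G, labels, x, model)
        # M-step: refit on labeled nodes, with unlabeled neighbors
        # contributing their predicted probabilities to the aggregate
        current = {**{v: float(y) for v, y in labels.items()}, **pred}
        X, y = [], []
        for v, lab in labels.items():
            nbr = [current[u] for u in G.neighbors(v)]
            frac_pos = np.mean(nbr) if nbr else 0.5
            X.append(np.append(x[v], frac_pos))
            y.append(lab)
        model = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))
    return model, pred
```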

How does relational EM perform? It works well when the network has a moderate amount of labels. If the network is sparsely labeled, it is often better to use a model that is not learned (e.g., graph regularization rather than relational EM). Why? In sparsely labeled networks, errors from collective classification compound during propagation, and both learning and inference require approximation, so network structure impacts the errors.

Finding: Network structure can bias inference in partially-labeled networks; maximum entropy constraints correct for bias

Effect of relational biases on relational EM. We compared CL-EM and PL-EM and examined the distribution of predicted probabilities on a real-world dataset (Amazon co-occurrence network from SNAP; varied class priors, 10% labeled). [Figure: predicted P(+) and error for Actual, RML, CL-EM, and PL-EM.] Over-propagation error during inference causes PL-EM to collapse to a single prediction; this is worse on sparsely labeled datasets. We need a method to correct the bias for any approach based on a local (relational) conditional.

Maximum entropy inference for PL-EM (Pfeiffer et al. WWW 15). A correction to inference (E-step) that enables estimation with the pseudolikelihood (M-step). Idea: the proportion of negatively predicted items should equal the proportion of negatively labeled items. Fix: shift the probabilities up/down. Repeat for each inference iteration:
- transform probabilities to logit space: h_i = σ^{-1}(P(y_i = 1))
- compute the offset location: δ = P(0) · |V_U|
- adjust the logits: h_i = h_i − h_(δ)
- transform back to probabilities: P(y_i = 1) = σ(h_i)
[Figure: shifting the sorted logits so the pivot (here 5/7 of the way through) lands at P(y) = 0.5.]
The corrected probabilities are used to retrain during PL-EM (M-step).
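The correction itself is a small transformation of the current predictions. A sketch of the logit shift described above, assuming probs holds P(y_i = 1) for the unlabeled nodes and neg_prior is the proportion of negatively labeled nodes:

```python
import numpy as np

def max_entropy_correction(probs, neg_prior, eps=1e-6):
    """Shift predicted probabilities so the fraction predicted negative
    matches the fraction of negatives among labeled nodes.

    probs: array of P(y_i = 1) for unlabeled nodes.
    neg_prior: proportion of labeled nodes in the negative class.
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    h = np.log(p / (1 - p))                     # logits h_i = sigma^{-1}(P(y_i = 1))
    pivot_idx = int(neg_prior * len(h))         # offset location: P(0) * |V_U|
    pivot = np.sort(h)[min(pivot_idx, len(h) - 1)]
    h_shifted = h - pivot                       # logits below the pivot become negative
    return 1.0 / (1.0 + np.exp(-h_shifted))     # back to probabilities
```

After the shift, roughly the bottom neg_prior fraction of logits fall below zero, so thresholding at 0.5 reproduces the labeled class proportions; the corrected probabilities are then fed back into the M-step.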

Experimental results - correction effects. [Figure: predicted P(+) on Amazon with small and large class priors, for Actual, RML, CL-EM, PL-EM (Naive), and PL-EM (MaxEntInf).] The maximum entropy correction removes the bias due to over-propagation in collective inference.

Experimental results - large patent dataset. [Figure: BAE vs. proportion labeled (10^-3 to 10^-1) on the Computers and Organic patent categories, comparing LR, LR (EM), RLR, LP, CL-EM, PL-EM (Naive), and PL-EM (MaxEntInf).] The correction allows relational EM to improve over competing methods in sparsely labeled domains. Note: McDowell & Aha (ICML 12) may correct the same effect, but during estimation rather than inference.

Can neural networks improve predictions by further reducing bias?

Deep collective inference (Moore & Neville AAAI 17). Approach: use a neural network with neighbors' attributes and class-label predictions as inputs. Key ideas: represent the set of neighbors as a sequence, in random order; to deal with heterogeneous inputs (i.e., a varying number of neighbors), use a recurrent model (LSTM); to learn with a partially labeled network, use semi-supervised collective classification.

Example network. [Figure: nodes a-j; red = target node, blue = neighbors, grey = labeled, white = unlabeled.]

Example network (continued). [Figure: target node d with neighbors b, e, i, f, a (red = target node, blue = neighbors, grey = labeled, white = unlabeled); each input pairs a neighbor's features with its label or previous prediction, e.g., [f_e, y_e], [f_i, y_i], [f_a, y_a] for labeled neighbors and [f_b, ŷ_b^(t_c-1)], [f_f, ŷ_f^(t_c-1)] for unlabeled ones, followed by [f_d, ŷ_d^(t_c-1)] for the target, producing ŷ_d^(t_c).]

Deep collective inference (DCI): model description. For node v_i at collective-inference iteration t_c, the input is the node's features concatenated with its previous prediction, [f_i, ŷ_i^(t_c-1)], together with each neighbor's features concatenated with its prediction or label, {[f_j, (y_j or ŷ_j^(t_c-1))] : v_j ∈ N_i}. The full input sequence for v_i is
x_i = [x_i^(0), x_i^(1), ..., x_i^(|N_i|)] = [[f_{j1}, y_{j1}], [f_{j2}, y_{j2}], ..., [f_i, ŷ_i^(t_c-1)]].
For example, for target node d with neighbors b, e, i, f, a:
x_d = [⟨f_b, ŷ_b^(t_c-1)⟩, ⟨f_e, y_e⟩, ⟨f_i, y_i⟩, ⟨f_f, ŷ_f^(t_c-1)⟩, ⟨f_a, y_a⟩, ⟨f_d, ŷ_d^(t_c-1)⟩] = [x_d^(0), x_d^(1), x_d^(2), x_d^(3), x_d^(4), x_d^(5)].
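To make the input encoding concrete, here is a hedged sketch (hypothetical helper names, not the authors' code) of how one might assemble the per-node sequence: labeled neighbors contribute their true label, unlabeled neighbors contribute their previous-iteration prediction, the neighbor order is randomized, and the target node's own features and previous prediction are appended last.

```python
import random
import numpy as np

def build_dci_sequence(G, v, features, labels, prev_pred):
    """Build the input sequence for target node v:
    one [features, label-or-prediction] vector per neighbor (random order),
    followed by the target node's own [features, previous prediction]."""
    seq = []
    nbrs = list(G.neighbors(v))
    random.shuffle(nbrs)                                 # randomize neighbor order each iteration
    for u in nbrs:
        y_u = labels[u] if u in labels else prev_pred[u]
        seq.append(np.append(features[u], y_u))
    seq.append(np.append(features[v], prev_pred[v]))     # target node goes last
    return np.stack(seq)
```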

LSTM structure. [Figure: structure of the LSTM with sequential inputs x^(0), x^(1), ..., x^(|N|-1), x^(|N|). The LSTM cell has w hidden units h_1^(t-1), ..., h_w^(t-1) and p features per input; the input at the end of the sequence, x^(|N|) (the target node's features f_0, f_1, ..., f_p and its previous prediction), yields the output ŷ.]
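A minimal PyTorch sketch of the architecture in the figure (layer sizes are placeholders, not the published configuration): the LSTM reads the per-node sequence, and the hidden state after the final input, the target node's own entry, drives a sigmoid output.

```python
import torch
import torch.nn as nn

class NeighborSequenceLSTM(nn.Module):
    """LSTM over the [features, label/prediction] sequence of a node's
    neighbors (target node last); the final hidden state predicts y_hat."""
    def __init__(self, input_dim, hidden_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, seq):
        # seq: (batch, sequence length = |N_i| + 1, input_dim)
        _, (h_n, _) = self.lstm(seq)              # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.out(h_n[-1]))   # P(y_i = 1) per node
```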

Learning: key aspects. Initialize label predictions with a non-collective version of the model, deep relational inference (DRI). Semi-supervised learning: estimate parameters until convergence, then perform collective inference to make predictions for all unlabeled nodes. Randomize the neighbor order on every iteration. Correct for imbalanced classes, either by balancing the objective function or by balancing the data with augmentation. Use backpropagation through time with early stopping and a cross-entropy loss.
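Putting the pieces together, a rough sketch of the semi-supervised collective training loop (assuming the hypothetical build_dci_sequence helper and NeighborSequenceLSTM model from the previous sketches, and that prev_pred holds an initial prediction for every node, e.g., from the non-collective DRI model; class balancing, mini-batching, and early stopping are elided):

```python
import torch
import torch.nn as nn

def train_dci(model, G, features, labels, prev_pred, n_rounds=5, epochs=20, lr=1e-3):
    """Sketch: fit on labeled nodes, re-predict unlabeled nodes,
    and repeat with fresh neighbor orderings each round."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    unlabeled = [v for v in G if v not in labels]
    for _ in range(n_rounds):
        # train on labeled nodes; sequences are rebuilt so neighbor order is re-randomized
        for _ in range(epochs):
            for v, y in labels.items():
                seq = torch.tensor(build_dci_sequence(G, v, features, labels, prev_pred),
                                   dtype=torch.float32).unsqueeze(0)
                loss = loss_fn(model(seq).squeeze(), torch.tensor(float(y)))
                opt.zero_grad()
                loss.backward()
                opt.step()
        # collective step: update predictions for unlabeled nodes
        with torch.no_grad():
            for v in unlabeled:
                seq = torch.tensor(build_dci_sequence(G, v, features, labels, prev_pred),
                                   dtype=torch.float32).unsqueeze(0)
                prev_pred[v] = float(model(seq))
    return model, prev_pred
```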

Evaluation on small-to-medium-sized networks. [Figure: BAE of LR, LP, LR+, RNCC, PLEM, PLEM+, and DCI on Amazon DVD with 21/79 and 50/50 class splits; lower BAE is better.]

Evaluation on a large network (900K nodes). [Figure: BAE of PLEM, PLEM+, RNCC, and DCI on Patents (17/83 class split); lower BAE is better.] Overall: a 12% gain over PLEM+N2V and a 20% gain over RNCC (a competing neural network approach).

How does network structure impact performance of learning and inference?

Impact of network structure on collective classification:
- Non-stationarity in network structure reduces accuracy (Angin & N 08).
- Propagation error during inference is mediated by local network structure (Xiang & N 11).
- Differences in network structure and label availability result in variance among node marginals, which impairs performance (Xiang & N 13).
- The structure of the rolled-out model can lead to inference instability in collective classification (Pfeiffer et al. 14), which increases error due to bias in learning and/or inference.
The common thread among these effects is a difference in graph distribution between the learning and inference settings. Understanding the mechanisms that lead to error can help improve models and algorithms; e.g., neural networks may help to reduce bias, but at the expense of increased variance.

Thanks neville@cs.purdue.edu www.cs.purdue.edu/~neville