Semi-supervised learning for node classification in networks Jennifer Neville Departments of Computer Science and Statistics Purdue University (joint work with Paul Bennett, John Moore, and Joel Pfeiffer)
How to exploit relational dependencies to improve predictions about users? [Image: http://thenextweb.com/socialmedia/2013/11/24/facebook-grandparents-need-next-gen-social-network/#gref]
Data network: $G = (V, E)$, $V$ := users, $E$ := friendships.
Gender? Married? Politics? Religion? Data network: $G = (V, E)$, $V$ := users, $E$ := friendships.
Attributed network: $G = (V, E)$, $V$ := users, $E$ := friendships, where each node carries an attribute vector and a class label, e.g., $\langle \mathbf{X}, Y_i = 1\rangle$ and $\langle \mathbf{X}, Y_j = 0\rangle$. [Figure: table of example attribute values and class labels for each node.]
How to predict labels within a single, partially labeled network? (Some nodes are observed, e.g., $\langle \mathbf{X}, Y_i = 1\rangle$ and $\langle \mathbf{X}, Y_j = 0\rangle$; the rest are unlabeled.)
Graph regularization: assume linked nodes exhibit homophily and make predictions via label propagation. [Image: N.A. Christakis and J.H. Fowler, The Spread of Obesity in a Large Social Network Over 32 Years, New England Journal of Medicine 2007; 357:370-379; http://humannaturelab.net/resources/images/]
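To make the label propagation idea concrete, here is a minimal sketch (not from the talk) of GRF-style propagation on a small undirected graph, assuming a dense adjacency matrix and binary labels; the function and variable names are illustrative.

```python
import numpy as np

def label_propagation(A, y, labeled_mask, n_iter=100):
    """GRF-style label propagation: unlabeled nodes repeatedly take the
    degree-normalized average of their neighbors' scores, while labeled
    nodes stay clamped to their observed labels.

    A            : (n, n) symmetric adjacency matrix
    y            : (n,) observed binary labels (ignored where unlabeled)
    labeled_mask : (n,) boolean, True for labeled nodes
    returns      : (n,) estimated P(y_i = 1)
    """
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0                                   # guard isolated nodes
    P = A / deg[:, None]                                  # row-stochastic transition matrix
    f = np.where(labeled_mask, y.astype(float), 0.5)      # unlabeled nodes start at 0.5
    for _ in range(n_iter):
        f = P @ f                                         # homophily: average the neighbors
        f[labeled_mask] = y[labeled_mask]                 # clamp labeled nodes
    return f

# Toy chain 0-1-2-3 with the two endpoints labeled:
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
y = np.array([1, 0, 0, 0])
mask = np.array([True, False, False, True])
print(label_propagation(A, y, mask))   # interior nodes interpolate between the endpoints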
Apply the model to make predictions collectively. [Figure: example small world graph; labeled nodes: 30%; autocorrelation: 0.5.]
Landscape of methods:
Graph regularization: (no learning) GRF (Zhu et al. 03); wvRN (Macskassy et al. 07) | (add links) wvRN+ (Macskassy 07); GhostEdges (Gallagher et al. 08); SCRN (Wang et al. 13) | (learn weights) GNetMine (Ji et al. 10); LNP (Shi et al. 11); LGCW (Dhurandhar et al. 13)
Probabilistic modeling: (learn from labeled only) RMN (Taskar et al. 01); MLN (Richardson et al. 06); RDN (Neville et al. 06) | (add graph features) SocialDim (Tang et al. 09); RelationStr (Xiang et al. 10) | (semi-supervised learning) LBC (Lu et al. 03); PL-EM (Xiang et al. 08); CC-HLR (McDowell et al. 12); RDA (Pfeiffer et al. 14)
Probabilistic modeling: learn a statistical relational model and use joint inference to make predictions. [Figure: graphical model with attribute nodes $X_i^{1..3}$ and class label nodes $Y_i$ connected along the network edges.] Estimate the joint distribution $P(\mathbf{Y} \mid \{\mathbf{X}\}, G)$. Note: we often have only a single network for learning.
(Landscape of methods revisited; see the taxonomy above, which contrasts the graph regularization methods with the probabilistic modeling methods: RMN, MLN, RDN; SocialDim, RelationStr; LBC, PL-EM, CC-HLR, RDA.)
How does regularization compare to probabilistic modeling?
Regularization vs. probabilistic modeling (Zeno & Neville, MLG 16). [Figure: AUC vs. network density for the weighted vote relational neighbor and a relational Bayes collective classifier, at different levels of attribute correlation.] Probabilistic modeling can exploit dependencies across a wider range of scenarios; link density impacts performance.
How to learn from one partially labeled network?
Method 1: Ignore unlabeled data during learning. Drop out the unlabeled nodes; training data = the labeled part of the network (labels, attributes, edges). The model $P(\mathbf{Y} \mid \mathbf{X}, E, \Theta)$ is defined via local conditionals; optimize the parameters using the pseudolikelihood,
$\hat{\Theta} = \arg\max_{\Theta} \sum_{v_i \in V_L} \log P_{\Theta}(y_i \mid \mathbf{Y}_{MB_L(v_i)}, \mathbf{x}_i)$,
i.e., a summation over local log conditionals, where $MB_L(v_i)$ is the labeled Markov blanket of $v_i$.
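A minimal sketch of this learning step, assuming the local conditional is a logistic regression over a node's attributes plus one relational feature (the mean label of its labeled neighbors); the helper names and the feature choice are illustrative, not the exact model from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def relational_feature(i, y_soft, adj, mask=None):
    """Aggregate neighbor evidence for node i: mean label/score of its (labeled) neighbors."""
    nbrs = [j for j in adj[i] if (mask is None or mask[j])]
    return np.mean(y_soft[nbrs]) if nbrs else 0.5          # fall back to an uninformative prior

def fit_local_conditional(X, y, adj, labeled_mask):
    """Method 1: maximize the pseudolikelihood over labeled nodes only,
    using only the labeled part of each node's Markov blanket."""
    labeled = np.where(labeled_mask)[0]
    F = np.array([np.concatenate([X[i], [relational_feature(i, y.astype(float), adj, labeled_mask)]])
                  for i in labeled])
    return LogisticRegression().fit(F, y[labeled])          # local conditional P(y_i | MB_L(v_i), x_i)
```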
Method 1 (continued): Apply the learned model to the remainder. Test data = the full network (but evaluate only on the unlabeled nodes). Labeled nodes seed the inference. Use approximate inference (e.g., variational, Gibbs) to collectively classify under $P(\mathbf{Y} \mid \mathbf{X}, E, \Theta)$: for unlabeled instances, iteratively estimate $P_{\Theta}(y_i \mid \mathbf{Y}_{MB(v_i)}, \mathbf{x}_i)$.
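Continuing the sketch above (reusing relational_feature and the fitted classifier), a simple iterative-classification style approximation of this inference step: clamp the labeled nodes and repeatedly re-estimate the unlabeled ones with the learned conditional. The variational and Gibbs procedures referenced on the slide differ in detail.

```python
def collective_inference(clf, X, y, adj, labeled_mask, n_iter=10):
    """Iteratively apply the learned local conditional to the unlabeled nodes.
    Labeled nodes seed the inference; unlabeled nodes start at 0.5."""
    p = np.where(labeled_mask, y.astype(float), 0.5)
    unlabeled = np.where(~labeled_mask)[0]
    for _ in range(n_iter):
        for i in unlabeled:
            nbr = relational_feature(i, p, adj)             # current predictions of all neighbors
            f = np.concatenate([X[i], [nbr]]).reshape(1, -1)
            p[i] = clf.predict_proba(f)[0, 1]               # P(y_i = 1 | Y_MB(v_i), x_i)
    return p
```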
Method 2: Semi-supervised learning. Use the entire network to jointly learn parameters and make inferences about the class labels of unlabeled nodes. Lu and Getoor (ICML 03) use relational features and ICA; McDowell and Aha (ICML 12) combine two classifiers with label regularization.
Semi-supervised relational learning: Relational Expectation Maximization (EM) (Xiang & Neville 08).
Expectation (E) step: predict labels with collective classification, iteratively estimating $P_{\Theta}(y_i \mid \mathbf{Y}_{MB(v_i)}, \mathbf{x}_i)$.
Maximization (M) step: use the predicted probabilities during optimization (in the local conditionals), $\hat{\Theta} = \arg\max_{\Theta} \sum_{Y_U \in \mathcal{Y}^U} P_{\tilde{\Theta}}(Y_U) \cdot \{\text{summation over local log conditionals}\}$.
[Figure: example partially labeled network with nodes A-M.]
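Putting the two previous sketches together gives the E/M alternation described above. This is only a sketch under the same illustrative assumptions; the actual M-step in Xiang & Neville weights the pseudolikelihood by the predicted label distribution rather than simply plugging soft neighbor scores into the features.

```python
def relational_em(X, y, adj, labeled_mask, n_em=5, n_inf=10):
    """Relational EM sketch: alternate collective inference (E-step) with
    refitting the local conditional (M-step) using the predicted probabilities.
    Reuses fit_local_conditional, collective_inference, relational_feature above."""
    clf = fit_local_conditional(X, y, adj, labeled_mask)    # bootstrap from the labeled part
    labeled = np.where(labeled_mask)[0]
    for _ in range(n_em):
        # E-step: collectively classify the unlabeled nodes with the current model.
        p = collective_inference(clf, X, y, adj, labeled_mask, n_inf)
        # M-step: refit on labeled nodes, letting neighbor evidence include the
        # predicted probabilities of unlabeled neighbors.
        F = np.array([np.concatenate([X[i], [relational_feature(i, p, adj)]]) for i in labeled])
        clf = LogisticRegression().fit(F, y[labeled])
    return clf, p
```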
How does relational EM perform? It works well when the network has a moderate amount of labels. If the network is sparsely labeled, it is often better to use a model that is not learned (graph regularization) than relational EM. Why? In sparsely labeled networks, errors from collective classification compound during propagation; both learning and inference require approximation, and the network structure impacts the errors.
Finding: Network structure can bias inference in partially-labeled networks; maximum entropy constraints correct for bias
Effect of relational biases on relational EM: we compared CL-EM and PL-EM and examined the distribution of predicted probabilities on a real-world dataset (Amazon co-occurrence network from SNAP; varied class priors; 10% labeled). [Figure: error in predicted P(+) for RML, CL-EM, and PL-EM relative to the actual prior.] Over-propagation error during inference causes PL-EM to collapse to a single prediction, and this is worse on sparsely labeled datasets. We need a method that corrects the bias for any approach based on local (relational) conditionals.
Maximum entropy inference for PL-EM (Pfeiffer et al., WWW 15): a correction to inference (E-step) that enables estimation with the pseudolikelihood (M-step). Idea: the proportion of negatively predicted items should equal the proportion of negatively labeled items; fix: shift the probabilities up/down. Repeat for each inference iteration:
1. Transform probabilities to logit space: $h_i = \sigma^{-1}(P(y_i = 1))$
2. Compute the offset location: $\lambda = P(0) \cdot |V_U|$
3. Adjust the logits: $h_i = h_i - h_{(\lambda)}$
4. Transform back to probabilities: $P(y_i) = \sigma(h_i)$
[Figure: sorted probabilities with the pivot at $P(y) = 0.5$, e.g., pivot = 5/7.]
The corrected probabilities are used to retrain during PL-EM (M-step).
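A sketch of the correction as described on the slide (the exact tie-handling and per-iteration bookkeeping in Pfeiffer et al. may differ); maxent_correct and its arguments are illustrative names.

```python
import numpy as np
from scipy.special import expit, logit    # sigmoid and its inverse

def maxent_correct(p_unlabeled, neg_fraction):
    """Shift predicted probabilities so the proportion of predicted negatives
    matches the proportion of negatives among the labeled nodes.

    p_unlabeled  : array of predicted P(y_i = 1) for the unlabeled nodes
    neg_fraction : P(0), the fraction of negative labels among labeled nodes
    """
    h = logit(np.clip(p_unlabeled, 1e-6, 1 - 1e-6))         # 1. to logit space
    k = int(round(neg_fraction * len(h)))                   # 2. offset location lambda = P(0) * |V_U|
    pivot = np.sort(h)[min(max(k - 1, 0), len(h) - 1)]      #    logit value at that rank
    return expit(h - pivot)                                 # 3. shift, 4. back to probabilities
```

The corrected probabilities then replace the raw E-step output before the pseudolikelihood refit in the M-step.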
Experimental results - correction effects. [Figure: predicted P(+) on Amazon with a small prior and a large prior, for Actual, RML, CL-EM, PL-EM (Naive), and PL-EM (MaxEntInf).] The maximum entropy correction removes the bias due to over-propagation in collective inference.
Experimental results - large patent dataset. [Figure: BAE vs. proportion labeled ($10^{-3}$ to $10^{-1}$) on the Computers and Organic patent categories, comparing LR, LR (EM), RLR, LP, CL-EM, PL-EM (Naive), and PL-EM (MaxEntInf).] The correction allows relational EM to improve over competing methods in sparsely labeled domains. Note: McDowell & Aha (ICML 12) may correct the same effect, but during estimation rather than inference.
Can neural networks improve predictions by further reducing bias?
Deep collective inference (Moore & Neville, AAAI 17). Approach: use a neural network with neighbors' attributes and class label predictions as inputs. Key ideas: represent the set of neighbors as a sequence, in random order; to deal with heterogeneous inputs (i.e., varying numbers of neighbors), use a recurrent model (LSTM); to learn with a partially labeled network, use semi-supervised collective classification.
Example network. [Figure: nodes a-j; red = target node, blue = neighbors, grey = labeled, white = unlabeled.]
Example network (continued): target node d has neighbors b, e, i, f, a (red = target node, blue = neighbors, grey = labeled, white = unlabeled). Its inputs pair each neighbor's features with either its observed label or its previous prediction: $[f_b, \hat{y}_b^{(t_c-1)}]$, $[f_e, y_e]$, $[f_i, y_i]$, $[f_f, \hat{y}_f^{(t_c-1)}]$, $[f_a, y_a]$, plus the target's own $[f_d, \hat{y}_d^{(t_c-1)}]$, from which the model outputs $\hat{y}_d^{(t_c)}$.
Deep collective inference (DCI): model description. For node $v_i$ at collective inference iteration $t_c$, the input consists of the node's features concatenated with its previous prediction, $[f_i, \hat{y}_i^{(t_c-1)}]$, and each neighbor's features concatenated with its prediction or label, $\{[f_j, (y_j \text{ or } \hat{y}_j^{(t_c-1)})] : v_j \in N_i\}$. The specified input sequence for $v_i$ is
$x_i = [x_i^{(0)}, x_i^{(1)}, \ldots, x_i^{(|N_i|)}] = \big[[f_{j_1}, y_{j_1}], [f_{j_2}, y_{j_2}], \ldots, [f_i, \hat{y}_i^{(t_c-1)}]\big]$.
For the example above (target d): $x_d = [\langle f_b, \hat{y}_b^{(t_c-1)}\rangle, \langle f_e, y_e\rangle, \langle f_i, y_i\rangle, \langle f_f, \hat{y}_f^{(t_c-1)}\rangle, \langle f_a, y_a\rangle, \langle f_d, \hat{y}_d^{(t_c-1)}\rangle] = [x_d^{(0)}, \ldots, x_d^{(5)}]$.
LSTM structure. [Figure: the LSTM processes the sequential inputs $x^{(0)}, x^{(1)}, \ldots, x^{(|N_i|)}$; each cell has $w$ hidden units and receives the $p$ node features plus the label/prediction; the output $\hat{y}$ is read off after the final input of the sequence (the target node itself).]
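A rough sketch of the input construction and LSTM from the last two slides, written in PyTorch (the talk does not prescribe an implementation; the hidden size, helper names, and single-sequence batching are illustrative assumptions).

```python
import random
import torch
import torch.nn as nn

class DCISketch(nn.Module):
    """LSTM over a node's [feature; label-or-prediction] sequence, target node last."""
    def __init__(self, n_feats, hidden=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_feats + 1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, seq):
        # seq: (1, |N_i| + 1, n_feats + 1); the prediction is read off the final hidden state
        _, (h_last, _) = self.lstm(seq)
        return torch.sigmoid(self.out(h_last[-1]))          # P(y_i = 1) for the target node

def build_sequence(i, feats, y_or_pred, neighbors, rng=random):
    """Each neighbor contributes [f_j, y_j or previous prediction]; the order is
    randomized every iteration, and the target node's own pair comes last."""
    order = list(neighbors[i])
    rng.shuffle(order)
    rows = [torch.cat([feats[j], y_or_pred[j].view(1)]) for j in order + [i]]
    return torch.stack(rows).unsqueeze(0)                   # shape (1, |N_i| + 1, n_feats + 1)
```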
Learning: key aspects. Initialize label predictions with a non-collective version of the model, Deep Relational Inference (DRI). Semi-supervised learning: estimate parameters until convergence, then perform collective inference to make predictions for all unlabeled nodes. Randomize the neighbor order on every iteration. Correct for imbalanced classes, either by balancing the objective function or by balancing the data with augmentation. Use backpropagation through time with early stopping and a cross-entropy loss.
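One plausible way to realize the "balance the objective function" bullet, assuming the model emits raw logits: weight the positive class by the inverse class ratio of the labeled training set. The counts below are illustrative, not taken from the talk.

```python
import torch
import torch.nn as nn

n_pos, n_neg = 210, 790                                      # illustrative labeled-class counts
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([n_neg / n_pos]))

# logits  : raw (pre-sigmoid) model outputs for a batch of target nodes
# targets : their 0/1 labels; errors on the rare positive class are up-weighted ~3.8x here
# loss = loss_fn(logits, targets)
```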
Evaluation on small-to-medium sized networks. [Figure: BAE on Amazon DVD with 21/79 and 50/50 class splits, for LR, LP, LR+, RNCC, PLEM, PLEM+, and DCI; lower BAE is better.]
Evaluation on a large network (900K nodes). [Figure: BAE on Patents (17/83 class split), for PLEM, PLEM+, RNCC, and DCI; lower BAE is better.] Overall: 12% gain over PLEM+N2V and 20% gain over RNCC (a competing neural network approach).
How does network structure impact performance of learning and inference?
Impact of network structure on collective classification: Non-stationarity in network structure reduces accuracy (Angin & Neville 08). Propagation error during inference is mediated by local network structure (Xiang & Neville 11). Differences in network structure and label availability produce variance among node marginals, which impairs performance (Xiang & Neville 13). The structure of the rolled-out model can lead to inference instability in collective classification (Pfeiffer et al. 14). The common thread among these effects is a difference in graph distribution between the learning and inference settings, which increases error due to bias in learning and/or inference. Understanding the mechanisms that lead to error can help to improve models and algorithms; e.g., neural networks may help to reduce bias, but at the expense of increased variance.
Thanks neville@cs.purdue.edu www.cs.purdue.edu/~neville