Collective classification in large scale networks. Jennifer Neville Departments of Computer Science and Statistics Purdue University


1 Collective classification in large scale networks Jennifer Neville Departments of Computer Science and Statistics Purdue University

2 The data mining process
[Figure: KDD process pipeline — selection, preprocessing, mining, interpretation/evaluation, turning network data into knowledge]
Network data is: heterogeneous and interdependent, partially observed/labeled, dynamic and/or non-stationary, and often drawn from a single network... thus many traditional methods developed for i.i.d. data do not apply

3 Example predictive models for network domains
Social networks: predict personal preferences from characteristics of friends
Scientific networks: predict paper topics from properties of cited papers
Gene/protein networks: predict protein function from interaction patterns
World wide web: predict content changes from properties of hyperlinked pages
Organizational networks: predict organizational roles and group effectiveness from communication patterns

4 Attribute prediction in networks Given an observed network with node attributes and partially observed labels (e.g., <X, Y_i=1> and <X, Y_j=0>), predict the remaining labels; observed networks exhibit autocorrelation among the labels of linked nodes

5 Network autocorrelation is ubiquitous
Marketing: product/service adoption among communicating customers (Domingos & Richardson 01, Hill et al. 06)
Advertising: on-line brand advertising (Provost et al. 09)
Fraud detection: fraud status of cellular customers who call common numbers (Fawcett & Provost 97, Cortes et al. 01); fraud status of brokers who work at the same branch (Neville & Jensen 05)
Biology: functions of proteins located together in cells (Neville & Jensen 02); tuberculosis infection among people in close contact (Getoor et al. 01)
Movies: box-office receipts of movies made by the same studio (Jensen & Neville 02)
Web: topics of hyperlinked web pages (Chakrabarti et al. 98, Taskar et al. 02)
Business: industry categorization of corporations that share common board members (Neville & Jensen 00); industry categorization of corporations that co-occur in news stories (Bernstein et al. 03)
Citation analysis: topics of coreferent scientific papers (Taskar et al. 02, Neville & Jensen 03)

6 Exploiting network autocorrelation with collective classification Use label propagation/approximate inference to collectively classify unobserved nodes in partially labeled network

7 How do we learn models for collective classification?

8 In practice we have a single partially-labeled network How do we learn a model and make predictions? Within-network learning

9 Main approaches to collective classification
Graph regularization
- No learning: GRF (Zhu et al. 03); wvRN (Macskassy et al. 07)
- Add links: wvRN+ (Macskassy 07); GhostEdges (Gallagher et al. 08); SCRN (Wang et al.)
- Learn weights: GNetMine (Ji et al. 10); LNP (Shi et al.); LGCW (Dhurandhar et al.)
Probabilistic modeling
- Learn from labeled only: RMN (Taskar et al. 02); MLN (Richardson et al. 06); RDN (Neville et al. 06)
- Add graph features: SocialDim (Tang et al. 09); RelationStr (Xiang et al. 10)
- Semi-supervised learning: LBC (Lu et al. 03); PL-EM (Xiang et al. 08); CC-HLR (McDowell et al.); RDA (Pfeiffer et al. 14)

10 Graph regularization perspective

11 Gaussian Random Fields: Zhu, Ghahramani, Lafferty ICML 03
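To make the harmonic-function idea concrete, here is a minimal sketch (not from the slides) of the Gaussian random field solution on a toy graph; the adjacency matrix, node indices, and seed labels are all made up for illustration.

```python
import numpy as np

# Toy undirected graph: 5 nodes, symmetric adjacency matrix (assumed example).
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

labeled = np.array([0, 4])          # indices of labeled nodes
unlabeled = np.array([1, 2, 3])     # indices of unlabeled nodes
f_l = np.array([1.0, 0.0])          # known labels (1 = positive, 0 = negative)

D = np.diag(W.sum(axis=1))          # degree matrix
L = D - W                           # graph Laplacian

# Harmonic solution of Zhu et al.: f_u = (L_uu)^{-1} W_ul f_l
L_uu = L[np.ix_(unlabeled, unlabeled)]
W_ul = W[np.ix_(unlabeled, labeled)]
f_u = np.linalg.solve(L_uu, W_ul @ f_l)

print(dict(zip(unlabeled.tolist(), f_u.round(3))))
```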

12 Same idea applied to relational data to incorporate guilt by association into collective classification (wvRN, Macskassy and Provost, MRDM 03)

13 Weighted-vote Relational Neighbor (wvRN) To find the label of a node v_i, perform a weighted vote among its neighbors N(v_i): P(y_i = c | N(v_i)) = \frac{1}{|N(v_i)|} \sum_{v_j \in N(v_i)} P(y_j = c | N(v_j)) Collective classification iteratively recomputes these probabilities based on the current beliefs of the neighbors until convergence
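A small illustrative sketch of wvRN-style relaxation labeling, assuming a binary class and a hypothetical adjacency-list graph with two seed labels; it simply iterates the neighbor-averaging formula above, clamping the labeled nodes, until the scores stop changing.

```python
import numpy as np

# Hypothetical graph as adjacency lists and a few seed labels (illustrative only).
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
seed = {0: 1.0, 4: 0.0}            # P(y=1) for labeled nodes

# Initialize unlabeled nodes at the prior of the labeled nodes.
prior = np.mean(list(seed.values()))
p = {v: seed.get(v, prior) for v in neighbors}

for _ in range(100):               # relaxation labeling until convergence
    new_p = dict(p)
    for v in neighbors:
        if v in seed:
            continue               # labeled nodes are clamped
        # weighted vote: average of the neighbors' current P(y=1)
        new_p[v] = np.mean([p[u] for u in neighbors[v]])
    delta = max(abs(new_p[v] - p[v]) for v in neighbors)
    p = new_p
    if delta < 1e-6:
        break

print({v: round(p[v], 3) for v in neighbors})
```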

14 Apply model to make predictions collectively Small world graph Labeled nodes: 0% Autocorrelation: 50%

15 Apply model to make predictions collectively Random graph Labeled nodes: 0% Autocorrelation: 50%

16 Extensions
Graph regularization is used in mainstream ML for i.i.d. data: links are added to account for instance similarity. In the network community, the links are given a priori and represent relations among the instances, so work has focused on adding extra links/weights to improve performance:
- Macskassy (AAAI 07) incorporates several types of links, weighting each by assortativity
- Gallagher et al. (KDD 08) add ghost edges based on random walks
- Wang et al. (KDD) weight edges based on social context features

17 Extensions (cont.)
Learning how to weight existing and added links:
- Ji et al. (PKDD 10) learn how to weight meta-paths in heterogeneous networks
- Shi et al. (CIKM) learn how to weight different types of latent links
- Dhurandhar et al. (JAIR) learn how to weight individual links

18 Probabilistic modeling perspective

19 Relational machine learning
(1) Data representation: relational data (e.g., social, scientific, gene/protein, web, and organizational networks)
(2) Knowledge representation: relational models
(3) Objective function
(4) Search algorithm
[Figure: example relational model over Broker (Bk), Branch (Bn), Firm, and Disclosure attributes such as Size, Region, Area, Layoffs, Has Business, Is Problem, Problem In Past, On Watchlist, Year, Type]

20 There has been a great deal of work on templated graphical model representations for relational data: RBNs, PRMs, RMNs, IHRMs, MLNs, DAPER, GMNs, RDNs. Since the model representation is also graphical, we need to distinguish data networks from model networks

21 Data network

22 Gender? Married? Politics? Religion? Data network

23 Data representation: attributed network
[Figure: example network where each node has attributes (e.g., gender F/M, married Y/N, politics D/R) and a class label C/!C]
Estimate the joint distribution P(Y | {X}^n, G) or the conditional distribution P(Y_i | X_i, X_R, Y_R)
Note: we often have only a single network for learning

24 Define structure of graphical model Relational template: Politics_i — Politics_j (for linked nodes i, j); Politics_i — Gender_i; Politics_i — Married_i; Politics_i — Religion_i

25 Model template Relational template with generic variables: Y_i — Y_j for linked pairs; Y_i — X_i for each node's attributes

26 Model template + data network: roll the template out over the data graph

27 Knowledge representation
[Figure: model network — the graphical model rolled out over the data network, with a class variable Y_i and attribute variables X_i for each node]

28 Objective (3) and search (4): e.g., convex function optimization
[Figure: rolled-out model network]
Learn model parameters from a fully labeled network:
P(y_G | x_G) = \frac{1}{Z(\Theta, x_G)} \prod_{T \in \mathcal{T}} \prod_{C \in C(T(G))} \phi_T(x_C, y_C; \theta_T)

29 Apply model to make predictions in another network drawn from the same distribution
[Figure: model template (Y_i — Y_j, Y_i — X_i) rolled out on the test network]
(Assumption: the test network is drawn from the same distribution as the training network)

30 Collective classification uses the full joint, rolled-out model for inference, but labeled nodes impact the final model structure
[Figure: rolled-out model network with labeled nodes highlighted]

31 The structure of rolled-out relational graphical models is determined by the structure of the underlying data network, including the location and availability of labels; this can impact the performance of learning and inference methods via the representation, objective function, and search algorithm

32 Networks are much, much larger in practice and often there's only one partially labeled network

33 Approach #1: Ignore unlabeled during learning
Drop out unlabeled nodes; training data = labeled part of the network (labels, attributes, edges)
The model is defined via local conditionals of P(Y | X, E, \Theta_Y); optimize parameters using the pseudolikelihood:
\hat{\Theta}_Y = \arg\max_{\Theta_Y} \sum_{v_i \in V_L} \log P(y_i | Y_{MB_L(v_i)}, x_i, \Theta_Y)   (summation over local log conditionals)
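A rough sketch of this approach, assuming a relational logistic regression whose local conditional uses the node's attributes plus the fraction of positive labeled neighbors; the graph, attributes, labels, and the relational_features/fit_logreg helpers are all hypothetical, and the optimizer is plain gradient ascent on the log pseudolikelihood.

```python
import numpy as np

def relational_features(v, X, y, neighbors, labeled):
    """Node attributes plus the fraction of *labeled* neighbors that are positive."""
    lab_nbrs = [u for u in neighbors[v] if u in labeled]
    frac_pos = np.mean([y[u] for u in lab_nbrs]) if lab_nbrs else 0.5
    return np.concatenate([X[v], [frac_pos, 1.0]])   # bias term appended

def fit_logreg(F, t, lr=0.1, iters=500):
    """Plain gradient ascent on the log (pseudo)likelihood of a logistic model."""
    w = np.zeros(F.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-F @ w))
        w += lr * F.T @ (t - p) / len(t)
    return w

# Hypothetical data: attributes X, partial labels y, adjacency lists (illustrative).
X = np.random.randn(6, 2)
y = {0: 1, 1: 0, 4: 1}                                  # labeled nodes only
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
labeled = set(y)

# Approach: drop unlabeled nodes, build one local conditional per labeled node.
F = np.array([relational_features(v, X, y, neighbors, labeled) for v in sorted(labeled)])
t = np.array([y[v] for v in sorted(labeled)])
w = fit_logreg(F, t)
print(w)
```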

34 Approach #1: Apply the learned model to the remainder
Test data = full network (but only evaluate on unlabeled); labeled nodes seed the inference
Use approximate inference to collectively classify (e.g., variational, Gibbs) under P(Y | X, E, \Theta_Y); for unlabeled instances, iteratively estimate P(y_i | Y_{MB(v_i)}, x_i, \Theta_Y)
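A matching sketch of the inference step, in the style of iterative classification: labeled nodes are clamped, and each unlabeled node's local conditional is repeatedly re-evaluated using its neighbors' current soft labels. The collective_inference helper, the placeholder parameter vector, and the toy graph are all assumptions for illustration.

```python
import numpy as np

def collective_inference(w, X, y, neighbors, labeled, iters=20):
    """ICA-style inference: labeled nodes seed the process; unlabeled nodes are
    iteratively re-scored from their neighbors' current soft labels."""
    p = {v: float(y[v]) if v in labeled else 0.5 for v in neighbors}
    for _ in range(iters):
        for v in neighbors:
            if v in labeled:
                continue
            frac_pos = np.mean([p[u] for u in neighbors[v]])  # current beliefs
            f = np.concatenate([X[v], [frac_pos, 1.0]])
            p[v] = 1.0 / (1.0 + np.exp(-f @ w))               # local conditional
    return p

# Hypothetical inputs mirroring the previous sketch (weights made up here).
X = np.random.randn(6, 2)
y = {0: 1, 1: 0, 4: 1}
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
w = np.array([0.2, -0.1, 1.5, -0.3])   # placeholder parameters
print(collective_inference(w, X, y, neighbors, set(y)))
```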

35 Approach #2: Add graph features during learning
Training data = labeled nodes in the network; add features that incorporate the unlabeled graph structure
Tang and Liu (KDD 09) find a low-rank approximation of the graph structure, then use it as features
Xiang et al. (WWW 10) use unsupervised learning to model relationship strength and weight the edges
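A minimal sketch of the add-graph-features idea: use the top eigenvectors of the adjacency matrix as latent "social dimension" features appended to the node attributes. This is only in the spirit of SocialDim (Tang and Liu use the modularity matrix and a dedicated solver); the adjacency matrix, attribute matrix, and the choice of k are illustrative assumptions.

```python
import numpy as np

# Hypothetical symmetric adjacency matrix (illustrative only).
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

# Low-rank approximation of graph structure: top-k eigenvectors serve as
# latent "social dimension" features (plain adjacency used here for brevity).
k = 2
vals, vecs = np.linalg.eigh(A)
social_dims = vecs[:, np.argsort(vals)[::-1][:k]]   # n x k feature block

X = np.random.randn(A.shape[0], 3)                  # made-up node attributes
X_aug = np.hstack([X, social_dims])                 # features for any IID classifier
print(X_aug.shape)
```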

36 Approach #3: Semi-supervised learning
Use the entire network to jointly learn parameters and make inferences about the class labels of unlabeled nodes
Lu and Getoor (ICML 03) use relational features and ICA
McDowell and Aha (ICML 12) combine two classifiers with label regularization

37 Semi-supervised relational learning: Relational Expectation Maximization (EM) (Xiang & Neville 08)
E-step: predict labels with collective classification, estimating P(y_i | Y_{MB(v_i)}, x_i, \Theta_Y) for the unlabeled nodes
M-step: \hat{\Theta}_Y = \arg\max_{\Theta_Y} \sum_{Y_U} P_{\Theta_Y}(Y_U) [summation over local log conditionals], i.e., use the predicted probabilities during optimization (in the local conditionals)
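A compact sketch of the relational EM loop under the same hypothetical relational logistic regression as the earlier sketches: the E-step runs a few rounds of collective inference with the current parameters, and the M-step re-fits the parameters on the labeled nodes using the predicted soft labels of their unlabeled neighbors (a simplification of the expectation over Y_U in the slide). All data and helper names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def features(v, X, p, neighbors):
    """Attributes plus mean soft label of neighbors (current beliefs)."""
    return np.concatenate([X[v], [np.mean([p[u] for u in neighbors[v]]), 1.0]])

def relational_em(X, y, neighbors, em_iters=10, gd_iters=200, lr=0.1):
    labeled = set(y)
    p = {v: float(y[v]) if v in labeled else 0.5 for v in neighbors}
    w = np.zeros(X.shape[1] + 2)
    for _ in range(em_iters):
        # E-step: collective inference over unlabeled nodes with current params
        for _ in range(10):
            for v in neighbors:
                if v not in labeled:
                    p[v] = sigmoid(features(v, X, p, neighbors) @ w)
        # M-step: maximize the pseudolikelihood on labeled nodes, using the
        # predicted probabilities of unlabeled neighbors in the local conditionals
        F = np.array([features(v, X, p, neighbors) for v in sorted(labeled)])
        t = np.array([y[v] for v in sorted(labeled)])
        for _ in range(gd_iters):
            w += lr * F.T @ (t - sigmoid(F @ w)) / len(t)
    return w, p

# Hypothetical partially labeled network (illustrative only).
X = np.random.randn(6, 2)
y = {0: 1, 1: 0, 4: 1}
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
w, p = relational_em(X, y, neighbors)
print(w, {v: round(p[v], 2) for v in p})
```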

38 How does relational EM perform? It works well when the network has a moderate amount of labels. If the network is sparsely labeled, it is often better to use a model that is not learned. Why? In sparsely labeled networks, errors from collective classification compound during propagation [Figure: label propagation vs. relational EM]. Both learning and inference require approximation, and network structure impacts the errors

39 Impact of approximations in semi-supervised RL: does over-propagation during prediction happen in real-world, sparsely labeled networks? YES.
[Figure: total positives/negatives predicted across trials vs. the class prior, for relational naive Bayes and relational logistic regression conditionals]

40 Finding #1: Network structure can bias inference in partially-labeled networks; maximum entropy constraints correct for bias

41 Effect of relational biases on R-EM We compared CL-EM and PL-EM and examined the distribution of predicted probabilities on a real-world dataset (Amazon co-occurrence, SNAP; varied class priors, sparsely labeled). [Figure: P(+) for Actual, RML, CL-EM, PL-EM] Over-propagation error during inference causes PL-EM to collapse to a single prediction, and it is worse on sparsely labeled datasets. We need a method to correct this bias for any method based on local (relational) conditionals

42 Maximum entropy inference for PL-EM (Pfeiffer et al. WWW 15)
A correction to inference (E-step) that enables estimation with the pseudolikelihood (M-step). Idea: the proportion of negatively predicted items should equal the proportion of negatively labeled items; fix: shift the probabilities up/down.
Repeat for each inference iteration: transform the probabilities to logit space, h_i = \sigma^{-1}(P(y_i = 1)); compute the offset location (the pivot index, the labeled negative proportion times |V_U|); adjust the logits by subtracting the pivot logit; transform back to probabilities, P(y_i) = \sigma(h_i).
The corrected probabilities are used to retrain during PL-EM (M-step).
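A sketch of the correction as described on this slide (my reading of the pivot computation, so treat the details as an assumption): predicted probabilities are mapped to logits, shifted so that the fraction of unlabeled nodes predicted negative matches the labeled negative proportion, and mapped back.

```python
import numpy as np

def maxent_correction(probs, labeled_labels):
    """Shift predicted probabilities (in logit space) so the proportion of
    negative predictions matches the proportion of negatives among labeled
    nodes -- a sketch of the maximum entropy inference correction."""
    eps = 1e-9
    probs = np.clip(probs, eps, 1 - eps)
    h = np.log(probs / (1 - probs))              # logits of unlabeled nodes
    neg_frac = np.mean(np.array(labeled_labels) == 0)
    k = int(neg_frac * len(h))                   # pivot location
    pivot = np.sort(h)[k] if k < len(h) else np.sort(h)[-1]
    h_adj = h - pivot                            # pivot node moved to P = 0.5
    return 1.0 / (1.0 + np.exp(-h_adj))

# Illustrative: over-propagated predictions collapsed toward the positive class.
raw = np.array([0.93, 0.88, 0.81, 0.77, 0.70, 0.64, 0.52])
labeled_labels = [1, 0, 0, 0, 0]                 # 80% negative among labeled
print(maxent_correction(raw, labeled_labels).round(2))
```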

43 Experimental results - correction effects
[Figure: P(+) for Actual, RML, CL-EM, PL-EM (Naive), and PL-EM (MaxEntInf) on Amazon with small and large class priors]
The maximum entropy correction removes the bias due to over-propagation in collective inference

44 Experimental results - large patent dataset
[Figure: BAE vs. proportion labeled for LR, LR (EM), RLR, LP, CL-EM, PL-EM (Naive), and PL-EM (MaxEntInf) on the Computers and Organic categories]
The correction allows relational EM to improve over competing methods in sparsely labeled domains. Note: McDowell & Aha (ICML 12) may correct the same effect, but during estimation rather than inference

45 Finding #2: Network structure can bias learning in partially-labeled networks; modeling uncertainty corrects for bias

46 Impact of approximations in semi-supervised RL: over-correction also happens during parameter estimation in real-world, sparsely labeled networks.
[Figure: estimated parameters, e.g., P(+|y) and P(-|y) for relational naive Bayes and the weights of relational logistic regression]

47 Impact of approximations in semi-supervised RL
Relational EM. E-step: P_{\Theta_Y}(y_i | Y_{MB(v_i)}, x_i) for all unlabeled instances (collective inference approximation). M-step: \hat{\Theta}_Y = \arg\max_{\Theta_Y} \sum_{\tilde{Y}_U} P_{\Theta_Y}(\tilde{Y}_U) \sum_{v_i \in V_L} \log P_{\Theta_Y}(y_i | \tilde{Y}_{MB(v_i)}, x_i) (pseudolikelihood approximation)
Over-propagation error: both classes neighbor yellow, thus yellow neighbors are not predictive; this produces over-correction in estimation. Accounting for uncertainty in parameter estimates will correct for this bias

48 Relational data augmentation (Pfeiffer et al. ICDM 14)
Data augmentation is a Bayesian version of EM: parameters are random variables and we compute the posterior predictive distribution (Tanner & Wong 87). We developed a relational version of data augmentation: final inference is over a distribution of parameter values; it requires prior distributions over the parameters and sampling methods.
[Table: relational EM uses fixed-point parameters and predictions; relational stochastic EM and relational data augmentation use stochastic updates]
Alternate between a Gibbs sample of the labels, \tilde{Y}_U^t \sim P_{\Theta_Y^t}(Y_U | Y_L, X, E), and a sample of the parameters, \Theta_Y^t \sim P(\Theta_Y | \tilde{Y}_U^t, Y_L, X, E). Final parameters: \hat{\Theta}_Y = \frac{1}{T}\sum_t \Theta_Y^t; final inference: \hat{Y}_U = \frac{1}{T}\sum_t \tilde{Y}_U^t
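A toy sketch of the alternation behind relational data augmentation, not the paper's model: it uses a neighbor-only relational naive Bayes with conjugate Beta priors (no attributes), Gibbs-samples the unlabeled labels given sampled parameters, samples the parameters from their posterior given the current labeling, and averages the label samples to get posterior predictive probabilities. The graph, labels, and priors are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical partially labeled graph (illustrative only); class label in {0,1}.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
observed = {0: 1, 1: 0, 4: 1}
unlabeled = [v for v in neighbors if v not in observed]

def sample_params(y):
    """Sample parameters from their (conjugate Beta) posterior given a full
    labeling y: class prior pi and, per class c, the probability theta[c]
    that a neighbor of a class-c node is positive."""
    pi = rng.beta(1 + sum(y.values()), 1 + len(y) - sum(y.values()))
    theta = []
    for c in (0, 1):
        pos = sum(y[u] for v in y if y[v] == c for u in neighbors[v])
        tot = sum(len(neighbors[v]) for v in y if y[v] == c)
        theta.append(rng.beta(1 + pos, 1 + tot - pos))
    return pi, theta

def gibbs_labels(y, pi, theta):
    """Gibbs-sample each unlabeled node's label from its local conditional."""
    for v in unlabeled:
        like = []
        for c in (0, 1):
            l = pi if c == 1 else 1 - pi
            for u in neighbors[v]:
                l *= theta[c] if y[u] == 1 else 1 - theta[c]
            like.append(l)
        p1 = like[1] / (like[0] + like[1])
        y[v] = int(rng.random() < p1)
    return y

# Relational data augmentation: alternate label and parameter samples,
# then average the label samples over the chain.
y = {**observed, **{v: int(rng.random() < 0.5) for v in unlabeled}}
counts = {v: 0 for v in unlabeled}
T = 200
for t in range(T):
    pi, theta = sample_params(y)
    y = gibbs_labels(y, pi, theta)
    for v in unlabeled:
        counts[v] += y[v]

print({v: counts[v] / T for v in unlabeled})   # posterior predictive P(y_v = 1)
```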

49 Experimental results - Amazon DVD
[Figure: 0-1 loss vs. percentage of graph labeled for relational EM and relational DA, with naive Bayes and logistic regression conditionals]
Relational EM's instability in sparsely labeled domains causes poor performance

50 Experimental results - Facebook
[Figure: MAE vs. percentage of graph labeled for relational EM, stochastic EM (SEM), and relational DA, with relational naive Bayes and logistic regression conditionals]
Relational data augmentation can outperform relational stochastic EM

51 Finding #3: The implicit assumption is that nodes of the same type should be identically distributed, but many relational representations cannot ensure this holds for varying graph structures

52 I.I.D. assumption revisited Current relational models do not impose the same marginal invariance condition that is assumed in IID models, which can impair generalization. [Figure: nodes A-F with differing degrees] p(y_A | x_A) != p(y_E | x_E) due to varying graph structure. The Markov relational network representation does not allow us to explicitly specify the form of the marginal probability distributions, thus it is difficult to impose any equality constraints on the marginals

53 Is there an alternative approach? Goal: combine the marginal invariance advantages of IID models with the ability to model relational dependence, and incorporate node attributes in a general way (similar to IID classifiers). Idea: apply copulas to combine marginal models with a dependence structure. Copula theory: we can construct an n-dimensional vector z with arbitrary marginals while preserving the desired dependence structure — sample (t_1, ..., t_n) jointly from the dependence model, then set z_i = F_i^{-1}(\Phi_i(t_i)), so that z_i is marginally distributed as F_i

54 Let's start with a reformulation of IID classifiers. General form of probabilistic binary classification: p(y_i = 1) = F(\eta(x_i)), e.g., logistic regression. Now view F as the CDF of a distribution symmetric around 0 to obtain a latent variable formulation: z_i is a continuous variable capturing random effects that are not present in x, f is the corresponding PDF of F, p(z_i = z | x_i = x) = f(z - \eta(x_i)), and y_i = sign(z_i). In IID models, the random effect for each instance is independent and can thus be integrated out. When links among instances are observed, the correlations among their class labels can be modeled through dependence among the z's. Key question: how to model the dependence among the z's while preserving the marginals?

55 Copula Latent Markov Network (CLMN) The CLMN model: sample t from the desired joint dependency, (t_1, t_2, ..., t_n); apply the marginal transformation to obtain the latent variable z, z_i = F_i^{-1}(\Phi_i(t_i)); classification: y_i = sign(z_i)
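A small sketch of the copula construction above, with everything (graph, marginal logits, precision matrix, sample size) made up for illustration: correlated latents t are drawn from a Gaussian whose precision follows the graph, each t_i is pushed through its normal CDF and then through the inverse of a logistic marginal centered at the node's IID score eta_i, and labels are taken by sign. Marginally, P(z_i > 0) equals the IID classifier's sigmoid(eta_i), regardless of the node's degree.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

# Hypothetical graph and marginal scores eta_i = w'x_i (all illustrative).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
eta = np.array([0.8, 0.3, -0.2, -1.0])     # marginal logits from an IID classifier

# Dependence model: zero-mean Gaussian with a Laplacian-based precision matrix,
# standing in for the Gaussian Markov network over linked nodes.
precision = np.diag(A.sum(axis=1)) - 0.9 * A + 0.1 * np.eye(len(A))
cov = np.linalg.inv(precision)
cov = (cov + cov.T) / 2                    # symmetrize against round-off
std = np.sqrt(np.diag(cov))

def phi(t):                                 # standard normal CDF
    return 0.5 * (1 + erf(t / sqrt(2)))

def logistic_inv_cdf(u, loc):               # F_i^{-1} for a logistic marginal at loc
    return loc + np.log(u / (1 - u))

# CLMN-style sampling: correlated t -> uniform via each t_i's normal CDF ->
# latent z with the prescribed logistic marginals -> labels by sign.
samples = rng.multivariate_normal(np.zeros(len(A)), cov, size=2000)
u = np.array([[phi(t / s) for t, s in zip(row, std)] for row in samples])
z = logistic_inv_cdf(np.clip(u, 1e-6, 1 - 1e-6), eta)
p_pos = (z > 0).mean(axis=0)                # Monte Carlo estimate of P(y_i = 1)
print(p_pos.round(2))
```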

56 Copula Latent Markov Network (Xiang and Neville, WSDM 13) CLMN implementation: a Gaussian Markov network for the dependence and logistic regression for the marginals. Estimation: first, learn the marginal model as if instances were IID; next, learn the dependence model conditioned on the marginal model... but the GMN has no parameters to learn. Inference: conditional inference in copulas has not previously been considered for large-scale networks; for efficient inference, we developed a message passing algorithm based on EP

57 Experimental results
[Figure: CLMN vs. SocDim, RMN, LR, and GMN on the Facebook, Gene, and IMDB datasets]
Key idea: ensuring that nodes with varying graph structure have identical marginals improves learning

58 Conclusion Relational models have been shown to significantly improve network predictions through the use of joint modeling and collective inference. But we need to understand the impact of real-world graph characteristics on models/algorithms in order to better exploit network information for learning and prediction. A careful consideration of the interactions between data representation, knowledge representation, objective functions, and search algorithms will improve our understanding of the choices/mechanisms that affect performance and drive the development of new algorithms

59 What can network science offer to RML? Consider the effects of graph structure on learning/inference: ML methods learn a model conditioned on the graph, P(Y | {X}^n, G). When is a new graph drawn from the same distribution? How do we model distributions of networks? (see Moreno et al. KDD) [Figure: two rolled-out model networks] If two networks are drawn from different distributions, how can we transfer a model learned on one network to apply it in another? (see Niu et al. ICDMW 15)

60 What can network science offer to ML? How does network structure impact the performance of ML methods? We need to study the properties of the rolled-out model network, comparing the learning setting with the inference setting. ML researchers study the effects of generic graph properties on graphical model inference; we should also consider network characteristics like connected components, density, and clustering. We are starting to study this with synthetically generated attributed networks (see Pfeiffer et al. WWW 14) [Figure: rolled-out model network]

61 Questions?
