How to exploit network properties to improve learning in relational domains
Jennifer Neville, Departments of Computer Science and Statistics, Purdue University
(joint work with Brian Gallagher, Timothy La Fond, Sebastian Moreno, Joseph Pfeiffer, and Rongjing Xiang)
Relational network classification examples:
- Email networks: predict organizational roles from communication patterns
- Gene/protein networks: predict protein function from interaction patterns
- Scientific networks: predict paper topics from properties of cited papers
- World wide web: predict content changes from properties of hyperlinked pages
- Social networks: predict personal preferences from characteristics of friends
- Organizational networks: predict group effectiveness from communication patterns
Network data is: heterogeneous and interdependent, partially observed/labeled, dynamic and/or non-stationary, and often drawn from a single network... thus many traditional ML methods developed for i.i.d. data do not apply.
Machine learning
1. Data representation (chosen)
2. Knowledge representation (chosen)
Generic form is: $y = \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_0$
Machine learning
The knowledge representation defines the model space, which combines with
3. Objective function:
$L_{sq}(D) = \sum_{i=1}^{N_D} (f(x_i) - y_i)^2$
Machine learning
4. Search algorithm (e.g., optimization), combined with the objective function
Learning identifies the model with the maximum objective function value on the training data; the model is then applied for prediction on new data from the same distribution.
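To make the four components concrete, here is a minimal sketch (an illustration, not material from the talk) that pairs the linear knowledge representation with the squared-loss objective and a gradient-descent search; the synthetic data and all constants are assumptions.

import numpy as np

rng = np.random.default_rng(0)

# 1. Data representation: i.i.d. feature vectors with real-valued targets.
X = rng.normal(size=(100, 3))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=100)

# 2. Knowledge representation: a linear model y = beta . x + beta0.
def predict(beta, beta0, X):
    return X @ beta + beta0

# 3. Objective function: squared loss L_sq(D) = sum_i (f(x_i) - y_i)^2.
def squared_loss(beta, beta0, X, y):
    return np.sum((predict(beta, beta0, X) - y) ** 2)

# 4. Search algorithm: gradient descent on the objective.
beta, beta0 = np.zeros(3), 0.0
lr = 0.001
for _ in range(500):
    r = predict(beta, beta0, X) - y   # residuals
    beta -= lr * 2 * (X.T @ r)        # dL/dbeta
    beta0 -= lr * 2 * r.sum()         # dL/dbeta0

print(squared_loss(beta, beta0, X, y))  # near the noise floor after training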
Relational learning
The same four machine learning components (1. data representation, 2. knowledge representation, 3. objective function, 4. search algorithm) now apply to relational data (email, gene/protein, scientific, social, organizational networks, and the world wide web) and relational models.
[Figure: example relational model over a securities domain with Broker (Bk) and Branch (Bn) entities and attributes such as Size, Region, Area, Layoffs, On Watchlist, Has Business Problem, Is Problem In Past Year, and Disclosure Type.]
There has been a great deal of work on templated graphical model representations for relational data: RBNs, PRMs, RMNs, IHRMs, MLNs, DAPER, GMNs, RDNs. Since the model representation is also graphical, we need to distinguish data networks from model networks.
Data network
Data network
[Figure: social network whose nodes carry attributes with unknown values to predict: Gender? Married? Politics? Religion?]
Data representation
Relational learning task: e.g., predict political views based on a user's intrinsic attributes and the political views of their friends.
Estimate the joint distribution $P(Y \mid \{X\}^n, G)$ or the conditional distribution $P(Y_i \mid X_i, X_R, Y_R)$ over the attributed network.
Note: we often have only a single network for learning.
[Figure: attributed network with node attribute values such as gender (M/F), married (Y/N), and politics (D/C).]
Define the structure of the graphical model with a relational template, e.g. a pairwise clique (Politics_i, Politics_j) for linked users plus local cliques (Politics_i, Gender_i), (Politics_i, Married_i), and (Politics_i, Religion_i).
In generic form, the relational template has a pairwise clique $(Y_i, Y_j)$ for linked nodes plus local cliques $(Y_i, X_i^1)$, $(Y_i, X_i^2)$, $(Y_i, X_i^3)$; together these form the model template.
Rolling out: model template + data network. Applying the template cliques $(Y_i, Y_j)$ and $(Y_i, X_i^k)$ across the data network instantiates the full graphical model, as sketched below.
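A small sketch of what rolling out means computationally: the template contributes one (Y_i, Y_j) factor per edge of the data network and one (Y_i, X_i^k) factor per node attribute. The toy network below is a hypothetical example, not data from the talk.

# Roll a relational template out over a (hypothetical) data network.
edges = [(1, 2), (2, 3), (3, 4), (4, 1)]   # data network
num_attrs = 3                              # attributes X^1..X^3 per node
nodes = sorted({v for e in edges for v in e})

factors = []
for (i, j) in edges:                       # relational factor (Y_i, Y_j)
    factors.append(("Y%d" % i, "Y%d" % j))
for i in nodes:                            # local factors (Y_i, X_i^k)
    for k in range(1, num_attrs + 1):
        factors.append(("Y%d" % i, "X%d^%d" % (i, k)))

# The rolled-out model's structure mirrors the data network's structure.
print(len(factors), "factors:", factors[:4], "...")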
Knowledge representation
[Figure: model network (graphical model) rolled out over the data network, with label variables $Y_1, \ldots, Y_8$ and attribute variables $X_i^1, X_i^2, X_i^3$ per node.]
3. Objective function / 4. Search: e.g., convex optimization
Learn the model parameters from a fully labeled network:
$P(y_G \mid x_G) = \frac{1}{Z(\Theta, x_G)} \prod_{T} \prod_{C \in C(T(G))} \phi_T(x_C, y_C; \theta_T)$
[Figure: rolled-out model network over the labeled data network.]
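As a minimal illustration of this objective (assumed log-linear potentials on a tiny hypothetical network, not the talk's exact model), the score of a labeling is the sum of log-potentials over the instantiated cliques minus the log partition function:

import math
from itertools import product

edges = [(0, 1), (1, 2), (2, 3)]   # hypothetical data network
x = [1.0, -0.5, 0.3, 0.8]          # one attribute per node
w_pair, w_local = 0.8, 1.2         # template parameters theta_T

def log_score(y):
    # Unnormalized log-probability: one potential per instantiated clique.
    s = sum(w_pair * (1 if y[u] == y[v] else -1) for (u, v) in edges)
    s += sum(w_local * x[i] * (1 if y[i] == 1 else -1) for i in range(len(x)))
    return s

# Brute-force normalization (feasible only on tiny networks).
logZ = math.log(sum(math.exp(log_score(y)) for y in product([0, 1], repeat=4)))
print("log P(y|x) =", log_score((1, 1, 0, 1)) - logZ)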
Apply the model to make predictions in another network drawn from the same distribution.
[Figure: model template with cliques $(Y_i, Y_j)$ and $(Y_i, X_i^k)$, rolled out over a test network.]
Collective classification uses the full joint, rolled-out model for inference, but labeled nodes impact the final model structure.
[Figure: rolled-out model network with labeled nodes highlighted.]
Collective classification uses the full joint, rolled-out model for inference, but labeled nodes impact the final model structure. The structure of rolled-out relational graphical models is determined by the structure of the underlying data network, including the location and availability of labels. This can impact the performance of learning and inference methods via the representation, the objective function, and the search algorithm.
Networks are much, much larger in practice
Finding 1: Representation
The implicit assumption is that nodes of the same type should be identically distributed, but many relational representations cannot ensure that this holds for varying graph structures.
I.I.D. assumption revisited
Current relational models do not impose the same marginal invariance condition that is assumed for IID models, which can impair generalization: $p(y_A \mid x_A) \neq p(y_E \mid x_E)$ due to varying graph structure.
The Markov relational network representation does not allow us to explicitly specify the form of the marginal probability distributions, so it is difficult to impose any equality constraints on the marginals.
[Figure: network over nodes A-F in which nodes A and E have different degrees.]
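To see the failure of marginal invariance concretely, this self-contained sketch (an illustration, not from the talk) brute-forces the marginals in a small pairwise Markov network with identical node and edge potentials everywhere: a high-degree node and a low-degree node end up with different marginals.

from itertools import product
import math

edges = [(0, 1), (1, 2), (1, 3), (1, 4)]   # node 0 has degree 1, node 1 degree 4
n, w, b = 5, 0.5, 0.3                      # shared coupling w and node bias b

def joint(y):                              # unnormalized p(y), y_i in {-1, +1}
    s = b * sum(y) + w * sum(y[u] * y[v] for (u, v) in edges)
    return math.exp(s)

Z = sum(joint(y) for y in product([-1, 1], repeat=n))

def marginal(i):
    return sum(joint(y) for y in product([-1, 1], repeat=n) if y[i] == 1) / Z

print("p(y_0 = +1) =", round(marginal(0), 3))   # low-degree node
print("p(y_1 = +1) =", round(marginal(1), 3))   # high-degree node: differs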
Is there an alternative approach?
Goal: combine the marginal invariance advantages of IID models with the ability to model relational dependence, and incorporate node attributes in a general way (similar to IID classifiers).
Idea: apply copulas to combine marginal models with a dependence structure. Copula theory shows how to construct an n-dimensional vector with arbitrary marginals while preserving the desired dependence: sample $(t_1, \ldots, t_n)$ jointly from the copula and set $z_i = F_i^{-1}(\Phi_i(t_i))$, so that each $z_i$ is marginally distributed as $F_i$.
Let's start with a reformulation of IID classifiers. The general form of probabilistic binary classification (e.g., logistic regression) is $p(y_i = 1) = F(\eta(x_i))$.
Now view $F$ as the CDF of a distribution symmetric around 0 to obtain a latent variable formulation, where $f$ is the PDF corresponding to $F$:
$z_i \sim p(z_i = z \mid x_i = x) = f(z - \eta(x_i)), \quad y_i = \mathrm{sign}(z_i)$
Here $z_i$ is a continuous variable capturing random effects that are not present in $x$. In IID models the random effect for each instance is independent and can be integrated out. When links among instances are observed, the correlations among their class labels can be modeled through dependence among the $z$'s.
Key question: how do we model the dependence among the $z$'s while preserving the marginals?
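A quick numerical check of the latent-variable formulation (illustrative, assuming a standard logistic F): simulating z = eta + noise with logistic noise and thresholding at zero recovers the closed-form logistic-regression probability.

import numpy as np

rng = np.random.default_rng(1)
eta = 0.7                                  # linear predictor eta(x_i) for one instance

p_closed = 1.0 / (1.0 + np.exp(-eta))      # p(y = 1) = F(eta), F = logistic CDF

eps = rng.logistic(size=1_000_000)         # z = eta + eps, eps ~ standard logistic
p_sim = np.mean(eta + eps > 0)             # y = sign(z)

print(p_closed, p_sim)                     # agree up to Monte Carlo error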
Copula Latent Markov Network (CLMN): couple IID classifiers with a latent dependence model. The CLMN generative model:
1. Sample $t = (t_1, t_2, \ldots, t_n)$ from the desired joint dependency.
2. Apply the marginal transformation to obtain the latent variables: $z_i = F_i^{-1}(\Phi_i(t_i))$. The marginal $\Phi_i$ transforms $t_i$ to a uniform [0,1] r.v. $u_i$, and the quasi-inverse of the CDF $F_i$ (with corresponding PDF $f_i$) maps $u_i$ to $z_i$; the attributes moderate the marginals.
3. Classification: $y_i = \mathrm{sign}(z_i)$.
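The sketch below instantiates these three steps with a Gaussian copula over a four-node example; the correlation matrix R and the logistic marginals centered at hypothetical attribute predictors eta(x_i) are assumptions for illustration, not the exact CLMN configuration.

import numpy as np
from scipy.stats import norm, logistic

rng = np.random.default_rng(2)

# Step 1: sample t = (t_1, ..., t_n) from the desired joint dependency.
R = np.array([[1.0, 0.6, 0.3, 0.1],
              [0.6, 1.0, 0.6, 0.3],
              [0.3, 0.6, 1.0, 0.6],
              [0.1, 0.3, 0.6, 1.0]])
t = rng.multivariate_normal(np.zeros(4), R)

# Step 2: marginal transformation z_i = F_i^{-1}(Phi_i(t_i)).
# Phi maps t_i to a uniform u_i; the quasi-inverse of the logistic CDF,
# centered at the attribute-based predictor eta_i, maps u_i to z_i.
eta = np.array([0.5, -0.2, 0.1, 0.8])   # hypothetical eta(x_i) per node
u = norm.cdf(t)
z = logistic.ppf(u, loc=eta)

# Step 3: classification y_i = sign(z_i).
print(np.sign(z))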
Copula Latent Markov Network (Xiang & Neville, WSDM 2013)
CLMN implementation: a Gaussian Markov network supplies the dependence structure and logistic regression supplies the marginals.
Estimation: first, learn the marginal model as if the instances were IID; next, learn the dependence model conditioned on the marginal model... but the GMN has no parameters to learn.
Inference: conditional inference in copulas had not previously been considered for large-scale networks. For efficient inference, we developed a message-passing algorithm based on expectation propagation (EP).
Experimental results
Key idea: ensuring that nodes with varying graph structure have identical marginals improves learning.
[Figure: classification performance of CLMN vs. SocDim, RMN, LR, and GMN on Facebook, Gene, and IMDB networks.]
Finding 2: Search
The joint graph+attribute space is too large to sample thoroughly, but efficient generative graph models can be exploited to search more effectively.
How can we efficiently generate attributed graph samples from the underlying joint distribution $P(X, Y, G)$? The space is $O(2^{|V|^2 + |V| p})$ (all possible edge sets times all possible binary attribute settings), so effective sampling from the joint is difficult.
Naive sampling approach: assume independence between the graph and the attributes:
$P(X, E \mid \Theta_E, \Theta_X) = P(E \mid \Theta_E)\, P(X \mid \Theta_X)$
where $P(E \mid \Theta_E)$ is the graph model and $P(X \mid \Theta_X)$ is the attribute model.
Problem with the naive approach: although graph structure can be captured by generative graph models, naively pairing it with attribute samples does not capture the relational correlation.
[Figure: frequencies of edge attribute-value combinations (e.g., 0-0, 0-1, 1-1) in the original vs. sampled graphs.]
Solution: use the graph model to propose edges, but sample conditionally on the node attribute values:
$P(X, E \mid \Theta_E, \Theta_X) = P(E \mid X, \Theta_E, \Theta_X)\, P(X \mid \Theta_X)$
Use an accept-reject process to sample edges from the graph model conditioned on the attributes drawn from the attribute model.
Exploit the efficient generative graph model as a proposal distribution to search effectively.
What should we use as acceptance probabilities? The ratio of the attribute-pair probabilities observed in the original data to the probabilities resulting from the naive approach. This corresponds to rejection sampling with
proposing distribution $P_E(E_{ij} = 1 \mid \Theta_E)$ and true distribution $P_o(E_{ij} = 1 \mid f(x_i, x_j), \Theta_E, \Theta_X)$.
[Figure: edge attribute-value combination frequencies in the original vs. sampled graphs.]
Attributed graph models (Pfeiffer, La Fond, Moreno, Neville & Gallagher, WWW 2014)
1. Learn the attribute and graph models.
2. Generate a graph with the naive approach.
3. Compute the acceptance ratios.
4. Sample attributes, then draw edges by accept-reject:

while not enough edges:
    draw (vi, vj) from Q (the graph model)
    U ~ Uniform(0, 1)
    if U < A(xi, xj):
        put (vi, vj) into the edges
return edges

[Figure: possible edges among a small set of attributed nodes, and the acceptance ratios per attribute-value combination.]
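For concreteness, here is a runnable version of that loop, using a Chung-Lu style degree-based proposal Q and a made-up acceptance table A(x_i, x_j); the attributes, degrees, and acceptance values are illustrative assumptions, not the paper's learned models.

import random

random.seed(0)

attrs   = {0: 1, 1: 1, 2: 0, 3: 0, 4: 1, 5: 0}   # hypothetical binary attribute
degrees = {0: 3, 1: 2, 2: 2, 3: 1, 4: 2, 5: 2}   # target degrees for Q

# Acceptance ratios per attribute-value pair: the ratio of the pair
# frequencies in the original graph to those the naive approach produces,
# rescaled into [0, 1]. These numbers are made up for illustration.
A = {(0, 0): 0.9, (1, 1): 1.0, (0, 1): 0.3, (1, 0): 0.3}

nodes = list(degrees)
weights = [degrees[v] for v in nodes]

def propose():
    # Draw a candidate edge from Q: endpoints proportional to degree.
    vi, vj = random.choices(nodes, weights=weights, k=2)
    return (vi, vj) if vi != vj else propose()

target_edges, edges = 6, set()
while len(edges) < target_edges:           # "while not enough edges"
    vi, vj = propose()
    u = random.random()                    # U ~ Uniform(0, 1)
    if u < A[(attrs[vi], attrs[vj])]:      # accept conditioned on attributes
        edges.add((min(vi, vj), max(vi, vj)))

print(sorted(edges))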
Theorem 1: AGM samples from the joint distribution of edges and attributes:
$P(E_{ij} = 1 \mid f(x_i, x_j), \Theta_E, \Theta_X)\, P(x_i, x_j \mid \Theta_X)$
Corollary 1: The expected AGM degree equals the expected degree of the structural graph model.
Empirical results on Facebook data
Key idea: statistical models of graphs can be exploited to improve sampling from the full joint $P(E, X \mid \Theta_E, \Theta_X)$.
AGM preserves the characteristics of the underlying graph model, and it captures the attribute correlation (here, on political views) that the naive variants miss.
[Figure: edge correlation of political views for the original Facebook network vs. AGM-FCL, AGM-TCL, two AGM-KPGM variants, FCL, TCL, and two KPGM variants.]
Relational learning
1. Data representation
2. Knowledge representation: representations affect our ability to enforce invariance assumptions
3. Objective function: conventional objective functions do not behave as expected in partially labeled networks (not in this talk)
4. Search algorithm: simpler (graph) models can be used to statistically prune the search space
Conclusion
Relational models have been shown to significantly improve predictions through the use of joint modeling and collective inference. But since the (rolled-out) model structure depends on the structure of the underlying data network, we need to understand how the data graph affects model and algorithm characteristics in order to better exploit relational information for learning and prediction.
A careful consideration of the interactions between data representation, knowledge representation, objective function, and search algorithm will improve our understanding of the mechanisms that impact performance, and this will form the foundation for improved algorithms and methodology.
Thanks to alumni: Hoda Eldardiry, Rongjing Xiang, Chris Mayfield, Karthik Nagaraj, Umang Sharan, Sebastian Moreno, Nesreen Ahmed, Hyokun Yun, Suvidha Kancharla, Tao Wang, Timothy La Fond, Joel Pfeiffer, Ellen Lai, Pablo Granda, Hogun Park
Questions? neville@cs.purdue.edu | www.cs.purdue.edu/~neville