Randomization Tests for Distinguishing Social Influence and Homophily Effects

Size: px
Start display at page:

Download "Randomization Tests for Distinguishing Social Influence and Homophily Effects"

Transcription

1 andomization Tests for Distinguishing Social Influence and Homophily Effects Timothy La Fond and Jennifer Neville Computer Science Department Purdue University West Lafayette, IN [tlafond ABSTACT elational autocorrelation is ubiquitous in relational domains. This observed correlation between class labels of linked instances in a network (e.g., two friends are more likely to share political beliefs than two randomly selected people) can be due to the effects of two different social processes. If social influence effects are present, instances are likely to change their attributes to conform to their neighbor values. If homophily effects are present, instances are likely to link to other individuals with similar attribute values. Both these effects will result in autocorrelated attribute values. When analyzing static relational networks it is impossible to determine how much of the observed correlation is due each of these factors. However, the recent surge of interest in social networks has increased the availability of dynamic network data. In this paper, we present a randomization technique for temporal network data where the attributes and links change over time. Given data from two time steps, we measure the gain in correlation and assess whether a significant portion of this gain is due to influence and/or homophily. We demonstrate the efficacy of our method on semi-synthetic data and then apply the method to a real-world social networks dataset, showing the impact of both influence and homophily effects. Categories and Subject Descriptors H.4.m [Information Systems]: Miscellaneous General Terms Algorithms, Design Keywords Social networks, randomization, homophily, social influence 1. INTODUCTION Autocorrelation is a common characteristic of relational and social network datasets, which refers to a statistical dependency between the values of the same variable on related entities. For example, friends are more likely to share political views than randomly selected pairs of individuals. The presence of autocorrelation offers a unique opportunity to improve predictive models because inferences about one object can be used to improve inferences about related objects. Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2010, April 26 30, 2010, aleigh, North Carolina, USA. ACM /10/04. Indeed, recent work in relational learning has exploited this property in the development of collective inference models, which can make more accurate predictions by jointly inferring class label values throughout a network (see e.g., [5, 18, 24]). In addition, the gains that collective model achieve over conditional models (which reason about each instance independently) increase as autocorrelation levels increase in the data [10]. A number of widely occurring phenomena give rise to autocorrelation dependencies. Social phenomena, including social influence [13], diffusion processes [7], and the principle of homophily [15], can cause autocorrelated observations through their influence on social interactions that govern the data generation process. Alternatively, a hidden condition or event, whose influence is correlated among instances that are closely located in time or space, can produce autocorrelated observations through joint influence on link and attribute changes [17, 2]. A key question for understanding and exploiting behavior in social network domains is to determine the root cause of observed autocorrelation. Since autocorrelation is the primary motivation to use relational and network models over conventional machine learning techniques, it stands to reason that a better understanding of the causes of autocorrelation will inform the development of improved models and learning algorithms. For example, although previous work in relational learning and statistical network analysis has focused primarily on static graphs, recent efforts have turned to the analysis of dynamic networks and development of temporally-evolving models (e.g., [9, 21]). In order to deal with the enormous increase in dimensionality associated with modeling both temporal and relational dependencies, these methods restrict the set of dependencies that they consider (e.g., through choice of model form). The ability to accurately distinguish which temporal-relational patterns (e.g., homophily) occur in real-world datasets will ensure that researchers can include the most promising set of dependencies in their restricted set of patterns. esearch in social psychology and sociology has developed two main theories of social processes that can indicate why autocorrelation is often observed in social systems. Social influence refers to processes in which interactions with others causes individuals to conform (e.g., people change their attitudes to be more similar to their friends). Homophily refers to processes of social selection, where individuals are more likely to form ties with similar individuals (e.g., people choose to be friends with people who share their beliefs). Both homophily and social influence can produce autocorre-

2 lation, since their outcome results in linked individual sharing attribute values. In this work we focus on the task of differentiating between influence and homophily effects and determining, from the observed autocorrelation dependencies, whether the effects are significant. ecently, there have been a number of empirical studies that investigate (and model) either social influence or homophily effects in real-world datasets (e.g., [4, 22, 6]. However, these efforts have focused primarily on demonstrating the presence of homophily and influence they do not provide the means to estimate effects sizes from data or determine whether the effects are statistically significant. Exceptions include the work of Snijders et al. [23], Anagnostopoulos et al. [1], and Aral et al. [3]. Snijders et al. [23] develop a time-evolving exponential random graph model that can represent homophily and influence effects. Their method support hypothesis tests for each effect, but the applicability of the approach is limited by the suitability of the model form (e.g., random graph model, Markov assumption). On the other hand, the recent work of Anagnostopoulos et al. [1] presents a model-free approach to assessing influence effects with randomization tests. The limitation of their framework, however, is an assumption that that the network structure (i.e., links) does not change over time, thus they cannot distinguish homophily effects. Aral et al. [3] correct this issue with a development of matched sample estimation framework that accounts for homophily effects, but the method uses additional node behaviors and characteristics in the matching process, so it will have limited applicability in data with few observed attributes and/or time steps. In this paper, we outline a more general randomization framework for datasets where both attribute values and links change over time, where changes can consist of either additions or deletions. Our aim is to determine the significance of each effect and to distinguish the contribution of influence and homophily effects. We outline a randomization test based on randomization of action choices. We consider the gain in correlation over one time step in the graph and assess the amount of gain that is due to each of the effects. The randomization procedure then produce an empirical sampling distribution of expected gains under the null hypothesis (that there is no influence and/or homophily effect) and if the observed gain is greater than expected under the null, we can conclude there is a significant influence/homophily effect. We evaluate our proposed method on semi-synthetic social network data, showing that the test has low Type I error (i.e., it does not incorrectly conclude there is an effect when in fact the data are random) and high power when the data exhibit sufficient change over time (i.e., they correctly conclude there is an effect when there is one). We then apply our method to a real-world dataset to investigate the aspects of observed autocorrelation. Our analysis of a public university Facebook network shows that autocorrelation in group memberships is due to significant influence and homophily effects, yet different groups exhibit different types of behavior. 2. POBLEM DEFINITION In this work, we consider relational data represented as an undirected, attributed graph G = (V, E), with V (nodes) representing objects and E (edges) representing relationships. The nodes V represent objects in the data (e.g., Graph(t) Graph(t+1) Graph(t+2) Influence Influence Attributes(t) Homophily Attributes(t+1) Homophily Attributes(t+1) Figure 1: Illustration of homophily and influence affect on attributes and links over time. people) and the edges E represent relationships (e.g., friendships) between pairs of objects (e ij : v i and v j are friends). Each node v V and has a number of associated attributes X v = (X v 1,..., X v m) (e.g., age, gender). We assume that both the attributes and links may vary over time. First, attribute values may change at each time step t: X t = {X v t } = {(X v 1t,..., X v mt)}. Second, relationships may change at each time step. This results in a different data graph G t = (V, E t) for each time step t, where the nodes remain constant but the edge set may vary (i.e., E t E t for some t, t ). Figure 1 illustrates influence and homophily dependencies. If there is a significant influence effect then we expect the attribute values in t + 1 will depend on the link structure in t. On the other hand, if there is a significant homophily effect then we expect the link structure in t + 1 will depend on the attributes in t. If either influence or homophily effects are present in the data, the data will exhibit relational autocorrelation at any given time step t. elational autocorrelation refers to a statistical dependency between values of the same variable on related objects it involves a set of related instance pairs, a variable X defined on the nodes in the pairs, and it corresponds to the correlation between the values of X on pairs of related instances. Any traditional measure of association, such as Pearson s correlation coefficient or information gain, can be used to assess the association between these pairs of values of X. In this work, we use the chi-square statistic. Definition 1. elational Autocorrelation Let P = {(v i, v j) : e ij E} be a set of related instance pairs in G. Let X be a binary attribute defined on the nodes V. Then we compute the relational autocorrelation of X in G with the following contingency table T : X i = X j = x (X i = X j = x) (v i, v j) P a b (v i, v j) P c d We define relational autocorrelation as the chi-square statistic that is computed from T (with dof=1): C(X, G) = χ 2 = (ad cb) 2 N (a + b)(c + d)(b + d)(a + c) where N = a + b + c + d, is the total count of all cells in T. The first column of the contingency table counts pairs of nodes that both have the same value for attribute X. The second column counts pairs of nodes that do not match on

3 X. The first row counts pairs of nodes that are related in G. The second row counts pairs of nodes that are not linked in G. Note that this contingency table encompasses every possible combination of nodes in the graph and thus has a stable size even when the total number of links change in the graph (i.e., it doesn t depend on the size of E). To measure the autocorrelation between attributes and relationships at time t, we compute the chi-square statistic C(X t, G t) from the graph G t using the attribute values in X t. Using a similar table for the attributes and links in t+1, we can compute C(X t+1, G t+1). Now that we have a method of computing the correlation at any time step, we can can use correlation gain to assess the effects of homophily and influence, where the observed correlation gain G from t to t + 1 is: gain(t, t + 1) = C(X t+1, G t+1) C(X t, G t) The gain in correlation from one time step to the next can be due to: (1) homophily gains due changes in the graph structure in t + 1, or (2) influence gains due to changes in attributes in t + 1. To show this, we will define influence and homophily and show how they impact the chi-square statistic from one time step to the next. Homophily is typically used to refer to the general tendency of people to associate with similar others (see e.g., [15]) we operationalize this in the following way to investigate whether attribute similarity influences choice of friends (i.e., are friendships formed based on a pair s attribute similarity). Definition 2. Homophily Let X t and X t+1 be the attribute values at time t and t + 1 respectively. Let P (t) and P (t+1) be the related pairs at time t and t+1 respectively. Let L ij t+1 refer to the case when pair (v i, v j) form a link at time t + 1 (i.e., (v i, v j) / P (t) and (v i, v j) P (t+1) ). Let U ij t+1 refer to the case when pair (v i, v j) drops a link at time t + 1 (i.e., (v i, v j) P (t) and (v i, v j) / P (t+1) ). Then a dataset exhibits homophily if the following hold: p(l ij t+1 (X i t = X j t = x)) > p(l ij t+1 (X i t = X j t = x)) p(u ij t+1 (X i t = X j t = x)) > p(u ij t+1 (X i t = X j t = x)) In other words, the probability of link formation over time (from t to t + 1) is higher for pairs with matching attribute values and the probability of link dissolution over time is higher for non-matching pairs. Social influence is typically used to refer to the general case of a person s behavior being influenced by others (see e.g., [14]) we operationalize this in the following way to investigate whether a person s friends influence their intrinsic attributes (i.e., are attribute values changed to match one s friends). Definition 3. Social Influence Let X t and X t+1 be the attribute values at time t and t + 1 respectively. Let P (t) and P (t+1) be the related pairs at time t and t + 1 respectively. Let A ij t+1 refer to the case when pair (v i, v j) change their attribute values to agree at time t + 1 (i.e., (Xt i = X j t = x) and (Xt+1 i = X j t+1 = x)). Let D ij t+1 refer to the case when pair (vi, vj) change their attribute values to diverge at time t+1 (i.e., (Xt i = X j t = x) and (Xt+1 i = X j t+1 = x)). Then a dataset exhibits social influence if the following hold: p(a ij t+1 (v i, v j) P (t)) > p(a ij t+1 (v i, v j) / P (t)) p(d ij t+1 (v i, v j) / P (t)) > p(d ij t+1 (v i, v j) P (t)) In other words, the probability of agreement over time (from t to t + 1) is higher for related pairs and the probability of disagreement over time is higher for unrelated pairs. Given these definitions we can show that homophily and influence will result in a correlation gain over time. Theorem 1. Influence Gain Let X t and X t+1 be attribute values at time t and t + 1 respectively and let P (t) be the related pairs at time t. Let k = A ij t+1 be the number of agreements from t to t + 1. Let m = D ij t+1 be the number of disagreements from t to t + 1 and let k = m. Then if an influence effect is present in the data, the autocorrelation will increase when we consider the attribute changes from time t to time t + 1: C(X t+1, G t) > C(X t, G t) The proof of this theorem is included in Appendix A. Theorem 2. Homophily Gain Let P (t) and P (t+1) be the set of related nodes at time t and t + 1 respectively and let X t be the attributes at time t. Let k = L ij t+1 be the number of link additions from t to t+1. Let m = U ij t+1 be the number of link dissolutions from t to t + 1 and let k = m. Then if a homophily effect is present in the data, the autocorrelation will increase when we consider the link changes from time t to time t + 1: C(X t, G t+1) > C(X t, G t) The proof of this theorem follows the same form and argument as for Theorem 1. Now that we have illustrated the connection between gains in autocorrelation and influence/homophily, we can define the correlation decomposition problem as follows. Given an observed network change over two time steps, G t, X t, G t+1, X t+1, determine whether (1) the attribute changes from t to t + 1 exhibit a significant amount of influence, and (2) the link changes from t to t + 1 exhibit a significant amount of homophily. In section 4, we will outline a novel randomization test to separate these effects and assess their significance. 3. ELATED WOK Current approaches relevant to this work fall into three categories: empirical investigation and modeling of social influence and homophily effects, significance tests for relational and social network data, and modeling techniques for distinguishing homophily and influence effects. esearchers in social psychology and sociology have studied social influence and homophily for much of the past forty years (see [14] for an extensive review). This work has focused primarily on developing theory about underlying psychological processes such as persuasion, conformity, assimilation, and selection. Experimental investigation is typically performed in smaller-scale laboratory environments, with individual or dyad-level analysis. Consequently, the modeling techniques do not need to model the interdependence among

4 individuals, nor do they need to be applicable for large-scale networks of thousands of nodes. ecently, with the surge of interest in online social networks and relational data, there has been a growing body of research in the data mining community that has focused on modeling network and behavior change over time. For example, Backstrom et al. [4] investigated the evolution of network structure and group membership in MySpace and LiveJournal and showed that homophily can be used to improve predictive models of group membership. Singla and ichardson [22] investigate the correlation between individual search topics among people that interact using instant messaging, and show that not only does a correlation exist but that it increases with the amount of time the users communicate. Crandell et al. [6] study the temporal evolution of link structure and attribute similarity in Wikipedia and propose a mathematical model that includes both influence and homophily effects to predict future behavior in the network. The majority of this recent work is empirically-based focused primarily on demonstrating the presence of homophily and influence in real-world data. The proposed methods and analysis techniques do not provide the means to estimate effects sizes from data or determine whether the effects are statistically significant. There has been some work on developing significance tests for social network and relational domains, but this work has focused primarily on static networks, so the null models clearly limit the types of conclusions that can be drawn from the analysis. For example, Milo et al. [16] generate randomized network structures while holding node degree constant to assess whether subgraph motifs are observed a significantly higher frequency than would be expected due to random chance. Karrer et al. [12] consider the significance of community structure in the network, by using network perturbations to assess the variance of community patterns. Jensen et al. [11] propose attribute-based randomization tests to accurately assess the significance of feature association in the presence of relational autocorrelation. Eldardiry and Neville [8] develop resampling procedures for attributed networks that can also be used to assess variance of feature scores in static networks. In addition, work in sociology on exponential random graph models (see e.g., [20]) includes methods for determining the significance of parameter estimates learned from data (again on static networks). The work most relevant to our proposed method is that of Snijders et al. [23], Anagnostopoulos et al. [1], and Aral et al. [3]. Snijders et al. [23] extend the exponential random graph models to incorporate time evolution (in both attributes and links) with a Markov model assumption (i.e., attribute/links at one time step depend only the previous time step). The method facilitates the inclusion of social influence and homophily dependencies in the model and generalized Neyman-ao score tests, based on methods of moments estimators, are used to test hypotheses of the form: H 0 : θ i = 0, H 1 : θ i 0. The main limitation of this approach is that it is model-based so the accuracy of hypothesis tests will be impacted by the suitability of the model form (e.g., random graph model, Markov assumption). In contrast to the work of Snijders et al. [23], our method is a data-driven, model-free approach based on randomization tests. The recent work of Anagnostopoulos et al. [1] also outlines a randomization procedure for assessing whether an observational datasets exhibits a significant influence effect. However, their framework assumes that the network structure (i.e., links) does not change over time, thus they do not present a method for assessing whether a significant homophily effect is present as well. In particular, their timestamp shuffling test requires a longitudinal view of the evolution of the data. To estimate the effects of social influence, they compare the number of people the adopt a trait, given they have a neighbors with that trait already, to the number of people that do not adopt, given the same a neighbors with the trait. This calculation requires future knowledge about who will not adopt a trait. Otherwise, if the comparison is made using a single time step, the vast majority of users will not have adopted and the effect will lost in the noise. Furthermore, their method only considers data where attributes are added over time. Concurrent work by Aral et al. [3] has corrected the main limitation of [1] in their development of a matched sample estimation framework, which accounts for homophily effects as well as influence. However, their method uses additional node behaviors and characteristics in the matching process, so it will have limited applicability in data with few observed attributes and/or time steps. In this work, we outline a general randomization framework where both attribute values and links change over time, changes can consist of additions or deletions, and focus on assessing both influence and homophily effects in data with few available time slices. 4. METHOD andomization tests are a model-free, computationallyintensive statistical technique for hypothesis testing [19]. The tests generate many replicates of an actual data set typically called pseudosamples and uses the pseudosamples to estimate a score distribution. Pseudosamples are generated by randomly reordering (or permuting) the values of one or more variables in an actual data set. A score is then calculated for each pseudosample, and the distribution of these randomized scores is used to estimate a sampling distribution for the score statistic under the null hypothesis. The value of the observed score on the original data is then compared to the distribution of scores on the randomized pseudosamples, and if it is significantly higher (or lower) than this distribution, the observed score will be deemed significant. In contrast to conventional hypothesis tests, randomization tests make a relatively small number of assumptions about the data. For example, randomization tests make no assumptions about the form of the distributions from which variable values are drawn. In addition, they can be used to form sampling distributions for estimators whose precise statistical properties are not known. The key issue in developing a randomization test is to formulate an appropriate null hypothesis and permute the data in a way that accurately reflects the null hypothesis. Tables 1-2 outline the specific significance tests that we use throughout this work. The significance tests determine whether an observed gain in autocorrelation is significant by using a randomization test to estimate an empirical sampling distribution of gains that would be expected if change in links (attributes) is random and thus not due to homophily (influence). The empirical sampling distribution is estimated from the gains observed in pseudosamples generated by the random-

5 HomophilySigT est(g t, G t+1, X t, X t+1, numiters, α) gains = // compute original gain gain O = C(X t, G t+1) C(X t, G t) // randomize For iter in 1.. numiters G t+1 = andomize(g t, G t+1) Compute gain r = C(X t, G t+1) C(X t, G t) gains = gains {gain r} // test significance of gain If gain O > 1 α critical value of gains 2 eturn significant/positive Else if gain O < α critical value of gains 2 eturn significant/negative Else eturn not significant Table 1: Homophily significance test method InfluenceSigT est(g t, G t+1, X t, X t+1, numiters, α) gains = // compute original gain gain O = C(X t+1, G t) C(X t, G t) // randomize For iter in 1.. numiters X t+1 = andomize(x t, X t+1) Compute gain r = C(X t+1, G t) C(X t, G t) gains = gains {gain r} // test significance of gain If gain O > 1 α critical value of gains 2 eturn significant/positive Else if gain O < α critical value of gains 2 eturn significant/negative Else eturn not significant Table 2: Influence significance test method ization procedure. The method compares the observed gain value to the empirical sampling distribution, if the value is higher than (1 α )% (or lower than α %) of the scores observed in the randomized data, the gain is deemed to be 2 2 significant and the null is rejected. The gain in autocorrelation from one time step to the next can be due to: (1) homophily gains due to friend changes in t + 1, or (2) influence gains due to changes in attributes in t + 1. To separate the effects of influence and homophily, we define two different randomization tests to use as the andomize( ) method inside the significance test. The key issue in developing a randomization test is to formulate an appropriate null hypothesis and permute the data in a way that accurately reflects the null hypothesis. We formulate three null hypotheses with respect to homophily and influence: H H 0 : link changes are random and are not due to attribute values in t (i.e., no homophily effect) H I 0 : attribute changes are random and are not due to friends in t (i.e., no social influence effect) H F 0 : both attribute and link changes are random (i.e., no homophily nor influence effect) To identify possible permutations for these null hypotheses, we consider four types of data changes that can occur in the data from time t to time t + 1: Edge additions: + E = {eij Et+1 eij Et} Edge deletions: E = {eij Et eij Et+1} Attribute additions: + X = {xv X t+1 x v X t} Attribute deletions: X = {xv X t x v X t+1} Note that attribute value changes can be easily modeled as an addition/deletion pair. Clearly, homophily will impact edge changes and influence will affect attribute changes. For the null hypothesis concerning homophily (H0 H ), we want to randomize the edge changes to remove any association with attribute values in t. To do this, we can randomize the choice of edge target so that it does not depend on the attributes of the source node. For example, if node i adds a link to node j at time t + 1, then we can maintain the edge addition in t + 1 but randomize the choice of target node j to replace e ij with e ij so that any association of attribute similarity between i and j is destroyed. However, to ensure that the degree of j remains the same after randomization, j must have been part of an edge addition e kj in the original set. The randomization procedure for edges can be thought of as swapping the endpoints of edge additions/deletions such that each node will have the same number of additions and deletions in the randomized set, but the partner of those links will have changed. andomization for the null hypothesis concerning influence (H0 I ) will follow a similar procedure by swapping attribute adoptions and abandonments between nodes, removing any influence of edges in t. If node i adds a attribute value x at time t + 1, then we can maintain the attribute addition in t+1 but randomize the choice of value to replace x with x so that any similarity of the attribute value x with the attribute values of i s linked friends in t is destroyed. We call the procedures based on this form of randomization choice-based methods, since they randomize the results of choices (attribute/link changes). Tables 3-4 outline the specifics of the choice-based method for H0 H (homophily) and H0 I (influence) respectively. We can combine the two methods, randomizing both the attribute and link changes to estimate a distribution for H0 F. However, calculating choice-based randomizations are nontrivial. A particular target edge or attribute can be selected for swapping only if it has not been selected before, and nodes cannot add edges or attributes if they had them in time step t, nor can they drop edges or attributes if they lack them in t. Given these sets of constraints, it may be difficult to find a valid random assignments (apart from the original), which is problematic for a test that depends on generating a distribution of random pseudosamples. We address this issue by taking a greedy assignment approach. First, we collate the edge and attribute changes such that all additions and deletions for a node or attribute can be decided at once. Then, we sort the nodes and attributes from those with the least number of random options to those with the largest number of random options. andom options here refers to the amount of freedom the node has when selecting additions or deletions and is given by the number of available selections minus the number of assignments needed. This value can be calculated for edge

6 andomize hom choice(g t, G t+1) + E = {eij Et+1 eij Et} (added links in t + 1) E = {eij Et eij Et+1} (dropped links in t + 1) // targets for random selections T + = + E T = E // randomize For e ij + E andomly select e kj T +, where j Et i eplace e ij in G t+1 with e ij emove e kj from T + For e ij E andomly select e kj T, where j Et i Add e ij to G t+1 emove e ij from G t+1 emove e kj from T eturn (G t+1) Table 3: Choice-based randomization method for assessing homophily andomize inf choice (Xt, Xt+1) + X = {xv X t+1 x v X t} (added attributes in t + 1) X = {xv X t x v X t+1} (dropped attributes in t + 1) // targets for random selections T + = + X T = X // randomize For x v + X andomly select x u T +, where x u Xt u eplace x v in X t+1 with x v emove x u from T + For x v X andomly select x u T, where x u Xt u Add x v to X t+1 emove x u from X t+1 emove x u from T eturn (X t+1) Table 4: Choice-based randomization method for assessing influence additions by e kj T : j Et i e ij, and similar values can be calculated for the deletions, as well as attribute cases. The assumption in the greedy approach is that nodes and attributes with many random options, at the start, are unlikely to run out of available options even if their assignments are decided later in the algorithm. However, the greedy approach cannot guarantee that a node or attribute will have a valid random option at the point in the algorithm in which it is assigned. If this is the case, the particular node or attribute will retain its original assignment. This will not preserve the degree of nodes and attributes in the original data (since those original assignments may already have been given to others in the randomization). However, since the assignment for that node/attribute is identical to the original, it will prevent Type I errors from occurring. Ideally, after assigning additions and deletions to a node or attribute we could recompute the random options available to the remaining nodes/attributes and resort the list of edges/deletions. However, recomputing in this manner during the algorithm is computationally very expensive, and did not provide an increase in performance in practice, so we do not consider it further. Finally, in addition to determining the significance of each effect, the sampling distributions for the randomization tests may also provide an estimate of the expected gain for each effect independently (where the full randomization is used to remove any joint effects from the estimation of gains due to homophily and influence): E[gain hom ] = µ gains I µ gains F E[gain inf ] = µ gains H µ gains F E[gain oth ] = µ gains F We can define the overall expected gain as a function of these independent components to compare the proportion of gain due to each effect: E[gain obs ] = E[gain hom ] + E[gain inf ] + E[gain oth ] = [µ gains I µ gains F ] + [µ gains H µ gains F ] + µ gains F = µ gains I + µ gains H µ gains F 5. SYNTHETIC DATA EXPEIMENTS This section describes our investigation of the characteristics of the proposed randomization tests on semi-synthetic data. We are interested in two characteristics of statistical tests. (1) Type I error: the probability of rejecting a true null hypothesis (i.e., incorrectly concluding that there is a significant effect when there is not). (2) Power: the probability of rejecting a false null hypothesis (i.e., correctly concluding that there is a significant effect when there is one). If a statistical test has elevated levels of Type I error, that implies that many of the conclusions we draw from the test may be incorrect. In contrast, if a statistical test has low statistical power, that implies that legitimate performance differences may not be detected as significant. We start with a base of real-world social network data and use the distributions of observed changes to generate data with different characteristics. On data with random changes we evaluate the Type I error of the test; on data with simulated homophily/influence effects, we evaluate the statistical power of the tests. The results show that our proposed tests have low Type I error and power increases as the number of data changes increase. 5.1 Data Using a small subset of the real-world data gathered from Facebook (see section 6) as base for time t, we generated semi-synthetic data sets for time t + 1 designed to either maximize or minimize the presence of homophily or influence effects. The synthetic data in time t + 1 uses the distribution of changes in the original Facebook sample (i.e., number of adds/drops per person), however our data generation procedure chooses a new set of changes to ensure the data have certain characteristics (e.g., homophily). More specifically, we hold G t and X t constant but generate new data for G t+1 and X t+1 to create three types of datasets:

7 andom: Changes are made randomly so there is no homophily or influence effect. Homophily-rich: Attribute changes are made randomly, link changes are designed to maximize homophily. Influence-rich: Link changes are made randomly, attribute changes are designed to maximize influence. To generate data with homophily, link additions are chosen to maximize similarity among the incident nodes. When selecting a new link, the following weights are assigned to each possible link: 1+γ (#overlaps). Here #overlaps is the number of attribute values that the nodes share. The link weights are then normalized over all possible new links (pairs of unlinked nodes) to produce a probability of selecting any given link. The probabilities are used to randomly select the appropriate number of link additions, while weighting the likelihood heavily toward similar pairs of nodes. Dropping links is done in a similar manner except that this probability is calculated across current links in t and the inverse is used to drop links among pairs of nodes that are most dissimilar. To generate data with influence, we add attribute values in a similar way. Here we again compute a weight for each attribute value: 1 + γ (#overlaps), but #overlaps is defined to be the number of neighbors who have the attribute value under consideration. Likewise the inverse is used to determine which attribute values are dropped. 5.2 Methodology Type I errors correspond to cases when the null hypothesis is incorrectly rejected in other words, false positive assessments of significance, when there is in fact no significant homophily/influence effect. To estimate the Type I error rate for each of the tests, we generated data with random changes in time t+1. Thus any observed gain in C is entirely random so any assessment of significance will correspond to a Type I error. To evaluate the Type I error of the tests, we used the procedure outlined in Table 5. Type II errors correspond to cases when the null hypothesis is incorrectly accepted in other words, false negative assessments of significance, when there is in fact a significant homophily/influence effect. Power is the complement of the Type II error rate the proportion of significant effects that are correctly identified (1 P (T ypeii)). To estimate Type II error of the tests, we generated data for time t+1 with either homophily or influence effects. Thus all observed gains in C should be deemed significant any test that fails will correspond to a Type II error. To evaluate the statistical power of the tests, we used the procedure outlined in Table Experimental esults We evaluated the Type I error rates of each test using semi-synthetic data, where the attribute and link changes are made at random. The power of the homophily and influence tests were evaluated using the Homophily-rich and Influence-rich data, respectively. For these experiments, unless otherwise noted, we used N = 20, α = 0.05, γ = 100, and report average rates over 50 attribute values (i.e., group meberships). The first column of Table 7 reports the Type I rates of the choice-based randomization test. Since we used α = 0.05, we expect the error rates to be less than This is indeed T ypeierror(g t, X t, G, X, N, α) For i in 1..N Generate G i t+1, Xt+1 i with random changes numincorrectsigt ests = 0 For j in 1..N SignificanceT est(g t, X t, G i t+1, Xt+1, i N, α) If significant: numincorrectsigtests++ typeierror(i) = P 1 numincorrectsigt ests N avgt ypeierror = 1 N i typeierror(i) Table 5: Method to measure Type I error rate. P ower(g t, X t, G, X, N, α) For i in 1..N Generate G i t+1, Xt+1 i with homophily/influence numincorrectsigt ests = 0 For j in 1..N SignificanceT est(g t, X t, G i t+1, Xt+1, i N, α) If not significant: numincorrectsigtests++ typeiierror(i) P = 1 numincorrectsigt ests N avgp ower = 1 N i (1 typeiierror(i)) Table 6: Method to measure statistical power. the case for both the homophily and the influence tests, indicating that the tests are likely to be accurate in practice. Next we evaluated statistical power. In this case we created data with only the effect we were trying to identify: (1) random attribute changes, homophily-based link changes (2) random link changes, influence-based attribute changes. These results are shown in columns 2 and 3 of Table 7. The average power of the test for identifying homophily was 0.57, meaning an effect was correctly identified 57% of the time on data generated to include homophily. The average power for detecting influence based on our initial semisynthetic datasets was significantly lower at We conjectured that this was due to the presence of fewer changes in attribute values in the original data, compared to changes in the link structure. In particular, many fewer attribute values were being added in the Facebook sample, the average number of link adds per node was 4.08 while the average number of attribute adds was 0.26 for the same set. Figure 2(a) shows the interaction between the number of attribute additions and statistical power. As we increase the number of additions for each node in the graph, the power increase, reaching a maximum of 0.65 given a γ of 50. This is partially due to an increase in effect size (i.e., increase in overall level of influence) but the quantity of attribute changes adds an additional effect, over and above any increase in correlation gain due to increased influence. The synthetic data generates homophily and influence by selecting link and attribute changes such that the correlation between node neighbors is maximized. If there are no such changes available, there will be less homophily and influence present in the synthetic data which degrades the power of the randomization tests. In addition, the distribution after randomizing will be closer to the original data as fewer changes can be randomized, which also reduces the power of the tests. In future work, we will attempt to tease apart these effects. Figure 2(b)-2(c) shows the increase in statistical power

8 Power Power Power Boost to Attribute Additions (pernode) (a) Influence test power, as number of group additions increases Chi Gain (b) Influence test power, as effect size increases Chi Gain (c) Homophily test power, as effect size increases. Figure 2: Power analysis of influence and homophily randomization tests. P(TypeI) Power Power and. X and. X Infl. X and. G Hom. G and. G Homophily test na Influence test 0.04 na 0.76 Table 7: Type I error and power for the choice-based randomization method. as we systematically decrease the effect size for either influence or homophily. Specifically, we varied γ to change the probability of selecting links and groups that produce autocorrelation in the synthetic data. For the influence test we used γ values of 1, 10, 20, 50 and for the homophily test values of 1, 10, 50, 100. We then plotted the expected power of the tests against the median chi gain of the groups. Greater chi gain places the group further from the null distribution, increasing the effect size. As expected, the power of each test decreases as the effect size decreases. 6. EAL DATA EXPEIMENTS 6.1 Data We evaluated our approach on data from the public Purdue Facebook network. Facebook is a popular online social network site with over 250 million members worldwide. Members create and maintain a personal profile page, which contains information about their views, interests, and friends, and can be listed as private or public. Friendship links are undirected and are formed through an invitation by one user along with a confirmation by the other. To be affiliated with a University network, users must have a valid account within the appropriate domain (e.g., purdue.edu), thus the members consist of students, faculty, staff, and alumni. The network we considered comprised more than 3 million public friendship links among 56,000 members. Users had an average and median degree of 46 and 81 respectively. In addition to the friendship links, we considered a set of attributes corresponding to public group membership. Group membership information is posted in the users profile pages. Each group maintains a separate page reflecting some in- Homophily Influence Significant Significant Significant Significant Table 8: Number of groups detected by randomization tests terest (e.g., friends of AAAI), and users who share that interest can become members of the group. For this work, we considered the set of 2648 (public) Facebook users belonging to the class of 2011 student network. For the first time step, we used the friendship links and group memberships from March For the second time step we used friendship links and group memberships from March The students in this sample belong to 494 groups, so we consider each group membership as a binary attribute that can change from one time step to the next. 6.2 Experimental results To investigate homophily and influence in Facebook, we computed the observed correlation gain for each group membership attribute from t = 2008 to t + 1 = We then applied the choice-based randomization procedure to determine if the gains exhibited significant homophily and/or influence effects. Due to the low type I error of the choicebased test, the discovered significant patterns are likely to be correct. However, the low power for influence identification (given the observed number of attribute changes in the sample) means that the influence test may not be able detect all effects in the data. Of the 494 groups, notably 143 (29%) exhibited a significant correlation gain of some type. The assessments of significance are summarized in Table 8. Note that there are more groups that exhibit significant homophily effects (118) compared to significant influence effects (32). This is likely due the larger number of link changes, which results in higher power for the homophily test. To explore the types of groups exhibiting each type of effect, we examined a set of group names selected at random from each cell in Table 8. Table 9 list some examples from each category. Groups with significant homophily effects seem to include

9 Homophily and Influence (7 groups) Purdue Habitat for Humanity Tell 10 to Tell 10 Levee Tan I started doing homework but I ended up on Facebook Homophily Only (111 groups) Purdue Capture the Flag Honors Engineering Community Boiler Gold ush 2008 Purdue Opportunity Awards Influence Only (25 groups) NOBAMA IN 08 I bet I can still find 1,000,000 people who dislike George Bush Hokay, so here s the Earth i need numberss asap No Effect (351 groups) I support Welcome Home We miss Cody Lehe 4-H alumni Harrison Band Table 9: Example groups with each possible combination of effect. opportunities for members to meet in person. For example, Boiler Gold ush is a freshman orientation program for Purdue where members are likely to meet, members of the Capture the Flag group presumably meet to play the game, and Levee Tan is a local tanning salon. Groups with significant influence effects seem to have a political or activist aspect to them. This includes anti-obama and anti-bush groups as well groups like Habitat for Humanity and Tell 10 to Tell 10, which is a breast cancer awareness group. Another group of note is i need numberss [sic] asap. This group was created by a user who had lost his phone and wanted his friends to post their phone numbers to the group wall. The members of this group already had friendship links between them and were joining the group to share phone numbers with other friends which naturally produces a detectable influence effect. 7. CONCLUSION AND FUTUE WOK In this paper, we presented a novel randomization procedure to investigate the causes of observed autocorrelation in network data. The test focuses on distinguishing social influence effects from homophily effects, and enables the accurate assessment of whether the effects are statistically significant. The advantage of the proposed choice-based method includes: (1) a model-free approach which makes a relatively small number of assumptions about the data, (2) the ability to assess both homophily and influence effects, and (3) low Type I error with reasonable levels of power that increase as the number of changes in the data increase. We evaluated our proposed methods on semi-synthetic social network data, showing efficacy of the approach. We then applied the choice-based method to a real-world dataset to investigate the aspects of its observed autocorrelation. Our analysis of a public university Facebook network shows that autocorrelation in group memberships is due to significant influence and homophily effects. However, different groups exhibit different behavior, which indicates homophily and influence vary with respect to group properties. In future work, we plan to investigate this variability more deeply, using multiple time steps to control for the amount of change expected in a single time step and investigating the association of effects with other group properties (e.g., density, popularity). In addition, herein we have only considered randomization procedure for first-order effects (i.e., dyad-level dependencies). Considering second-order effects such as structural similarity and community level change may help to decompose additional external effects that are not explicitly encoded in the data, but are implicit in the temporal-relational dynamics. Although we have investigated the characteristics of our modeling approach on social network data, the methods are broadly applicable to relational and network domains that are changing over time. elational data often record information about people (e.g., organizational structure, transactions) or about artifacts created by people (e.g., citation networks, World Wide Web) so it is likely that social phenomena such as homophily and influence will contribute to autocorrelated observations in a wide array of relational domains. Acknowledgments This research is supported by DAPA and NSF under contract numbers NBCH and SES The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation hereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements either expressed or implied, of DAPA, NSF, or the U.S. Government. 8. EFEENCES [1] A. Anagnostopoulos,. Kumar, and M. Mahdian. Influence and correlation in social networks. In KDD 08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 7 15, [2] L. Anselin. Spatial Econometrics: Methods and Models. Kluwer Academic Publisher, The Netherlands, [3] S. Aral, L. Muchnika, and A. Sundararajan. Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51): , [4] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: membership, growth, and evolution. In KDD 06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 44 54, [5] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages , 1998.

Supporting Statistical Hypothesis Testing Over Graphs

Supporting Statistical Hypothesis Testing Over Graphs Supporting Statistical Hypothesis Testing Over Graphs Jennifer Neville Departments of Computer Science and Statistics Purdue University (joint work with Tina Eliassi-Rad, Brian Gallagher, Sergey Kirshner,

More information

A Shrinkage Approach for Modeling Non-Stationary Relational Autocorrelation

A Shrinkage Approach for Modeling Non-Stationary Relational Autocorrelation A Shrinkage Approach for Modeling Non-Stationary Relational Autocorrelation Pelin Angin Department of Computer Science Purdue University pangin@cs.purdue.edu Jennifer Neville Department of Computer Science

More information

Web Structure Mining Nodes, Links and Influence

Web Structure Mining Nodes, Links and Influence Web Structure Mining Nodes, Links and Influence 1 Outline 1. Importance of nodes 1. Centrality 2. Prestige 3. Page Rank 4. Hubs and Authority 5. Metrics comparison 2. Link analysis 3. Influence model 1.

More information

How to exploit network properties to improve learning in relational domains

How to exploit network properties to improve learning in relational domains How to exploit network properties to improve learning in relational domains Jennifer Neville Departments of Computer Science and Statistics Purdue University!!!! (joint work with Brian Gallagher, Timothy

More information

Measuring Social Influence Without Bias

Measuring Social Influence Without Bias Measuring Social Influence Without Bias Annie Franco Bobbie NJ Macdonald December 9, 2015 The Problem CS224W: Final Paper How well can statistical models disentangle the effects of social influence from

More information

Graph Detection and Estimation Theory

Graph Detection and Estimation Theory Introduction Detection Estimation Graph Detection and Estimation Theory (and algorithms, and applications) Patrick J. Wolfe Statistics and Information Sciences Laboratory (SISL) School of Engineering and

More information

Controlling for confounding network properties in hypothesis testing and anomaly detection

Controlling for confounding network properties in hypothesis testing and anomaly detection Purdue University Purdue e-pubs Open Access Dissertations Theses and Dissertations 8-2016 Controlling for confounding network properties in hypothesis testing and anomaly detection Timothy La Fond Purdue

More information

Tied Kronecker Product Graph Models to Capture Variance in Network Populations

Tied Kronecker Product Graph Models to Capture Variance in Network Populations Tied Kronecker Product Graph Models to Capture Variance in Network Populations Sebastian Moreno, Sergey Kirshner +, Jennifer Neville +, SVN Vishwanathan + Department of Computer Science, + Department of

More information

Modeling Dynamic Evolution of Online Friendship Network

Modeling Dynamic Evolution of Online Friendship Network Commun. Theor. Phys. 58 (2012) 599 603 Vol. 58, No. 4, October 15, 2012 Modeling Dynamic Evolution of Online Friendship Network WU Lian-Ren ( ) 1,2, and YAN Qiang ( Ö) 1 1 School of Economics and Management,

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Machine Learning for Data Science (CS4786) Lecture 24

Machine Learning for Data Science (CS4786) Lecture 24 Machine Learning for Data Science (CS4786) Lecture 24 Graphical Models: Approximate Inference Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ BELIEF PROPAGATION OR MESSAGE PASSING Each

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

HYPOTHESIS TESTING. Hypothesis Testing

HYPOTHESIS TESTING. Hypothesis Testing MBA 605 Business Analytics Don Conant, PhD. HYPOTHESIS TESTING Hypothesis testing involves making inferences about the nature of the population on the basis of observations of a sample drawn from the population.

More information

The goodness-of-fit test Having discussed how to make comparisons between two proportions, we now consider comparisons of multiple proportions.

The goodness-of-fit test Having discussed how to make comparisons between two proportions, we now consider comparisons of multiple proportions. The goodness-of-fit test Having discussed how to make comparisons between two proportions, we now consider comparisons of multiple proportions. A common problem of this type is concerned with determining

More information

Chapter IR:VIII. VIII. Evaluation. Laboratory Experiments Performance Measures Training and Testing Logging

Chapter IR:VIII. VIII. Evaluation. Laboratory Experiments Performance Measures Training and Testing Logging Chapter IR:VIII VIII. Evaluation Laboratory Experiments Performance Measures Logging IR:VIII-62 Evaluation HAGEN/POTTHAST/STEIN 2018 Statistical Hypothesis Testing Claim: System 1 is better than System

More information

Link Prediction. Eman Badr Mohammed Saquib Akmal Khan

Link Prediction. Eman Badr Mohammed Saquib Akmal Khan Link Prediction Eman Badr Mohammed Saquib Akmal Khan 11-06-2013 Link Prediction Which pair of nodes should be connected? Applications Facebook friend suggestion Recommendation systems Monitoring and controlling

More information

Dynamics in Social Networks and Causality

Dynamics in Social Networks and Causality Web Science & Technologies University of Koblenz Landau, Germany Dynamics in Social Networks and Causality JProf. Dr. University Koblenz Landau GESIS Leibniz Institute for the Social Sciences Last Time:

More information

Lecture 5: Sampling Methods

Lecture 5: Sampling Methods Lecture 5: Sampling Methods What is sampling? Is the process of selecting part of a larger group of participants with the intent of generalizing the results from the smaller group, called the sample, to

More information

Appendix: Modeling Approach

Appendix: Modeling Approach AFFECTIVE PRIMACY IN INTRAORGANIZATIONAL TASK NETWORKS Appendix: Modeling Approach There is now a significant and developing literature on Bayesian methods in social network analysis. See, for instance,

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Prediction of Citations for Academic Papers

Prediction of Citations for Academic Papers 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Statistics 3858 : Contingency Tables

Statistics 3858 : Contingency Tables Statistics 3858 : Contingency Tables 1 Introduction Before proceeding with this topic the student should review generalized likelihood ratios ΛX) for multinomial distributions, its relation to Pearson

More information

Using Bayesian Network Representations for Effective Sampling from Generative Network Models

Using Bayesian Network Representations for Effective Sampling from Generative Network Models Using Bayesian Network Representations for Effective Sampling from Generative Network Models Pablo Robles-Granda and Sebastian Moreno and Jennifer Neville Computer Science Department Purdue University

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Specification and estimation of exponential random graph models for social (and other) networks

Specification and estimation of exponential random graph models for social (and other) networks Specification and estimation of exponential random graph models for social (and other) networks Tom A.B. Snijders University of Oxford March 23, 2009 c Tom A.B. Snijders (University of Oxford) Models for

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Chi-Square. Heibatollah Baghi, and Mastee Badii

Chi-Square. Heibatollah Baghi, and Mastee Badii 1 Chi-Square Heibatollah Baghi, and Mastee Badii Different Scales, Different Measures of Association Scale of Both Variables Nominal Scale Measures of Association Pearson Chi-Square: χ 2 Ordinal Scale

More information

Alternative Parameterizations of Markov Networks. Sargur Srihari

Alternative Parameterizations of Markov Networks. Sargur Srihari Alternative Parameterizations of Markov Networks Sargur srihari@cedar.buffalo.edu 1 Topics Three types of parameterization 1. Gibbs Parameterization 2. Factor Graphs 3. Log-linear Models with Energy functions

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Learning MN Parameters with Approximation. Sargur Srihari

Learning MN Parameters with Approximation. Sargur Srihari Learning MN Parameters with Approximation Sargur srihari@cedar.buffalo.edu 1 Topics Iterative exact learning of MN parameters Difficulty with exact methods Approximate methods Approximate Inference Belief

More information

Political Science 236 Hypothesis Testing: Review and Bootstrapping

Political Science 236 Hypothesis Testing: Review and Bootstrapping Political Science 236 Hypothesis Testing: Review and Bootstrapping Rocío Titiunik Fall 2007 1 Hypothesis Testing Definition 1.1 Hypothesis. A hypothesis is a statement about a population parameter The

More information

Sampling of Attributed Networks from Hierarchical Generative Models

Sampling of Attributed Networks from Hierarchical Generative Models Sampling of Attributed Networks from Hierarchical Generative Models Pablo Robles Purdue University West Lafayette, IN USA problesg@purdue.edu Sebastian Moreno Universidad Adolfo Ibañez Viña del Mar, Chile

More information

Data Mining. CS57300 Purdue University. March 22, 2018

Data Mining. CS57300 Purdue University. March 22, 2018 Data Mining CS57300 Purdue University March 22, 2018 1 Hypothesis Testing Select 50% users to see headline A Unlimited Clean Energy: Cold Fusion has Arrived Select 50% users to see headline B Wedding War

More information

Do not copy, post, or distribute

Do not copy, post, or distribute 14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

More information

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search 6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search Daron Acemoglu and Asu Ozdaglar MIT September 30, 2009 1 Networks: Lecture 7 Outline Navigation (or decentralized search)

More information

Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity

Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity Networks as vectors of their motif frequencies and 2-norm distance as a measure of similarity CS322 Project Writeup Semih Salihoglu Stanford University 353 Serra Street Stanford, CA semih@stanford.edu

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

Data Mining und Maschinelles Lernen

Data Mining und Maschinelles Lernen Data Mining und Maschinelles Lernen Ensemble Methods Bias-Variance Trade-off Basic Idea of Ensembles Bagging Basic Algorithm Bagging with Costs Randomization Random Forests Boosting Stacking Error-Correcting

More information

Modelling self-organizing networks

Modelling self-organizing networks Paweł Department of Mathematics, Ryerson University, Toronto, ON Cargese Fall School on Random Graphs (September 2015) Outline 1 Introduction 2 Spatial Preferred Attachment (SPA) Model 3 Future work Multidisciplinary

More information

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric Assumptions The observations must be independent. Dependent variable should be continuous

More information

Introduction to statistical analysis of Social Networks

Introduction to statistical analysis of Social Networks The Social Statistics Discipline Area, School of Social Sciences Introduction to statistical analysis of Social Networks Mitchell Centre for Network Analysis Johan Koskinen http://www.ccsr.ac.uk/staff/jk.htm!

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

Computational Tasks and Models

Computational Tasks and Models 1 Computational Tasks and Models Overview: We assume that the reader is familiar with computing devices but may associate the notion of computation with specific incarnations of it. Our first goal is to

More information

Psych Jan. 5, 2005

Psych Jan. 5, 2005 Psych 124 1 Wee 1: Introductory Notes on Variables and Probability Distributions (1/5/05) (Reading: Aron & Aron, Chaps. 1, 14, and this Handout.) All handouts are available outside Mija s office. Lecture

More information

CMPUT651: Differential Privacy

CMPUT651: Differential Privacy CMPUT65: Differential Privacy Homework assignment # 2 Due date: Apr. 3rd, 208 Discussion and the exchange of ideas are essential to doing academic work. For assignments in this course, you are encouraged

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

More information

Parameter Estimation, Sampling Distributions & Hypothesis Testing

Parameter Estimation, Sampling Distributions & Hypothesis Testing Parameter Estimation, Sampling Distributions & Hypothesis Testing Parameter Estimation & Hypothesis Testing In doing research, we are usually interested in some feature of a population distribution (which

More information

Towards Association Rules with Hidden Variables

Towards Association Rules with Hidden Variables Towards Association Rules with Hidden Variables Ricardo Silva and Richard Scheines Gatsby Unit, University College London, WCN AR London, UK Machine Learning Department, Carnegie Mellon, Pittsburgh PA,

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Hypothesis Testing hypothesis testing approach

Hypothesis Testing hypothesis testing approach Hypothesis Testing In this case, we d be trying to form an inference about that neighborhood: Do people there shop more often those people who are members of the larger population To ascertain this, we

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Modeling Strategic Information Sharing in Indian Villages

Modeling Strategic Information Sharing in Indian Villages Modeling Strategic Information Sharing in Indian Villages Jeff Jacobs jjacobs3@stanford.edu Arun Chandrasekhar arungc@stanford.edu Emily Breza Columbia University ebreza@columbia.edu December 3, 203 Matthew

More information

Hypothesis Testing and Confidence Intervals (Part 2): Cohen s d, Logic of Testing, and Confidence Intervals

Hypothesis Testing and Confidence Intervals (Part 2): Cohen s d, Logic of Testing, and Confidence Intervals Hypothesis Testing and Confidence Intervals (Part 2): Cohen s d, Logic of Testing, and Confidence Intervals Lecture 9 Justin Kern April 9, 2018 Measuring Effect Size: Cohen s d Simply finding whether a

More information

Network data in regression framework

Network data in regression framework 13-14 July 2009 University of Salerno (Italy) Network data in regression framework Maria ProsperinaVitale Department of Economics and Statistics University of Salerno (Italy) mvitale@unisa.it - Theoretical

More information

Modeling of Growing Networks with Directional Attachment and Communities

Modeling of Growing Networks with Directional Attachment and Communities Modeling of Growing Networks with Directional Attachment and Communities Masahiro KIMURA, Kazumi SAITO, Naonori UEDA NTT Communication Science Laboratories 2-4 Hikaridai, Seika-cho, Kyoto 619-0237, Japan

More information

Dover- Sherborn High School Mathematics Curriculum Probability and Statistics

Dover- Sherborn High School Mathematics Curriculum Probability and Statistics Mathematics Curriculum A. DESCRIPTION This is a full year courses designed to introduce students to the basic elements of statistics and probability. Emphasis is placed on understanding terminology and

More information

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices Communities Via Laplacian Matrices Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices The Laplacian Approach As with betweenness approach, we want to divide a social graph into

More information

Two-sample Categorical data: Testing

Two-sample Categorical data: Testing Two-sample Categorical data: Testing Patrick Breheny October 29 Patrick Breheny Biostatistical Methods I (BIOS 5710) 1/22 Lister s experiment Introduction In the 1860s, Joseph Lister conducted a landmark

More information

Psych 230. Psychological Measurement and Statistics

Psych 230. Psychological Measurement and Statistics Psych 230 Psychological Measurement and Statistics Pedro Wolf December 9, 2009 This Time. Non-Parametric statistics Chi-Square test One-way Two-way Statistical Testing 1. Decide which test to use 2. State

More information

Testing for Regime Switching in Singaporean Business Cycles

Testing for Regime Switching in Singaporean Business Cycles Testing for Regime Switching in Singaporean Business Cycles Robert Breunig School of Economics Faculty of Economics and Commerce Australian National University and Alison Stegman Research School of Pacific

More information

psychological statistics

psychological statistics psychological statistics B Sc. Counselling Psychology 011 Admission onwards III SEMESTER COMPLEMENTARY COURSE UNIVERSITY OF CALICUT SCHOOL OF DISTANCE EDUCATION CALICUT UNIVERSITY.P.O., MALAPPURAM, KERALA,

More information

Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach

Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach Author: Jaewon Yang, Jure Leskovec 1 1 Venue: WSDM 2013 Presenter: Yupeng Gu 1 Stanford University 1 Background Community

More information

Statistics: revision

Statistics: revision NST 1B Experimental Psychology Statistics practical 5 Statistics: revision Rudolf Cardinal & Mike Aitken 29 / 30 April 2004 Department of Experimental Psychology University of Cambridge Handouts: Answers

More information

Bargaining, Information Networks and Interstate

Bargaining, Information Networks and Interstate Bargaining, Information Networks and Interstate Conflict Erik Gartzke Oliver Westerwinter UC, San Diego Department of Political Sciene egartzke@ucsd.edu European University Institute Department of Political

More information

Harvard University. Rigorous Research in Engineering Education

Harvard University. Rigorous Research in Engineering Education Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected

More information

ColourMatrix: White Paper

ColourMatrix: White Paper ColourMatrix: White Paper Finding relationship gold in big data mines One of the most common user tasks when working with tabular data is identifying and quantifying correlations and associations. Fundamentally,

More information

Overlapping Communities

Overlapping Communities Overlapping Communities Davide Mottin HassoPlattner Institute Graph Mining course Winter Semester 2017 Acknowledgements Most of this lecture is taken from: http://web.stanford.edu/class/cs224w/slides GRAPH

More information

Approximate Inference

Approximate Inference Approximate Inference Simulation has a name: sampling Sampling is a hot topic in machine learning, and it s really simple Basic idea: Draw N samples from a sampling distribution S Compute an approximate

More information

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds

Chapter 6. Logistic Regression. 6.1 A linear model for the log odds Chapter 6 Logistic Regression In logistic regression, there is a categorical response variables, often coded 1=Yes and 0=No. Many important phenomena fit this framework. The patient survives the operation,

More information

Alternative Parameterizations of Markov Networks. Sargur Srihari

Alternative Parameterizations of Markov Networks. Sargur Srihari Alternative Parameterizations of Markov Networks Sargur srihari@cedar.buffalo.edu 1 Topics Three types of parameterization 1. Gibbs Parameterization 2. Factor Graphs 3. Log-linear Models Features (Ising,

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

10708 Graphical Models: Homework 2

10708 Graphical Models: Homework 2 10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves

More information

6.207/14.15: Networks Lecture 12: Generalized Random Graphs

6.207/14.15: Networks Lecture 12: Generalized Random Graphs 6.207/14.15: Networks Lecture 12: Generalized Random Graphs 1 Outline Small-world model Growing random networks Power-law degree distributions: Rich-Get-Richer effects Models: Uniform attachment model

More information

1 Searching the World Wide Web

1 Searching the World Wide Web Hubs and Authorities in a Hyperlinked Environment 1 Searching the World Wide Web Because diverse users each modify the link structure of the WWW within a relatively small scope by creating web-pages on

More information

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering

Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Certifying the Global Optimality of Graph Cuts via Semidefinite Programming: A Theoretic Guarantee for Spectral Clustering Shuyang Ling Courant Institute of Mathematical Sciences, NYU Aug 13, 2018 Joint

More information

IENG581 Design and Analysis of Experiments INTRODUCTION

IENG581 Design and Analysis of Experiments INTRODUCTION Experimental Design IENG581 Design and Analysis of Experiments INTRODUCTION Experiments are performed by investigators in virtually all fields of inquiry, usually to discover something about a particular

More information

Chapter Three. Hypothesis Testing

Chapter Three. Hypothesis Testing 3.1 Introduction The final phase of analyzing data is to make a decision concerning a set of choices or options. Should I invest in stocks or bonds? Should a new product be marketed? Are my products being

More information

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure Class Presentations In-class, Tuesday and Thursday next week 2-person teams: 6 minutes, up to 6 slides, 3 minutes/slides each person 1-person teams 4 minutes,

More information

Unified Modeling of User Activities on Social Networking Sites

Unified Modeling of User Activities on Social Networking Sites Unified Modeling of User Activities on Social Networking Sites Himabindu Lakkaraju IBM Research - India Manyata Embassy Business Park Bangalore, Karnataka - 5645 klakkara@in.ibm.com Angshu Rai IBM Research

More information

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing STAT 135 Lab 5 Bootstrapping and Hypothesis Testing Rebecca Barter March 2, 2015 The Bootstrap Bootstrap Suppose that we are interested in estimating a parameter θ from some population with members x 1,...,

More information

Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features

Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features Eun Yong Kang Department

More information

Topic 21 Goodness of Fit

Topic 21 Goodness of Fit Topic 21 Goodness of Fit Contingency Tables 1 / 11 Introduction Two-way Table Smoking Habits The Hypothesis The Test Statistic Degrees of Freedom Outline 2 / 11 Introduction Contingency tables, also known

More information

Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee

Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Lecture - 04 Basic Statistics Part-1 (Refer Slide Time: 00:33)

More information

Hidden Markov Models Part 1: Introduction

Hidden Markov Models Part 1: Introduction Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that

More information

Personalized Social Recommendations Accurate or Private

Personalized Social Recommendations Accurate or Private Personalized Social Recommendations Accurate or Private Presented by: Lurye Jenny Paper by: Ashwin Machanavajjhala, Aleksandra Korolova, Atish Das Sarma Outline Introduction Motivation The model General

More information

Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis

Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis /3/26 Rigorous Science - Based on a probability value? The linkage between Popperian science and statistical analysis The Philosophy of science: the scientific Method - from a Popperian perspective Philosophy

More information

RaRE: Social Rank Regulated Large-scale Network Embedding

RaRE: Social Rank Regulated Large-scale Network Embedding RaRE: Social Rank Regulated Large-scale Network Embedding Authors: Yupeng Gu 1, Yizhou Sun 1, Yanen Li 2, Yang Yang 3 04/26/2018 The Web Conference, 2018 1 University of California, Los Angeles 2 Snapchat

More information

Probability and Statistics

Probability and Statistics Probability and Statistics Kristel Van Steen, PhD 2 Montefiore Institute - Systems and Modeling GIGA - Bioinformatics ULg kristel.vansteen@ulg.ac.be CHAPTER 4: IT IS ALL ABOUT DATA 4a - 1 CHAPTER 4: IT

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

CSI 445/660 Part 3 (Networks and their Surrounding Contexts)

CSI 445/660 Part 3 (Networks and their Surrounding Contexts) CSI 445/660 Part 3 (Networks and their Surrounding Contexts) Ref: Chapter 4 of [Easley & Kleinberg]. 3 1 / 33 External Factors ffecting Network Evolution Homophily: basic principle: We tend to be similar

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Prentice Hall Science Explorer: Inside Earth 2005 Correlated to: New Jersey Core Curriculum Content Standards for Science (End of Grade 8)

Prentice Hall Science Explorer: Inside Earth 2005 Correlated to: New Jersey Core Curriculum Content Standards for Science (End of Grade 8) New Jersey Core Curriculum Content Standards for Science (End of Grade 8) STANDARD 5.1 (SCIENTIFIC PROCESSES) - all students will develop problem-solving, decision-making and inquiry skills, reflected

More information

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York

Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval. Sargur Srihari University at Buffalo The State University of New York Reductionist View: A Priori Algorithm and Vector-Space Text Retrieval Sargur Srihari University at Buffalo The State University of New York 1 A Priori Algorithm for Association Rule Learning Association

More information

CS Homework 3. October 15, 2009

CS Homework 3. October 15, 2009 CS 294 - Homework 3 October 15, 2009 If you have questions, contact Alexandre Bouchard (bouchard@cs.berkeley.edu) for part 1 and Alex Simma (asimma@eecs.berkeley.edu) for part 2. Also check the class website

More information

6.207/14.15: Networks Lecture 16: Cooperation and Trust in Networks

6.207/14.15: Networks Lecture 16: Cooperation and Trust in Networks 6.207/14.15: Networks Lecture 16: Cooperation and Trust in Networks Daron Acemoglu and Asu Ozdaglar MIT November 4, 2009 1 Introduction Outline The role of networks in cooperation A model of social norms

More information

Sampling Distributions

Sampling Distributions Sampling Distributions Sampling Distribution of the Mean & Hypothesis Testing Remember sampling? Sampling Part 1 of definition Selecting a subset of the population to create a sample Generally random sampling

More information

ASA Section on Survey Research Methods

ASA Section on Survey Research Methods REGRESSION-BASED STATISTICAL MATCHING: RECENT DEVELOPMENTS Chris Moriarity, Fritz Scheuren Chris Moriarity, U.S. Government Accountability Office, 411 G Street NW, Washington, DC 20548 KEY WORDS: data

More information

Differentially Private Linear Regression

Differentially Private Linear Regression Differentially Private Linear Regression Christian Baehr August 5, 2017 Your abstract. Abstract 1 Introduction My research involved testing and implementing tools into the Harvard Privacy Tools Project

More information