Influence Maximization in Big Networks: An Incremental Algorithm for Streaming Subgraph Influence Spread Estimation


Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Influence Maximization in Big Networks: An Incremental Algorithm for Streaming Subgraph Influence Spread Estimation

Wei-Xue Lu, Peng Zhang, Chuan Zhou, Chunyi Liu, Li Gao
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
Center on Quantum Computation & Intelligent Systems, Univ. of Technology Sydney, Australia

Abstract

Influence maximization plays a key role in social network viral marketing. Although the problem has been widely studied, it is still challenging to estimate influence spread in big networks with hundreds of millions of nodes. Existing heuristic and greedy algorithms incur heavy computation cost in big networks and are incapable of processing dynamic network structures. In this paper, we propose an incremental algorithm for influence spread estimation in big networks. The incremental algorithm breaks down big networks into small subgraphs and continuously estimates influence spread on these subgraphs as data streams. The challenge is that subgraphs derived from a big network are not independent, and MC simulations on each subgraph (defined as snapshots) may conflict with each other. In this paper, we assume that different combinations of MC simulations on subgraphs generate independent samples. In so doing, the incremental algorithm on streaming subgraphs can estimate influence spread with fewer simulations. Experimental results demonstrate the performance of the proposed algorithm.

1 Introduction

Social networks are of vital importance in information diffusion and viral marketing. Influence maximization in social networks is defined as finding a subset of nodes that can trigger the largest number of users propagating the given information.
The problem can be formulated as a discrete optimization problem under the independent cascade (IC) model and the linear threshold (LT) model [Kempe et al., 2003]. It has been proved NP-hard, and a greedy algorithm is favored. However, heavy Monte-Carlo simulations are needed to estimate the influence spreads of different seed sets at each iteration. As social networks continuously increase in size, greedy algorithms are incapable of processing big networks. Many efforts have been made to explore scalable algorithms in big networks. An intuitive idea is to use heuristic algorithms, e.g., [Chen et al., 2009], [Chen et al., 2010], IRIE [Jung et al., 2011], and Group- [Liu et al., 2014]. However, these algorithms are not robust with respect to network structures. On the other hand, many advanced greedy algorithms, e.g., [Leskovec et al., 2007], [Goyal et al., 2011], [Zhou et al.], have been proposed to speed up seed selection. These methods focus on pruning unnecessary Monte-Carlo simulations when selecting new influential nodes. Unfortunately, they are still incapable of processing big networks with hundreds of millions of nodes because of the unavoidable cost of simulating influence cascades. Moreover, all these greedy methods are designed for static networks and cannot be generalized to the dynamic networks that commonly occur in reality.

In this paper, we propose an incremental algorithm for estimating the expected influence spread of seed nodes and evaluate its performance on influence maximization. Specifically, the algorithm breaks down a big network into a large number of small subgraphs, and then processes these subgraphs as data streams. The results on the subgraphs are joined to recover the global result on the whole network. The nature of incremental processing also enables the algorithm to handle dynamic networks whose structures change over time. The main challenge of the problem is that small subgraphs derived from a big network are not independent.
For example, as in Figure 1, the two subgraphs on the right-hand side share nodes 4, 5, 7; even worse, two subgraphs may share common edges. In this case, MC simulation results on subgraphs (defined as snapshots) may not be directly addable, or may even conflict with each other. In this work, we assume that the edge sets of different subgraphs are disjoint, and each subgraph is converted into a class of Strongly Connected Components (SCCs). SCCs on subgraphs are joined to form snapshots, and hence SCCs, of the whole network, and different combinations of simulations on subgraphs generate independent samples. In experiments, we demonstrate that the solution outperforms heuristics and existing MC-simulation-based methods.

The merits of the proposed incremental algorithm are summarized as follows: A big network is broken down into subgraphs and the coin-flip technique [Kempe et al., 2003] is applied to generate snapshots on each subgraph. Subgraphs can be processed in parallel. Snapshots on subgraphs are converted into a class of

Strongly Connected Components (SCCs). Snapshots on different subgraphs are joined to restore the global result on the original network. The number of snapshot joins increases polynomially with the number of subgraphs, which significantly reduces the complexity of MC simulations. Snapshots are incrementally maintained and updated such that replicated simulations can be avoided when the seed set changes.

2 Related Work

Data-intensive applications such as data streams motivate incremental learning algorithms. Incremental learning is advantageous when dealing with very large or non-stationary data. Existing incremental learning algorithms can be categorized into two groups: approximate incremental learning [Kivinen et al., 2004] and exact incremental learning [Laskov et al., 2006]. Typical examples include the incremental SVM [Poggio, 2001] and PCA models [Balsubramani et al., 2013]. However, these models were proposed to handle vectorial training data under the Independent and Identically Distributed (IID) assumption. Incrementally learning from subgraph data without the IID assumption has not been addressed before.

In the domain of network influence maximization, most existing models cannot handle big networks with hundreds of millions of nodes due to heavy Monte-Carlo computation. The typical Cost-Effective Lazy Forward (CELF) algorithm [Leskovec et al., 2007] greatly reduces the number of influence spread estimations and achieves a 700-fold speedup over previous greedy algorithms. Unfortunately, these improved greedy algorithms are still inefficient due to too many Monte-Carlo simulations for influence spread estimation. StaticGreedyDU [Cheng et al., 2013] reuses snapshots generated from a given graph and reduces the number of simulations by two orders of magnitude, while its performance is almost the same as that of the original greedy algorithm. Recent work [Ohsaka et al., 2014] includes a pruned BFS method, and there are also community-based methods, e.g.,
[Wang et al., 2010], [Chen et al., 2014]. Unlike these works, our method scales well to big networks by breaking down the big network into small subgraphs for fast spread estimation.

3 Preliminaries

Let G = (V, E) be an undirected graph with a node set V of size |V| and an edge set E of size |E|. Without loss of generality, we use the independent cascade (IC) model as the propagation model. A node subset S ⊆ V is called a seed set if we first activate the nodes in S. Given a seed set S, its influence spread σ(S) is defined as the expected number of nodes eventually activated. Formally, the influence maximization problem is to find the optimal seed set S* such that S* = argmax_{S ⊆ V, |S| = k} σ(S), where |S| stands for the number of elements in S and k is a predefined integer. To solve the influence estimation problem, one needs to estimate σ(S) for any given S. However, the exact computation of σ(·) is #P-hard [Chen et al., 2010], so in practice Monte-Carlo simulations are employed to estimate σ(S).

Figure 1: An illustration of the network decomposition.

Propagation Simulation. The influence spread is obtained by directly simulating the random propagation process from a given seed set S. Let A_i denote the set of nodes newly activated in the i-th iteration, with A_0 = S. In the (i+1)-th iteration, each node u in A_i has a single chance to activate each of its inactive neighbors v with probability p_uv. If v is activated, it is added to A_{i+1}. The process stops at the first T for which A_{T+1} = ∅, and X(S) = |A_0 ∪ A_1 ∪ ... ∪ A_T| is the influence spread of this single simulation. We run such simulations many times and obtain the estimated expected influence spread by averaging over all simulations.

Snapshot Simulation. Based on the coin-flip technique [Kempe et al., 2003], we flip coins in advance to produce a snapshot X = (V, E_X), a sampled subgraph of G in which each edge uv is either kept with probability p_uv or removed, so that E_X ⊆ E.
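As an illustrative sketch (our own Python, not from the paper; the paper's implementation is in C++ and all names here are ours), a snapshot can be produced by flipping one coin per edge, and the spread of a seed set on that snapshot is the number of reachable nodes:

```python
import random
from collections import deque

def coin_flip_snapshot(edges, p, rng):
    """Flip a coin for every edge in advance: keep edge uv with probability p_uv."""
    return [e for e in edges if rng.random() < p.get(e, 0.0)]

def snapshot_spread(kept_edges, seed_set):
    """X(S): the number of nodes reachable from S over the kept (live) edges."""
    adj = {}
    for u, v in kept_edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)   # the graph is undirected
    reached, frontier = set(seed_set), deque(seed_set)
    while frontier:                        # plain BFS over live edges
        u = frontier.popleft()
        for v in adj.get(u, []):
            if v not in reached:
                reached.add(v)
                frontier.append(v)
    return len(reached)
```

Averaging `snapshot_spread` over many snapshots gives the estimate of σ(S) described next.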
Thus, in this specific snapshot X, the influence spread for the given seed set S, denoted X(S), equals the number of nodes reachable from S. We generate a large number of snapshots and obtain the estimated expected influence spread by averaging over all generated snapshots.

The two simulation methods are equivalent, because the spreads they produce share the same probability distribution [Kempe et al., 2003]. In this work, we use snapshot simulations because we need to frequently estimate the influence spread of different seed sets when adding new seeds to S.

4 Problem Formulation

The basic idea of the paper is to break down a big network into small subgraphs and analyze the reachability of nodes on each subgraph independently. The results from each subgraph are joined to restore the results on the original network. We use Figure 1 to explain the incremental subgraph algorithm. Figure 1(a) is the original network, where each edge is associated with a propagation probability p_uv, u, v ∈ {1, 2, ..., 8}. For simplicity, assume the edges of the graph are split into two disjoint sets, yielding the two subgraphs shown in Figures 1(b) and 1(c). The propagation probability p_uv on each edge of a subgraph is inherited from the original network. This way, the original graph is broken down into two small and tractable subgraphs, and incremental learning can be applied to the subgraphs.

Consider that the original network G = (V, E) is broken down into N subgraphs G_1 = (V_1, E_1), G_2 = (V_2, E_2), ..., G_N = (V_N, E_N) with V_1 ∪ V_2 ∪ ... ∪ V_N = V, E_1 ∪ E_2 ∪ ... ∪ E_N = E, and E_i ∩ E_j = ∅ for 1 ≤ i < j ≤ N. In dynamic networks with increasing nodes and edges, we keep adding nodes and edges as new streaming subgraphs, denoted G_{N+1} = (V_{N+1}, E_{N+1}), and so on. Then, we apply the coin-flip method on each subgraph G_i for M times and generate M snapshots, denoted X^(i)_1, X^(i)_2, ..., X^(i)_M. Our problem thus becomes: given a seed set in a big network, incrementally learn from the snapshots generated on the subgraphs to estimate the influence spread of the seed set in the original big network.

5 The Incremental Algorithm

5.1 Expression of Influence

First, each snapshot X_r of the original network can be represented by an |E|-dimensional binary vector r = (r_uv)_{uv ∈ E}, where r_uv = 1 if edge uv is activated in a coin flip and r_uv = 0 otherwise. We call r the representation vector of X_r. The set of all possible snapshot representation vectors is denoted by

    R = { r = (r_uv)_{uv ∈ E} ∈ {0, 1}^{|E|} }.    (1)

Under the IC model assumption, the probability distribution on R is given by

    q(r) = ∏_{uv ∈ E} p_uv^{r_uv} (1 - p_uv)^{1 - r_uv},  r ∈ R.    (2)

If the original graph is broken down into subgraphs, the snapshot obtained by restricting X_r to subgraph G_i is denoted X^(i)_r = X_r|_{G_i}, with representation vector r^(i) = (r_uv)_{uv ∈ E_i}. Obviously, the marginal distribution of r^(i) is q(r^(i)) = ∏_{uv ∈ E_i} p_uv^{r_uv} (1 - p_uv)^{1 - r_uv}, which likewise describes the distribution of snapshots on subgraph G_i. Because the edge sets are disjoint, these distributions satisfy q(r) = q(r^(1)) q(r^(2)) ··· q(r^(N)). Therefore, subgraphs can be processed incrementally by using the network decomposition. We introduce the join operator (written ⊕ here) through the equation X_r = X^(1)_r ⊕ X^(2)_r ⊕ ··· ⊕ X^(N)_r, which means that X_r can be broken down into the snapshots X^(i)_r, 1 ≤ i ≤ N, where X^(i)_r is the snapshot corresponding to subgraph G_i. Conversely, if snapshots X^(i)_r, 1 ≤ i ≤ N, are incrementally obtained from the subgraphs, we can join them to restore the snapshot X_r of the original network.
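The product factorization over edge-disjoint subgraphs can be checked numerically. The sketch below is our own illustrative code (edge keys and function names are hypothetical): it evaluates the snapshot probability of Eq. (2) and verifies that it equals the product of the marginals on two disjoint edge sets.

```python
def q(r, p):
    """Probability of a representation vector r under the IC model (Eq. 2).

    r: dict edge -> 0/1 (1 means the edge was kept in the coin flip)
    p: dict edge -> propagation probability p_uv
    """
    out = 1.0
    for e, bit in r.items():
        out *= p[e] if bit else (1.0 - p[e])
    return out

def restrict(r, sub_edges):
    """Representation vector r^(i): restriction of r to a subgraph's edges."""
    return {e: r[e] for e in sub_edges}
```

For instance, with edges {ab, bc, cd} split into E_1 = {ab} and E_2 = {bc, cd}, q(r) equals q(r^(1)) * q(r^(2)) because the per-edge coin flips are independent.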
Hence the equation X_r = X^(1)_r ⊕ X^(2)_r ⊕ ··· ⊕ X^(N)_r describes the relationship between X_r and the X^(i)_r. Let N(S; X_r) denote the set of nodes reachable from the seed set S in a specific snapshot X_r. The expected influence spread σ(S) of the seed set S can then be written as

    σ(S) = Σ_{r^(1)} Σ_{r^(2)} ··· Σ_{r^(N)} q(r^(1)) q(r^(2)) ··· q(r^(N)) |N(S; X_r)|.    (3)

We can observe from the above equation that σ(S) is determined by N(S; X_r) = N(S; X^(1)_r ⊕ X^(2)_r ⊕ ··· ⊕ X^(N)_r). Thus, we next explain how to use the snapshots generated from the subgraphs to estimate the expected influence spread of a given seed set in the original network.

Algorithm 1: The join of SCC classes.
Input: C_1 = {C_11, C_12, ..., C_1α1} and C_2 = {C_21, C_22, ..., C_2α2}
Output: a class of SCCs after joining C_1 and C_2
 1  function join(C_1, C_2)
 2      for j = 1, 2, ..., α_2 do
 3          s ← 1, C ← ∅, C' ← ∅
 4          for i = 1, 2, ..., α_1 do
 5              if C_1i ∩ C_2j ≠ ∅ then
 6                  if C_2j ⊆ C_1i then return C_1
 7                  else C ← C ∪ C_1i
 8              else
 9                  C'_s ← C_1i; C' ← C' ∪ {C'_s}; s ← s + 1
10          C ← C ∪ C_2j; C_1 ← C' ∪ {C}
11      return C_1

For any snapshot X of the network G = (V, E), we write u → v if node v is reachable from node u on the snapshot X. We say that two nodes u and v are equivalent (u ∼ v) if and only if both u → v and v → u hold; in addition, each node u is equivalent to itself, u ∼ u. Based on these definitions, the set of nodes C = C(u; X) = {v ∈ V : v ∼ u on X} is called a Strongly Connected Component (SCC) of X, represented by the node u. The set C is either a singleton {u} or a set of equivalent nodes. This way, the snapshot X can be described by a class of SCCs C = {C(u; X) : u ∈ V} = {C_1, C_2, ...}, where C_1, C_2, ... are disjoint sets and C_1 ∪ C_2 ∪ ··· = V. SCCs also enable memory-saving and efficient calculation. As we aim to find the most influential nodes, singleton sets in the class C can be removed, leaving C = {C_1, C_2, ..., C_α} with |C_i| ≥ 2, i ∈ {1, 2, ..., α}.
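Because G is undirected, strong connectivity on a snapshot reduces to ordinary connectivity, so extracting the SCC class of a snapshot can be sketched as a connected-components routine. The code below is our own illustrative Python (the paper's implementation uses DFS in C++; we use BFS here), and it drops singleton components as described above.

```python
from collections import deque

def scc_class(nodes, kept_edges):
    """SCC class of an (undirected) snapshot given its kept edges.

    For undirected snapshots, strongly connected components coincide with
    connected components. Singletons are dropped, matching |C_i| >= 2.
    """
    adj = {u: [] for u in nodes}
    for u, v in kept_edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])   # BFS from each unseen node
        seen.add(s)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        if len(comp) >= 2:              # singletons are discarded
            comps.append(frozenset(comp))
    return comps
```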
The elements of N(S; X_r) can be described as follows:

    N(S; X_r) = ∪_{u ∈ S} C(u; X_r) = ∪_{u ∈ S, |C(u; X_r)| ≥ 2} C(u; X^(1)_r ⊕ ··· ⊕ X^(N)_r).

Based on the above equation, we join SCCs on subgraphs to restore the result on the original network.

5.2 Join of the SCC Classes

Suppose we have transformed every snapshot X^(i)_j, 1 ≤ i ≤ N, 1 ≤ j ≤ M, generated from the subgraphs into a class of SCCs C^(i)_j = {C^(i)_jk : 1 ≤ k ≤ α^(i)_j, |C^(i)_jk| ≥ 2}, where C^(i)_j is the SCC class of X^(i)_j. We have discussed that once snapshots have been incrementally obtained on each subgraph, we can join them to obtain snapshots of the original network. Now, the problem becomes using the SCC classes of these subgraphs to restore SCC classes of the original network; what we need to do next is to join the SCC classes of the subgraphs.

Take N = 4 for example. Suppose we incrementally receive C^(1)_{j1}, C^(2)_{j2}, C^(3)_{j3}, C^(4)_{j4}, where each C^(i)_{ji} is an SCC class of subgraph G_i. To restore an SCC class of the original network, we first join C^(1)_{j1} and C^(2)_{j2} to obtain C^(1,2)_{j1,j2}, which is an SCC class of the subgraph G_1 ⊕ G_2 = (V_1 ∪ V_2, E_1 ∪ E_2); then C^(3)_{j3} arrives and we incrementally join it with C^(1,2)_{j1,j2} to obtain C^(1,2,3)_{j1,j2,j3}. At last, C^(4)_{j4} arrives and we join it with C^(1,2,3)_{j1,j2,j3} to obtain C^(1,2,3,4)_{j1,j2,j3,j4}, which is an SCC class of the original network. From this example, we can observe that the join of C^(1)_{j1}, C^(2)_{j2}, ..., C^(N)_{jN}, 1 ≤ j1, ..., jN ≤ M, reduces to joins of pairs of SCC classes. Therefore, we define a function join(C_1, C_2) in Algorithm 1 that returns the class obtained by joining C_1 = {C_11, C_12, ..., C_1α1} and C_2 = {C_21, C_22, ..., C_2α2}.

5.3 Influence Estimation

In this section, we first briefly review traditional network influence estimation, which is theoretically guaranteed by Theorem 1. When extending the traditional method to streaming subgraphs, we can join the snapshots of subgraphs to restore samples of the original network. However, the number of possible snapshot joins grows exponentially with the number of subgraphs, so we propose to select only a small number of them. Proposition 2 explains that when the selected joins are independent and identically distributed (i.i.d.), the estimation converges fast. However, it is hard to select i.i.d. snapshot joins across subgraphs in reality. Therefore, we propose Definitions 1 and 2 to choose snapshot joins that approximately satisfy the i.i.d. condition. Proposition 3 then shows that with this approximation, a polynomial number of snapshot joins suffices to restore the global result.

For simplicity, |N(S; X^(1)_{j1} ⊕ X^(2)_{j2} ⊕ ··· ⊕ X^(N)_{jN})| is denoted by g_{j1,...,jN}(S), which means that a snapshot of the original network is obtained by incrementally joining the j_i-th snapshot from the i-th subgraph, and g_{j1,...,jN}(S) returns the influence spread of a given seed set S. The representation vector of X^(1)_{j1} ⊕ ··· ⊕ X^(N)_{jN} follows the distribution q(r). Hence for any 1 ≤ j1, ..., jN ≤ M, g_{j1,...,jN}(S) is a random variable such that E g_{j1,...,jN}(S) = σ(S), where E is the expectation operator.
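A sketch of what Algorithm 1 computes: components from two SCC classes that share a node are merged into one. We implement this here with union-find rather than the paper's pairwise scan (both yield the SCC class of the union graph when the subgraphs' edge sets are disjoint); all names are our own.

```python
def join(c1, c2):
    """Join two SCC classes (cf. Algorithm 1) via union-find.

    c1, c2: iterables of disjoint node sets (each of size >= 2).
    Components that share at least one node end up in one merged component.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for comp in list(c1) + list(c2):
        nodes = list(comp)
        for v in nodes[1:]:                 # link every node to a representative
            union(nodes[0], v)

    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return [frozenset(g) for g in groups.values() if len(g) >= 2]
```

For example, joining {{1,2}, {3,4}} with {{2,3}, {5,6}} merges the first three components through the shared nodes 2 and 3.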
Traditionally, M simulations are independently run on the original network, generating M snapshots denoted X_i = X^(1)_i ⊕ ··· ⊕ X^(N)_i, and σ(S) is empirically estimated by

    σ̂(S) = (1/M) Σ_{i=1}^{M} g_{i,i,...,i}(S),

where g_{i,i,...,i}(S), 1 ≤ i ≤ M, are bounded and i.i.d. The estimation is therefore guaranteed by the following theorem [Hoeffding, 1963].

Theorem 1 (Hoeffding's inequality). Let Y_1, ..., Y_M be independent random variables taking values in the interval [a, b], and let S_M = Y_1 + ··· + Y_M. Then, for all ε > 0,

    P( |S_M/M - E S_M/M| ≥ ε ) ≤ 2 exp( -2Mε² / (b - a)² ).    (4)

In the traditional problem setting, the components of the subscript of g_{i,i,...,i}(S) are identical, because the simulations are run on the single network G and the subscript (i, i, ..., i) denotes the i-th snapshot simulation on G. Furthermore, the snapshots on subgraphs are not obtained by independent simulations but by restricting snapshots of the original network to the subgraphs. In contrast, our incremental algorithm generates the snapshots of subgraphs independently and then joins them to the previous results. Specifically, we first generate M snapshots X^(1)_j, 1 ≤ j ≤ M, on subgraph G_1; then subgraph G_2 arrives, where we generate another M snapshots X^(2)_j, 1 ≤ j ≤ M, and join them to the snapshots of G_1 to obtain snapshots X^(1)_{j1} ⊕ X^(2)_{j2}. The process continues until the N-th subgraph is joined. This way, we generate M^N random variables g_{j1,...,jN}(S), 1 ≤ j1, ..., jN ≤ M, such that E g_{j1,...,jN}(S) = σ(S). Unfortunately, it is not reasonable to take S_Σ = (1/M^N) Σ_{j1=1}^{M} ··· Σ_{jN=1}^{M} g_{j1,...,jN}(S) as an estimate of σ(S). On the one hand, the number of snapshot joins M^N increases exponentially with the number of subgraphs. On the other hand, many snapshot joins are redundant for estimating the expected influence spread.
To better explain these two points, we first rewrite S_Σ as the average of M^{N-1} groups, each of which is the average of M independent and identically distributed random variables:

    S_Σ = (1/M^{N-1}) Σ_{β=1}^{M^{N-1}} [ (1/M) Σ_{α=1}^{M} g_{j1^(α,β),...,jN^(α,β)}(S) ] = (1/M^{N-1}) Σ_{β=1}^{M^{N-1}} T_β,

where the g_{j1^(α,β),...,jN^(α,β)}(S), 1 ≤ α ≤ M, inside each T_β are i.i.d. Figure 2(a) shows the join of three snapshots of the original network: the first joined snapshot is obtained by incrementally joining X^(1)_1, X^(2)_1, X^(3)_1, X^(4)_1 and corresponds to the random variable g_{1,1,1,1}(S). Similarly, the remaining two correspond to g_{2,2,2,2}(S) and g_{3,3,3,3}(S). These three are i.i.d., and their average is T_1 = (1/3)[g_{1,1,1,1}(S) + g_{2,2,2,2}(S) + g_{3,3,3,3}(S)]. Hence, Figure 2(a) corresponds to T_1; similarly, Figures 2(b), 2(c), 2(d) correspond to T_2, T_3, T_4, respectively. Note that not all T_β are shown in the figure. Each T_β can be treated as an S_M/M in Theorem 1, so Eq. (4) is satisfied, i.e., P(|T_β - σ(S)| ≥ ε) ≤ 2 exp(-2Mε²/(b - a)²). Therefore, each T_β has the same convergence rate as the traditional method.

Next, our goal is to choose a portion of the T_β to estimate the influence spread so as to obtain a higher convergence rate. If the same convergence rate is required, our method needs fewer simulations on each subgraph than the traditional method; moreover, our method can deal with dynamic networks. Toward this goal, we first introduce the following proposition.

Proposition 2. If T_1, ..., T_k are independent and identically distributed random variables, then (1/k) Σ_{i=1}^{k} T_i has a higher convergence rate than the right-hand side of Eq. (4).

Proof. By independence and Hoeffding's lemma [Hoeffding, 1963], we have, for any t > 0,

    P( (1/k) Σ_{i=1}^{k} T_i - σ(S) ≥ ε )
      = P( t (1/k) Σ_{i=1}^{k} T_i ≥ t σ(S) + t ε )
      ≤ E exp( t (1/k) Σ_{i=1}^{k} T_i ) e^{-t σ(S)} e^{-t ε}
      = ∏_{i=1}^{k} E e^{(t/k)(T_i - σ(S))} e^{-t ε}
      ≤ exp( t² (b - a)² / (8Mk) ) e^{-t ε}
      ≤ exp( -2kMε² / (b - a)² ),

where the last step follows by optimizing over t, and the two-sided bound P(|(1/k) Σ_{i=1}^{k} T_i - σ(S)| ≥ ε) ≤ 2 exp(-2kMε²/(b - a)²) follows in the same way.
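As a numeric illustration of Theorem 1 and Proposition 2 (our own sketch; the fair-coin example and all names are assumptions, not the paper's), the Hoeffding bound can be computed and compared against an empirical tail probability. Replacing a sample size of M by kM reproduces the tighter rate of Proposition 2.

```python
import math
import random

def hoeffding_bound(n, eps, a=0.0, b=1.0):
    """Two-sided Hoeffding bound for the mean of n i.i.d. samples in [a, b]."""
    return 2.0 * math.exp(-2.0 * n * eps * eps / (b - a) ** 2)

def empirical_tail(n, eps, trials=2000, seed=0):
    """Empirical P(|sample mean - 0.5| >= eps) for n fair coin flips."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        if abs(mean - 0.5) >= eps:
            bad += 1
    return bad / trials
```

With k = 3 groups of M = 100 samples each, `hoeffding_bound(300, eps)` is strictly smaller than `hoeffding_bound(100, eps)`, matching the exp(-2kMε²/(b-a)²) rate in the proposition.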

Figure 2: Joins of subgraph snapshots. X^(i)_j represents the j-th snapshot simulation result on the i-th subgraph. Panels: (a) join group 1, (b) join group 2, (c) join group 3, (d) join group 4.

In the proof, the first inequality is based on Markov's inequality. The second equality is guaranteed by the independence of T_1, ..., T_k. The second inequality is due to Hoeffding's lemma, and the last inequality is the upper bound obtained after optimizing over t. The last term shows that a higher convergence rate is obtained.

From the above proposition, the strategy is to choose those T_β that tend to be independent. Besides, the influence spread of a particular seed set depends not only on the subgraph snapshots X^(i)_j but also on how they are joined, which determines the network topology of the resulting snapshot of the original graph. Hence, we define (j1, j2, ..., jN) as a join path with N - 1 stages (j1, j2), (j2, j3), ..., (j_{N-1}, jN).

Definition 1. Two join paths (i1, i2, ..., iN) and (j1, j2, ..., jN) are called orthogonal if they have no overlapping stage, i.e., (i_α, i_{α+1}) ≠ (j_α, j_{α+1}) for all 1 ≤ α ≤ N - 1.

Definition 2. Let T_{β1} = (1/M) Σ_{α=1}^{M} g_{j1^(α,β1),...,jN^(α,β1)}(S) and T_{β2} = (1/M) Σ_{α'=1}^{M} g_{j1^(α',β2),...,jN^(α',β2)}(S). T_{β1} and T_{β2} are called join-independent if for any α and α' with 1 ≤ α, α' ≤ M, the paths (j1^(α,β1), ..., jN^(α,β1)) and (j1^(α',β2), ..., jN^(α',β2)) are orthogonal.

For example, in Figure 2, T_1, T_2, T_3 are pairwise join-independent, while T_1 and T_4 contain paths with two overlapping stages. Now, we introduce the set T = {T_β : T_{β1}, T_{β2} are join-independent for all T_{β1}, T_{β2} ∈ T} and use the T_β ∈ T to estimate the influence spread. The following proposition shows how many T_β are selected.

Proposition 3. Let T be a maximal set of pairwise join-independent T_β, i.e., T = {T_β : T_{β1}, T_{β2} are join-independent for all T_{β1}, T_{β2} ∈ T}. Then |T| = M.

Proof. All the paths appearing in the T_β ∈ T form a set of paths P. Each candidate T corresponds to a unique P, and all candidate sets T (or P) have the same cardinality.
It can be verified that the set P_0 = {(i, j, i, j, ...) : 1 ≤ i, j ≤ M} is one of the candidate sets P. So we have |P| = |P_0| = M², which means |T| = M, since each T_β ∈ T contains M paths.

Algorithm 2: The Subgraph Incremental Algorithm.
Input: G = (V, E); subgraphs G_1 = (V_1, E_1), ..., G_N = (V_N, E_N); seed set size k; number of snapshots M; propagation probability p
Output: a seed set S that maximizes the influence on the original network
 1  for i = 1, 2, ..., N do
 2      for j = 1, 2, ..., M do
 3          sample each edge with probability p on the i-th subgraph to obtain a snapshot X^(i)_j;
 4          use DFS graph search to get a class of SCCs C^(i)_j;
 5  for i = 1, 2, ..., M do
 6      for j = 1, 2, ..., M do
 7          calculate C_ij using the function join(·, ·);
 8  S ← ∅;
 9  while |S| < k do
10      for v ∈ V \ S do
11          σ(v|S) ← 0;
12          for i = 1, 2, ..., M do
13              for j = 1, 2, ..., M do
14                  if S cannot reach v on the (i, j)-th joined snapshot then
15                      σ(v|S) ← σ(v|S) + |C_ij(v)|;
16      t ← argmax_{v ∈ V\S} (1/M²) σ(v|S);
17      S ← S ∪ {t}
18  return S

Therefore, we regard σ̂(S) = (1/M) Σ_{T_β ∈ T} T_β as the estimate of the influence spread of the seed set S on the original graph. Although there may be many such sets T, it is easily seen from the proof of Proposition 3 that there exists some T_0 such that the set of all paths in the T_β ∈ T_0 equals P_0. Hence one specific expression of σ̂(S) is σ̂(S) = (1/M²) Σ_{i=1}^{M} Σ_{j=1}^{M} g_{i,j,i,j,...}(S), in which only a polynomial number M² of snapshot joins is used.

5.4 Influence Maximization

In this section, we evaluate our estimate of the influence spread in the process of influence maximization. The greedy algorithm [Kempe et al., 2003] is adopted to select seed nodes: starting with an empty seed set S, it adds the node u with the maximum marginal influence, i.e., u = argmax_{v ∈ V\S} σ̂(S ∪ {v}) - σ̂(S), to S in every iteration until the threshold |S| = k is met. Because the influence spread function is non-negative, monotone and submodular, the quality of the greedy solution is guaranteed by f(S) ≥ (1 - 1/e) f(S*), where S* is the optimal solution [Nemhauser et al., 1978]. We summarize the above in Algorithm 2.
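The seed-selection loop of Algorithm 2 can be sketched as follows. This is our own simplified Python over precomputed joined SCC classes (one class per joined snapshot; all names are hypothetical): since snapshots of an undirected graph are used, a seed reaches exactly its own component, so the marginal gain of v on one snapshot is |C(v)| when no current seed shares v's component, and 0 otherwise.

```python
def greedy_seeds(nodes, classes, k):
    """Greedy selection (cf. Algorithm 2) over precomputed joined SCC classes.

    classes: list of SCC classes, one per joined snapshot; each class is a
    list of disjoint node sets of size >= 2.
    """
    # Per snapshot, map each node to its component (nodes in no component
    # belong to dropped singletons and contribute only themselves).
    comp_of = [{u: c for c in cls for u in c} for cls in classes]
    seeds = set()
    for _ in range(k):
        best, best_gain = None, -1.0
        for v in nodes:
            if v in seeds:
                continue
            gain = 0
            for cmap in comp_of:
                c = cmap.get(v)
                if c is not None and not (seeds & c):
                    gain += len(c)       # v newly activates its whole component
            gain /= len(classes)         # the 1/M^2 averaging in the paper
            if gain > best_gain:
                best, best_gain = v, gain
        seeds.add(best)
    return seeds
```

The constant +1 for a seed activating itself is omitted since it does not change the ranking of candidates.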
Note that σ̂(S) is used in each iteration when the marginal influence is calculated:

    σ̂(S ∪ {v}) - σ̂(S)
      = (1/M²) Σ_{i=1}^{M} Σ_{j=1}^{M} [ g_{i,j,i,j,...}(S ∪ {v}) - g_{i,j,i,j,...}(S) ]
      = (1/M²) Σ_{i=1}^{M} Σ_{j=1}^{M} [ |N(S ∪ {v}; X^(1)_i ⊕ X^(2)_j ⊕ ···)| - |N(S; X^(1)_i ⊕ X^(2)_j ⊕ ···)| ]
      = (1/M²) Σ_{i=1}^{M} Σ_{j=1}^{M} [1 - I(S → v)] |C_{i,j}(v)|,

where I(S → v) = 1 if there exists s ∈ S such that s → v (and 0 otherwise), and C_{i,j}(v) denotes the SCC including v on the snapshot

X^(1)_i ⊕ X^(2)_j ⊕ ···, whose class of SCCs is denoted by C_ij.

6 Experiments

We conduct experiments on four real-world data sets. All algorithms are implemented in C++ and run on a Windows 7 system with a 2.3GHz CPU and 4GB memory.

Datasets       number of nodes   number of edges
ego-Facebook   4,039             176,468
ca-HepPh       12,008            118,521
ca-CondMat     23,133            186,936
email-Enron    36,692            367,662

Table 1: The four real-world datasets.

6.1 Data Sets

We use four network datasets from snap.stanford.edu/data/ for testing and comparison. The number of nodes and edges in each network is summarized in Table 1.

ego-Facebook: This data set consists of circles (friends lists) from Facebook, collected from survey participants using a Facebook app.

ca-HepPh and ca-CondMat: These two data sets are obtained from the e-print arXiv. If authors i and j published a paper together, the network contains an edge connecting nodes i and j.

email-Enron: Nodes of the network are email addresses, and if an address i sent at least one email to address j, the graph contains an undirected edge between i and j.

6.2 Benchmark Methods

We compare the incremental algorithm with five algorithms. The propagation probability of each edge is set to p = p_uv = 0.01 for all uv ∈ E.

CELF: We set M = 10,000 in the Monte-Carlo simulations to estimate the spread of a seed set.

StaticGreedy: It reuses the generated snapshots to reduce the number of simulations. The number of snapshots is set to M = 10,000.

DegreeDiscount: This algorithm is developed for the IC model with a uniform propagation probability.

PMIA: It uses maximum influence paths for influence spread estimation. The value of the parameter θ is set to 1/320.

PageRank: We implement the power method with a damping factor of 0.85 and take the k highest-ranked nodes as seeds. The algorithm stops when the L_1-norm change falls below the threshold 10^{-6}.
6.3 Experimental Results

In our experiments, to obtain the influence spread of the heuristic algorithms for each seed set, we run 10,000 simulations and take the average spread as the influence spread, which matches the accuracy of the greedy algorithms.

Figure 3: Comparisons between benchmark algorithms and the incremental algorithm on the four data sets: (a) ego-Facebook, (b) ca-HepPh, (c) ca-CondMat, (d) email-Enron.

Figure 4: Comparisons w.r.t. the running time (in msec) on the four data sets.

Influence spread. Figure 3 shows the influence spreads of the different algorithms on the real-world networks with respect to the seed set size k, which increases from 1 to 50. From the figure, we can observe that our method performs the best in all settings. Like the greedy baselines, our method is also independent of the network structures. Moreover, only M = 100 snapshots per subgraph are generated to estimate the influence spreads while selecting the most influential nodes, requiring far fewer MC simulations than the other greedy methods.

Running time. Figure 4 shows the running time for the optimal seed set of size 50 on the four datasets. Unsurprisingly, the heuristic methods, including DegreeDiscount, are the fastest, while the simulation-based greedy method is the slowest because too many MC simulations are employed when estimating influence spreads. We can observe that the speed of our algorithm is as fast as the algorithm which

is a popularly used scalable algorithm. From these results, we can conclude that our algorithm is robust to network structures, performs the best compared with existing algorithms, and scales well to large networks.

7 Conclusions

In this paper, we presented an incremental algorithm for solving the big-network influence maximization problem. Our idea is to break down the original graph into small subgraphs that can be sequentially processed as subgraph streams. Experimental results show that the algorithm is robust to network structures, performs the best compared with existing algorithms, and scales well to large networks.

Acknowledgments

This work was supported by the NSFC, the Strategic Leading Science and Technology Projects of CAS (No. XDA...), the 973 project (No. 2013CB...), and the Australia ARC Discovery Project (DP...).

References

[Balsubramani et al., 2013] Akshay Balsubramani, Sanjoy Dasgupta, and Yoav Freund. The fast convergence of incremental PCA. In Advances in Neural Information Processing Systems, 2013.

[Chen et al., 2009] Wei Chen, Yajun Wang, and Siyu Yang. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009.

[Chen et al., 2010] Wei Chen, Chi Wang, and Yajun Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.

[Chen et al., 2014] Yi-Cheng Chen, Wen-Yuan Zhu, Wen-Chih Peng, Wang-Chien Lee, and Suh-Yin Lee. CIM: Community-based influence maximization in social networks. ACM Trans. Intell. Syst. Technol., 5(2):25:1-25:31, April 2014.

[Cheng et al., 2013] Suqi Cheng, Huawei Shen, Junming Huang, Guoqing Zhang, and Xueqi Cheng. StaticGreedy: Solving the scalability-accuracy dilemma in influence maximization.
In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2013.

[Goyal et al., 2011] Amit Goyal, Wei Lu, and Laks V.S. Lakshmanan. CELF++: Optimizing the greedy algorithm for influence maximization in social networks. In Proceedings of the 20th International Conference Companion on World Wide Web. ACM, 2011.

[Hoeffding, 1963] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.

[Jung et al., 2011] Kyomin Jung, Wooram Heo, and Wei Chen. IRIE: Scalable and robust influence maximization in social networks. arXiv preprint, 2011.

[Kempe et al., 2003] David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2003.

[Kivinen et al., 2004] Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), 2004.

[Laskov et al., 2006] Pavel Laskov, Christian Gehl, Stefan Krüger, and Klaus-Robert Müller. Incremental support vector learning: Analysis, implementation and applications. The Journal of Machine Learning Research, 7:1909-1936, 2006.

[Leskovec et al., 2007] Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007.

[Liu et al., 2014] Qi Liu, Biao Xiang, Enhong Chen, Hui Xiong, Fangshuang Tang, and Jeffrey Xu Yu. Influence maximization over large-scale social networks: A bounded linear approach. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, 2014.

[Nemhauser et al., 1978] George L. Nemhauser, Laurence A. Wolsey, and Marshall L. Fisher.
An analysis of approximations for maximizing submodular set functionsłi. Mathematical Programming, 14(1: , [Ohsaka et al., 2014] aoto Ohsaka, Takuya Akiba, Yuichi Yoshida, and Ken-ichi Kawarabayashi. Fast and accurate influence maximization on large networks with pruned monte-carlo simulations [Poggio, 2001] Gert CauwenberghsTomaso Poggio. Incremental and decremental support vector machine learning. In Advances in eural Information Processing Systems 13: Proceedings of the 2000 Conference, volume 13, page 409. MIT Press, [Wang et al., 2010] Yu Wang, Gao Cong, Guojie Song, and Kunqing Xie. Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages ACM, [Zhou et al., ] Chuan Zhou, Peng Zhang, Wenyu Zang, and Li Guo. On the upper bounds of spread for greedy algorithms in social network influence maximization. IEEE Transactions on Knowledge and Data Engineering, no. 1, pp. 1, PrePrints PrePrints, doi: /tkde


More information

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria 12. LOCAL SEARCH gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley h ttp://www.cs.princeton.edu/~wayne/kleinberg-tardos

More information

On Subset Selection with General Cost Constraints

On Subset Selection with General Cost Constraints On Subset Selection with General Cost Constraints Chao Qian, Jing-Cheng Shi 2, Yang Yu 2, Ke Tang UBRI, School of Computer Science and Technology, University of Science and Technology of China, Hefei 23002,

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Time Series Data Cleaning

Time Series Data Cleaning Time Series Data Cleaning Shaoxu Song http://ise.thss.tsinghua.edu.cn/sxsong/ Dirty Time Series Data Unreliable Readings Sensor monitoring GPS trajectory J. Freire, A. Bessa, F. Chirigati, H. T. Vo, K.

More information

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang

CS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang CS 231A Section 1: Linear Algebra & Probability Review Kevin Tang Kevin Tang Section 1-1 9/30/2011 Topics Support Vector Machines Boosting Viola Jones face detector Linear Algebra Review Notation Operations

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Saniv Kumar, Google Research, NY EECS-6898, Columbia University - Fall, 010 Saniv Kumar 9/13/010 EECS6898 Large Scale Machine Learning 1 Curse of Dimensionality Gaussian Mixture Models

More information

Splitting a Logic Program Revisited

Splitting a Logic Program Revisited Splitting a Logic Program Revisited Jianmin Ji School of Computer Science and Technology University of Science and Technology of China Hefei 230027, China jianmin@ustc.edu.cn Hai Wan and Ziwei Huo and

More information

arxiv: v1 [cs.lg] 15 Aug 2017

arxiv: v1 [cs.lg] 15 Aug 2017 Theoretical Foundation of Co-Training and Disagreement-Based Algorithms arxiv:1708.04403v1 [cs.lg] 15 Aug 017 Abstract Wei Wang, Zhi-Hua Zhou National Key Laboratory for Novel Software Technology, Nanjing

More information