arxiv: v1 [cs.si] 25 May 2016


Stop-and-Stare: Optimal Sampling Algorithms for Viral Marketing in Billion-scale Networks

Hung T. Nguyen, CS Department, Virginia Commonwealth Univ., Richmond, VA, USA
My T. Thai, CISE Department, University of Florida, Gainesville, Florida, USA
Thang N. Dinh, CS Department, Virginia Commonwealth Univ., Richmond, VA, USA

ABSTRACT

Influence Maximization (IM), which seeks a small set of key users who spread the influence widely into the network, is a core problem in multiple domains. It finds applications in viral marketing, epidemic control, and assessing cascading failures within complex systems. Despite the huge amount of effort, IM in billion-scale networks such as Facebook, Twitter, and World Wide Web has not been satisfactorily solved. Even the state-of-the-art methods such as TIM+ and IMM may take days on those networks. In this paper, we propose SSA and D-SSA, two novel sampling frameworks for IM-based viral marketing problems. SSA and D-SSA are up to 100 times faster than the SIGMOD'15 best method, IMM, while providing the same $(1 - 1/e - \epsilon)$ approximation guarantee. Underlying our frameworks is an innovative Stop-and-Stare strategy in which they stop at exponential check points to verify (stare) if there is adequate statistical evidence on the solution quality. Theoretically, we prove that SSA and D-SSA are the first approximation algorithms that use (asymptotically) minimum numbers of samples, meeting strict theoretical thresholds characterized for IM. The absolute superiority of SSA and D-SSA is confirmed through extensive experiments on real network data for IM and another topic-aware viral marketing problem, named TVM.

Keywords: Influence Maximization; Stop-and-Stare; Sampling

1. INTRODUCTION

Viral Marketing, in which brand-awareness information is widely spread via the word-of-mouth effect, has emerged as one of the most effective marketing channels.
It is becoming even more attractive with the explosion of social networking services such as Facebook with 1.5 billion monthly active users or Instagram with more than 3.5 billion daily like connections. To create a successful viral marketing campaign, one needs to seed the content with a set of individuals with high social networking influence. Finding such a set of users is known as the Influence Maximization problem.

Given a network and a budget $k$, Influence Maximization (IM) asks for $k$ influential users who can spread the influence widely into the network. Kempe et al. [1] were the first to formulate IM as a combinatorial optimization problem on the two pioneering diffusion models, namely, Independent Cascade (IC) and Linear Threshold (LT). They prove IM to be NP-hard and provide a natural greedy algorithm that yields $(1 - 1/e - \epsilon)$-approximate solutions for any $\epsilon > 0$. This celebrated work has motivated a vast amount of work on IM in the past decade [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. However, most of the existing methods are either too slow for billion-scale networks [1, 2, 4, 5, 6, 7] or ad-hoc heuristics without performance guarantees [13, 3, 14, 15].

The most scalable methods with performance guarantee for IM are TIM/TIM+ [8] and, later, IMM [16]. They utilize a novel RIS sampling technique introduced by Borgs et al. in [17]. All these methods attempt to generate a $(1 - 1/e - \epsilon)$-approximate solution with minimal numbers of RIS samples. They use highly sophisticated estimating methods to make the number of RIS samples close to some theoretical thresholds $\theta$ [8, 16]. However, they all share two shortcomings: 1) the number of generated samples can be arbitrarily larger than $\theta$, and 2) the thresholds $\theta$ are not shown to be the minimum among their kinds.

In this paper, we 1) unify the approaches in [17, 8, 16] to characterize the necessary number of RIS samples to achieve the $(1 - 1/e - \epsilon)$-approximation guarantee; and 2) design two novel sampling algorithms, SSA and D-SSA, aiming towards achieving the minimum number of RIS samples.
In the first part, we begin with defining the RIS framework, which consists of two necessary conditions to achieve the $(1 - 1/e - \epsilon)$ factor, and classes of RIS thresholds on the sufficient numbers of RIS samples, generalizing the $\theta$ thresholds in [8, 16]. The minimum threshold in each class is then termed the type-1 minimum threshold, and the minimum among all type-1 minimum thresholds is termed the type-2 minimum threshold.

In the second part, we develop the Stop-and-Stare Algorithm (SSA) and its dynamic version D-SSA, which are guaranteed to achieve, within constant factors, the two minimum thresholds, respectively. Both SSA and D-SSA follow the stop-and-stare strategy, which can be efficiently applied to many optimization problems over samples and guarantees some constant times the minimum number of samples required. In short, the algorithms keep generating samples and stop at exponential check points to verify (stare) if there is adequate statistical evidence on the solution quality for termination. This strategy will be shown to address both of the shortcomings in [8, 16]: 1) the number of samples is guaranteed to be close to the theoretical thresholds, and 2) the thresholds are minimal by definition. The dynamic algorithm, D-SSA, improves over SSA by automatically and dynamically selecting the best parameters for the RIS framework. We note that the Stop-and-Stare strategy combined with the RIS framework enables SSA and D-SSA to meet the minimum thresholds without explicitly computing or looking for these thresholds. That is in contrast to previous approaches [17, 8, 16], which all find some explicit unreachable thresholds and then probe for them with unbounded or huge gaps.

Our experiments show that both SSA and D-SSA outperform the best existing methods by up to several orders of magnitude w.r.t. running time while returning comparable seed set quality. More specifically, on the Friendster network with roughly 65.6 million nodes and 1.8 billion edges, SSA and D-SSA, taking 3.5 seconds when $k = 500$, are up to 100 times faster than IMM. We also run CELF++ (the fastest greedy algorithm for IM with guarantees) on the Twitter network with $k = 1000$ and observe that D-SSA is $10^9$ times faster.

Our contributions are summarized as follows.

• We generalize the RIS sampling methods in [17, 8, 16] into a general framework which characterizes the necessary conditions to guarantee the $(1 - 1/e - \epsilon)$-approximation factor. Based on the framework, we define classes of RIS thresholds and two types of minimum thresholds, namely, type-1 and type-2.

• We propose the Stop-and-Stare Algorithm (SSA) and its dynamic version, D-SSA, which both guarantee a $(1 - 1/e - \epsilon)$-approximate solution and are the first algorithms to achieve, within constant factors, the type-1 and type-2 minimum thresholds, respectively. Our proposed methods are not limited to solving the influence maximization problem but can also be generalized to an important class of hard optimization problems over samples/sketches.

• Our framework and approaches are generic and can be applied in principle to sample-based optimization problems to design high-confidence approximation algorithms using the (asymptotically) minimum number of samples.

• We carry out extensive experiments on various real networks with up to several billion edges to show the superiority in performance and comparable solution quality.
To test the applicability of the proposed algorithms, we apply our methods to an IM application, namely, Targeted Viral Marketing (TVM). The results show that our algorithms are up to 100 times faster than the current best method on the IM problem and, for TVM, the speedup is up to 500 times.

Note that this paper does not focus on distributed/parallel computation; however, our algorithms are amenable to a distributed implementation, which is one of our future works.

Related works. Kempe et al. [1] formulated the influence maximization problem as an optimization problem. They show the problem to be NP-complete and devise an $(1 - 1/e - \epsilon)$ greedy algorithm. Later, computing the exact influence is shown to be #P-hard [3]. Leskovec et al. [2] study the influence propagation from a different perspective in which they aim to find a set of nodes in networks to detect the spread of a virus as soon as possible. They improve the simple greedy method with the lazy-forward heuristic (CELF), which was originally proposed to optimize submodular functions in [18], obtaining an (up to) 700-fold speedup.

Several heuristics are developed to find solutions in large networks. While those heuristics are often faster in practice, they fail to retain the $(1 - 1/e - \epsilon)$-approximation guarantee and produce lower quality seed sets. Chen et al. [19] obtain a speedup by using an influence estimation for the IC model. For the LT model, Chen et al. [3] propose to use local directed acyclic graphs (LDAG) to approximate the influence regions of nodes. In a complementary direction, there are recent works on learning the parameters of influence propagation models [20, 21]. Influence maximization is also studied in other diffusion models, including the majority threshold model [22], when both positive and negative influences are considered [23], and when the propagation terminates after a predefined time [22, 24]. Recently, IM across multiple OSNs has been studied in [11], and [25] studies the IM problem on continuous-time diffusion models.

Recently, Borgs et al.
[17] make a theoretical breakthrough and present an $O(k l^2 (m+n) \log^2 n / \epsilon^3)$ time algorithm for IM under the IC model. Their algorithm (RIS) returns a $(1 - 1/e - \epsilon)$-approximate solution with probability at least $1 - n^{-l}$. In practice, the proposed algorithm is, however, less than satisfactory due to the rather large hidden constants. In subsequent works, Tang et al. [8, 16] reduce the running time to $O((k + l)(m + n) \log n / \epsilon^2)$ and show that their algorithm is also very efficient in large networks with billions of edges. Nevertheless, Tang's algorithms have two weaknesses: 1) intractable estimation of the maximum influence, and 2) taking union bounds over all possible seed sets in order to guarantee a single returned set.

Organization. The rest of the paper is organized as follows. In Section 2, we introduce two fundamental models, i.e., LT and IC, and the IM problem definition. We subsequently devise the unified RIS framework, the RIS threshold and two types of RIS minimum thresholds in Section 3. Sections 4 and 5 present the SSA algorithm and prove the approximation factor as well as the achievement of the type-1 minimum threshold. In Section 6, we propose the dynamic algorithm, D-SSA, and prove the approximation together with the type-2 minimum threshold property. Finally, we show experimental results in Section 7 and draw some conclusions in Section 8.

2. MODELS AND PROBLEM DEFINITION

This section formally defines the two most essential propagation models, i.e., Linear Threshold (LT) and Independent Cascade (IC), that we consider in this work, followed by the problem statement of Influence Maximization (IM).

We abstract a network using a weighted graph $G = (V, E, w)$ with $|V| = n$ nodes and $|E| = m$ directed edges. Each edge $(u, v) \in E$ is associated with a weight $w(u, v) \in [0, 1]$ which indicates the probability that $u$ influences $v$.

2.1 Propagation Models

In this paper, we study two fundamental diffusion models, namely, Linear Threshold (LT) and Independent Cascade (IC).
Assume that we have a set of seed nodes $S$; the propagation processes under these two models happen in rounds. At round 0, all nodes in $S$ are activated and the others are not activated. In the subsequent rounds, the newly activated nodes will try to activate their neighbors. Once a node $v$ becomes active, it will remain active till the end. The process stops when no more nodes get activated. The distinctions of the two models are described as follows.

Linear Threshold (LT) model. The edge weights in the LT model must satisfy the condition $\sum_{u \in V} w(u, v) \le 1$. At the beginning of the propagation process, each node $v$ selects a random threshold $\lambda_v$ uniformly at random in the range $[0, 1]$. In round $t \ge 1$, an inactivated node $v$ becomes activated if $\sum_{\text{activated neighbors } u} w(u, v) \ge \lambda_v$. Let $I(S)$ denote the expected number of activated nodes given the seed set $S$, where the expectation is taken over all $\lambda_v$ values from their uniform distribution. We call $I(S)$ the influence spread of $S$ under the LT model.

Independent Cascade (IC) model. In IC, when a node $u$ gets activated, initially or by another node, it has a single chance to activate each inactive neighbor $v$ with the probability proportional to the edge weight $w(u, v)$. After that moment, the activated nodes remain in their active state but they have no contribution in later activations.

Table 1: Table of Symbols

Notation                        Description
$n, m$                          #nodes, #edges of graph $G = (V, E, w)$.
$I(S)$                          Influence spread of seed set $S \subseteq V$.
$\mathrm{OPT}_k$                The maximum $I(S)$ for any size-$k$ seed set $S$.
$\hat{S}_k$                     The returned size-$k$ seed set of SSA/D-SSA.
$S^*_k$                         An optimal size-$k$ seed set, i.e., $I(S^*_k) = \mathrm{OPT}_k$.
$R_j$                           A random RR set.
$\mathcal{R}$                   A set of random RR sets.
$\mathrm{Cov}_{\mathcal{R}}(S)$ #RR sets in $\mathcal{R}$ incident at some node in $S \subseteq V$.
$c$                             $c = 2(e - 2)$.
$\Upsilon(\epsilon, \delta)$    $\Upsilon(\epsilon, \delta) = c \ln\frac{1}{\delta} \cdot \frac{1}{\epsilon^2}$.
$\Lambda_1$                     $\Lambda_1 = (1 + \epsilon_1)(1 + \epsilon_2)\Upsilon(\epsilon_3, \delta/3)$.
$\Lambda_2$                     $\Lambda_2 = (1 + \epsilon_2)\Upsilon(\epsilon_2, \delta_2)$.

2.2 Problem Definition

Given the propagation models defined previously, we formally state the Influence Maximization (IM) problem in the following definition.

Definition 1 (Influence Maximization (IM)). Given a graph $G = (V, E, w)$, an integer $k \in \mathbb{Z}^+$ and a propagation model, the Influence Maximization problem asks for a seed set $\hat{S}_k \subset V$ of $k$ nodes that maximizes its influence spread, $I(\hat{S}_k)$.

3. UNIFIED RIS FRAMEWORK

This section presents the unified RIS framework, which generalizes the methods of using RIS sampling for the IM problem. The unified framework characterizes the sufficient conditions to guarantee an $(1 - 1/e - \epsilon)$-approximation in the framework.
Subsequently, we will introduce the concept of RIS threshold, in terms of the number of samples necessary to guarantee the solution quality, and two types of minimum RIS thresholds, i.e., type-1 and type-2.

3.1 Preliminaries

3.1.1 RIS sampling

The major bottleneck in the traditional methods for IM [1, 2, 4, 6] is the inefficiency in estimating the influence spread. To address that, Borgs et al. [17] introduced a novel sampling approach for IM, called Reverse Influence Sampling (in short, RIS), which is the foundation for TIM/TIM+ [8] and IMM [16], the state-of-the-art methods.

Figure 1: An example of generating random RR sets under the LT model. Three random RR sets $R_1$, $R_2$ and $R_3$ are generated. Node $a$ has the highest influence and is also the most frequent element across the RR sets.

Given a graph $G = (V, E, w)$, RIS captures the influence landscape of $G$ through generating a set $\mathcal{R}$ of random Reverse Reachable (RR) sets. The term RR set is also used in TIM/TIM+ [8, 16] and referred to as hyperedge in [17]. Each RR set $R_j$ is a subset of $V$ constructed as follows.

Definition 2 (Reverse Reachable (RR) set). Given $G = (V, E, w)$, a random RR set $R_j$ is generated from $G$ by 1) selecting a random node $v \in V$, 2) generating a sample graph $g$ from $G$, and 3) returning $R_j$ as the set of nodes that can reach $v$ in $g$.

Node $v$ in the above definition is called the source of $R_j$. Observe that $R_j$ contains the nodes that can influence its source $v$. If we generate multiple random RR sets, influential nodes will likely appear frequently in the RR sets. Thus a seed set $S$ that covers most of the RR sets will likely maximize the influence spread $I(S)$. Here a seed set $S$ covers an RR set $R_j$ if $S \cap R_j \ne \emptyset$. For convenience, we denote the coverage of set $S$ as $\mathrm{Cov}_{\mathcal{R}}(S) = \sum_{R_j \in \mathcal{R}} \min\{|S \cap R_j|, 1\}$.

An illustration of this intuition and of how to generate RR sets is given in Fig. 1. In the figure, three random RR sets are generated following the LT model with sources $b$, $d$ and $c$, respectively.
The influence of node $a$ is the highest among all the nodes in the original graph, and $a$ is also the most frequent node across the RR sets. This observation is rigorously captured in the following lemma in [17].

Lemma 1. Given $G = (V, E, w)$ and a random RR set $R_j$ generated from $G$. For each seed set $S \subset V$,

$$I(S) = n \Pr[S \text{ covers } R_j]. \quad (1)$$

Lemma 1 says that the influence of a node set $S$ is proportional to the probability that $S$ intersects with a random RR set. Thus, finding $S$ that maximizes $I(S)$ can be approximated by finding $S$ that intersects as many $R_j$ as possible. The critical issue is the minimum number of samples needed to provide bounded-error guarantees.

3.1.2 $(\epsilon, \delta)$-approximation

The bounded-error guarantee we seek is based on the concept of $(\epsilon, \delta)$-approximation.

Definition 3 ($(\epsilon, \delta)$-approximation). Let $Z_1, Z_2, \ldots$ be independently and identically distributed samples according to $Z$ in the interval $[0, 1]$ with mean $\mu_Z$ and variance $\sigma_Z^2$. A Monte Carlo estimator of $\mu_Z$,

$$\hat{\mu}_Z = \frac{1}{T} \sum_{i=1}^{T} Z_i \quad (2)$$

is said to be an $(\epsilon, \delta)$-approximation of $\mu_Z$ if

$$\Pr[(1 - \epsilon)\mu_Z \le \hat{\mu}_Z \le (1 + \epsilon)\mu_Z] \ge 1 - \delta. \quad (3)$$
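Definition 2 and Lemma 1 can be illustrated with a minimal Python sketch, assuming the IC model and a reverse adjacency list in which `in_nbrs[v]` holds pairs `(u, w(u, v))`. The names `rr_set_ic` and `estimate_influence` and the toy path graph are illustrative assumptions, not code from the paper. The returned value $n \cdot \mathrm{Cov}_{\mathcal{R}}(S)/|\mathcal{R}|$ is the Monte Carlo estimator of Eq. 2 scaled by $n$.

```python
import random

def rr_set_ic(n, in_nbrs, rng):
    """Sample one random RR set under the IC model (Definition 2).

    in_nbrs[v] lists (u, w) pairs for edges u -> v with weight w.
    Each incoming edge is sampled lazily with probability w, which is
    equivalent to drawing the sample graph g first and searching backwards.
    """
    source = rng.randrange(n)              # 1) pick a random source node
    reached, stack = {source}, [source]
    while stack:                           # 2)+3) reverse search in the sample graph
        v = stack.pop()
        for u, w in in_nbrs[v]:
            if u not in reached and rng.random() < w:
                reached.add(u)
                stack.append(u)
    return reached

def estimate_influence(n, in_nbrs, seeds, num_samples, rng):
    """Estimate I(S) = n * Pr[S covers R_j] (Lemma 1) by sampling."""
    seeds = set(seeds)
    covered = sum(1 for _ in range(num_samples)
                  if seeds & rr_set_ic(n, in_nbrs, rng))
    return n * covered / num_samples

# Toy graph (assumed for illustration): path 0 -> 1 -> 2 with all weights 1.0,
# so node 0 always activates every node and I({0}) = 3.
in_nbrs = {0: [], 1: [(0, 1.0)], 2: [(1, 1.0)]}
rng = random.Random(7)
print(estimate_influence(3, in_nbrs, [0], 1000, rng))
```

On this deterministic toy graph every RR set contains node 0, so the estimate for the seed set {0} equals the true influence. On graphs with fractional weights the estimate fluctuates, which is exactly why the number of samples matters.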

The number of samples needed to guarantee an $(\epsilon, \delta)$-approximation for the Monte Carlo method is well-known. For example, we shall use the two lemmas below from [27].

Lemma 2 ([27]). Let $\theta$ be the optimal number of samples that guarantee an $(\epsilon, \delta)$-approximation of $\mu_Z$, and define $\Upsilon(\epsilon, \delta/2) = c \ln(2/\delta)/\epsilon^2$ with $c = 2(e - 2)$ and $\mathrm{Cov}(Z) = \sum_{i=1}^{T} Z_i$. If $T = \frac{\Upsilon(\epsilon, \delta/2)}{\mu_Z}$, then $\hat{\mu}_Z = \frac{\mathrm{Cov}(Z)}{T}$ is an $(\epsilon, \delta)$-approximation of $\mu_Z$ and

$$T \le c_1 \theta \quad (4)$$

for some constant $c_1 > 0$.

Note that calculating $T$ using the above lemma requires the knowledge of the unknown value $\mu_Z$. To avoid coping with $\mu_Z$ directly, [27] also provides a simple stopping condition which depends on the number of $Z_j = 1$ observed.

Lemma 3 ([27]). Let $\theta$ and $\mu_Z$ be defined as in Lem. 2 and let $T$ be the number of samples at which $\mathrm{Cov}(Z) \ge 1 + (1 + \epsilon)\Upsilon(\epsilon, \delta/2)$. Then $\hat{\mu}_Z$ is an $(\epsilon, \delta)$-approximation of $\mu_Z$ and $T \le (1 + \epsilon) c_1 \theta$, where $c_1$ is the constant in Lem. 2.

Note that if only one side of the event in Eq. 3 is required, then $\Upsilon(\epsilon, \delta/2)$ becomes

$$\Upsilon(\epsilon, \delta) = c \ln\frac{1}{\delta} \cdot \frac{1}{\epsilon^2}. \quad (5)$$

3.2 RIS Framework and Thresholds

Based on Lem. 1, the IM problem can be solved by the following two-step algorithm.

• Generate a collection of RR sets, $\mathcal{R}$, from $G$.

• Use the greedy algorithm for the Max-coverage problem [28] to find a seed set $\hat{S}_k$ that covers the maximum number of RR sets and return $\hat{S}_k$ as the solution.

Figure 2: Overview of algorithms based on optimization over samples ($\epsilon$ is the error from approximating $f(S)$ by $\hat{f}_T(S)$; $\mathrm{OPT}_f$ and $\mathrm{OPT}_{\hat{f}_T}$ are optimal solutions of $f(.)$ and $\hat{f}_T(.)$).

This two-step algorithm is actually an instance of a general class of methods illustrated in Fig. 2. The original problem is a maximization problem of $f(.)$ over $S \in \Omega$, which is usually very hard to solve or approximate directly. Instead, we find an estimate $\hat{f}_T(.)$ of $f(.)$. The estimation function $\hat{f}_T(.)$ is constructed by generating $T = \theta(\epsilon, \delta)$ samples, where $\theta(\epsilon, \delta)$ is an explicit threshold. The threshold $\theta(\epsilon, \delta)$ decides the estimation quality of $\hat{f}_T(.)$ compared to $f(.)$ and is usually the most critical point in the methods.
After having the function $\hat{f}_T(.)$, an $\alpha$-approximation algorithm $A$, which is easier and more efficient, is used to find the final solution $S_A$ of $\hat{f}_T(.)$ as well as $f(.)$.

The function $f(.)$ that characterizes our maximization objective covers a wide range of important problems, e.g., targeted viral marketing [29] and densest subgraph [30], which have very high complexity to approximate and hence need to rely on sampling algorithms. Thus, our proposed stop-and-stare methods can be modified to address many other optimization problems under this category. We illustrate this ability by extending SSA and D-SSA to solve targeted viral marketing in our experiments section.

Similar to determining $\theta$, the core question in applying the above algorithm to the influence maximization problem is: how many RR sets are sufficient to provide a good approximate solution? In this case, the function $f(S)$ is the influence function $I(S)$ and the samples are generated by RIS sampling. [8, 16] propose two such theoretical thresholds and two probing techniques to realistically estimate those thresholds. However, their thresholds are not known to be any kind of minimum, and the probing method is ad hoc in [8] or far from the proposed threshold in [16]. Thus, they cannot provide any guarantee on the optimality of the number of samples generated.

Since probing for an explicit threshold is seen to admit certain limitations, we propose a new approach. Instead of explicitly expressing a theoretical threshold and trying to probe for it, we characterize the conditions that all the RIS-based algorithms need to attain and state the sufficient number of samples to satisfy those conditions. Thus, we can define the minimum number of samples according to the necessary conditions and, further, propose SSA and D-SSA to achieve the minimum. We will first define our RIS framework as enforcing the necessary conditions to guarantee the best known solution quality and then propose two minimum thresholds based on the precision parameters in our framework.
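The second step of the two-step algorithm above can be sketched in Python as follows, assuming the RR sets were already generated and are given as sets of integer node ids. The function name `greedy_max_coverage` and the hand-written RR sets are hypothetical. This plain version rescans the uncovered RR sets in every round, so it is simpler but slower than a linear-time implementation.

```python
def greedy_max_coverage(rr_sets, k, n):
    """Greedy Max-Coverage over RR sets (requires k <= n).

    Returns (seeds, estimate), where estimate = n * Cov_R(seeds) / |R|
    is the influence estimate suggested by Lemma 1.
    """
    covered = [False] * len(rr_sets)
    seeds = []
    for _ in range(k):
        # Marginal gain of v = number of still-uncovered RR sets containing v.
        gain = [0] * n
        for i, rr in enumerate(rr_sets):
            if not covered[i]:
                for v in rr:
                    gain[v] += 1
        for s in seeds:
            gain[s] = -1                   # never pick the same node twice
        best = max(range(n), key=lambda v: gain[v])  # ties go to the smallest id
        seeds.append(best)
        for i, rr in enumerate(rr_sets):
            if best in rr:
                covered[i] = True
    return seeds, n * sum(covered) / len(rr_sets)

# Hypothetical RR sets over nodes 0..4.
rr_sets = [{0, 1}, {0, 2}, {0}, {3}, {3, 4}]
seeds, estimate = greedy_max_coverage(rr_sets, k=2, n=5)
print(seeds, estimate)  # [0, 3] 5.0
```

Node 0 covers three RR sets and node 3 covers the remaining two, so with $k = 2$ all five sets are covered. The estimate is computed on the same RR sets that were used to pick the seeds, so it tends to be optimistic when few samples are available.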
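Suppose that there is an optimal seed set, $S^*_k$, which has the maximum influence in the network. If there are multiple optimal sets with influence $\mathrm{OPT}_k$, we choose the first one alphabetically to be $S^*_k$. Given $0 \le \epsilon, \delta \le 1$, our unified RIS framework enforces two conditions:

$$\Pr[\hat{I}(\hat{S}_k) \le (1 + \epsilon_a) I(\hat{S}_k)] \ge 1 - \delta_a \quad (6)$$

$$\Pr[\hat{I}(S^*_k) \ge (1 - \epsilon_b)\mathrm{OPT}_k] \ge 1 - \delta_b \quad (7)$$

where $\delta_a + \delta_b \le \delta$ and $\epsilon_a + (1 - 1/e)\epsilon_b \le \epsilon$. Based on the above conditions, we define the RIS threshold as follows.

Definition 4 (RIS Threshold). Given a graph $G$ and $0 \le \epsilon_a, \epsilon_b, \delta_a, \delta_b \le 1$, $N(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ is called an RIS threshold in $G$ w.r.t. $\epsilon_a, \epsilon_b, \delta_a, \delta_b$ if any number $|\mathcal{R}| \ge N(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ of random RR sets generated from $G$ is sufficient to guarantee both Eqs. 6 and 7.

With the two above conditions, we now prove that any $N \ge N(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets are sufficient to guarantee the $(1 - 1/e - \epsilon)$-approximation ratio.

Theorem 1. Given a graph $G$ and $0 \le \epsilon_a, \epsilon_b, \delta_a, \delta_b \le 1$, let $\epsilon \ge \epsilon_a + (1 - 1/e)\epsilon_b$ and $\delta \ge \delta_a + \delta_b$. If the number of RR sets $|\mathcal{R}| \ge N(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$, then the two-step algorithm in our RIS framework returns $\hat{S}_k$ satisfying

$$\Pr[I(\hat{S}_k) \ge (1 - 1/e - \epsilon)\mathrm{OPT}_k] \ge 1 - \delta \quad (8)$$

which means $\hat{S}_k$ is an $(1 - 1/e - \epsilon)$-approximate solution.

For brevity, the proof is presented in the appendix.

Existing RIS thresholds. For any $\epsilon, \delta \in (0, 1)$, Tang et al. established in [8] an RIS threshold,

$$N\left(\frac{\epsilon}{2}, \frac{\epsilon}{2}, \frac{\delta}{2\binom{n}{k}}, \frac{\delta}{2\binom{n}{k}}\right) = (8 + 2\epsilon)\, n\, \frac{\ln(2/\delta) + \ln\binom{n}{k}}{\epsilon^2\, \mathrm{OPT}_k}.$$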

In a later study [16], they reduced this number to another RIS threshold,

$$N\left(\epsilon_1, \epsilon - \epsilon_1, \frac{\delta}{2\binom{n}{k}}, \frac{\delta}{2}\right) = \frac{2n\,((1 - 1/e)\alpha + \beta)^2}{\epsilon^2\, \mathrm{OPT}_k},$$

where $\alpha = \left(\ln\frac{2}{\delta}\right)^{1/2}$, $\beta = \left((1 - 1/e)\left(\ln\frac{2}{\delta} + \ln\binom{n}{k}\right)\right)^{1/2}$ and $\epsilon_1 = \frac{\epsilon\,\alpha}{(1 - 1/e)\alpha + \beta}$.

Unfortunately, computing $\mathrm{OPT}_k$ is intractable; thus, the proposed algorithms have to generate $\theta\,\frac{\mathrm{OPT}_k}{KPT^+}$ RR sets, where $KPT^+$ is the expected influence of a node set obtained by sampling $k$ nodes with replacement from $G$, and the ratio $\frac{\mathrm{OPT}_k}{KPT^+} \ge 1$ is not upper-bounded. That is, they may generate many ($\frac{\mathrm{OPT}_k}{KPT^+}$) times more RR sets than needed.

3.3 Two Types of Minimum Thresholds

Based on the definition of RIS threshold, we now define two strong theoretical limits, i.e., the type-1 minimum and type-2 minimum thresholds. In Section 5, we will prove that our first proposed algorithm, SSA, achieves, within a constant factor, a type-1 minimum threshold and later, in Section 6, our dynamic algorithm, D-SSA, is shown to obtain, within a constant factor, the strongest type-2 minimum threshold.

Definition 5 (type-1 minimum threshold). Given $0 \le \epsilon, \delta \le 1$ and $0 \le \epsilon_a, \epsilon_b, \delta_a, \delta_b \le 1$ satisfying $\delta_a + \delta_b \le \delta$ and $\epsilon_a + (1 - 1/e)\epsilon_b \le \epsilon$, $N^1_{min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ is called a type-1 minimum threshold w.r.t. $\epsilon_a, \epsilon_b, \delta_a, \delta_b$ if $N^1_{min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ is the smallest number of RR sets that satisfies both Eq. 6 and Eq. 7.

If $N(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ is an RIS threshold, then any $N$ such that $N \ge N(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ is also an RIS threshold. We choose the smallest number over all the RIS thresholds to be the type-1 minimum as defined in Def. 5. All the previous methods [17, 8, 16] try to approximate $N^1_{min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ for some setting of $\epsilon_a, \epsilon_b, \delta_a, \delta_b$; however, they fail to provide any guarantee on how close their numbers are to that threshold. In contrast, we show in Sec. 5 that SSA achieves, within a constant factor, a type-1 minimum threshold. Next, we give the definition of the strongest type-2 minimum threshold, which is achieved by D-SSA as shown in Sec. 6.

Definition 6 (type-2 minimum threshold).
Given $0 \le \epsilon, \delta \le 1$, $N^2_{min}(\epsilon, \delta)$ is called the type-2 minimum threshold if

$$N^2_{min}(\epsilon, \delta) = \min_{\epsilon_a, \epsilon_b, \delta_a, \delta_b} N^1_{min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b) \quad (9)$$

where $\epsilon_a + (1 - 1/e)\epsilon_b = \epsilon$ and $\delta_a + \delta_b = \delta$.

From Def. 6, it follows that the type-2 minimum is the strongest possible threshold that we can achieve in the RIS framework.

3.4 Achieving the Minimum Thresholds

In the following sections, we propose two approximation algorithms, namely, Stop-and-Stare (SSA) and the dynamic version D-SSA, which respectively achieve the two theoretical minimum thresholds as well as the best known worst-case approximation ratio. In more detail, both SSA and D-SSA employ the stop-and-stare strategy that doubles the number of RIS samples and checks the quality of the current solution by an independent influence estimation step. This strategy guarantees that we do not oversample excessively, i.e., we at most double the necessary number in the worst case.

On a specific setting of the tuple $\{\epsilon_a, \epsilon_b, \delta_a, \delta_b\}$, SSA guarantees a type-1 minimum threshold corresponding to that configuration. Because the parameters are set in advance, the SSA algorithm is simpler than the dynamic D-SSA. D-SSA achieves the type-2 minimum threshold by dynamically finding the best set of parameter values at each exponential check point. Thus, $\{\epsilon_a, \epsilon_b, \delta_a, \delta_b\}$ are not specified in advance but automatically detected by D-SSA. Further, D-SSA can reuse the RR sets generated in the independent influence estimation without changing the independence property of RR sets in subsequent iterations.

4. STOP-AND-STARE ALGORITHM (SSA)

In this section, we describe our first Stop-and-Stare Algorithm (SSA) in detail. SSA keeps generating RR sets until the current number has doubled and then stops to check the solution obtained from all the RR sets generated so far. It uses Max-Coverage (Alg. 2) to find the solution, $\hat{S}_k$. Since the influence of $\hat{S}_k$ calculated from those RR sets is biased, SSA independently generates another collection of RR sets in the Estimate-Inf procedure (Alg. 3) to obtain an unbiased estimation of the influence of $\hat{S}_k$.
Then it stares at the two influences and, if they are close enough (satisfying SSA's stopping conditions), it halts and returns the found solution.

Algorithm 1 SSA Algorithm
Input: Graph $G$, $0 \le \epsilon, \delta \le 1$, and a budget $k$.
Output: An $(1 - 1/e - \epsilon)$-optimal solution, $\hat{S}_k$, with at least $(1 - \delta)$-probability.
1: Compute $\epsilon_1, \epsilon_2, \epsilon_3$ according to Eq. 18
2: $\Lambda_1 \leftarrow (1 + \epsilon_1)(1 + \epsilon_2)\, c \ln\frac{3}{\delta} \cdot \frac{1}{\epsilon_3^2}$
3: $\mathcal{R} \leftarrow$ Generate $\Lambda_1$ random RR sets by RIS
4: repeat
5:     $\langle \hat{S}_k, \hat{I}(\hat{S}_k) \rangle \leftarrow$ Max-Coverage($\mathcal{R}, k, n$)
6:     if $\mathrm{Cov}_{\mathcal{R}}(\hat{S}_k) \ge \Lambda_1$ then
7:         $I_c(\hat{S}_k) \leftarrow$ Estimate-Inf($G, \hat{S}_k, \epsilon_2, \frac{\delta}{3}, |\mathcal{R}| \frac{1 + \epsilon_2}{1 - \epsilon_2} \frac{\epsilon_3^2}{\epsilon_2^2}$)
8:         if $\hat{I}(\hat{S}_k) \le (1 + \epsilon_1) I_c(\hat{S}_k)$ then
9:             return $\hat{S}_k$
10:         end if
11:     end if
12:     $\mathcal{R} \leftarrow \mathcal{R} \cup$ {Generate $|\mathcal{R}|$ random RR sets by RIS}
13: until $|\mathcal{R}| \ge (8 + 2\epsilon)\, n\, \frac{\ln(2/\delta) + \ln\binom{n}{k}}{\epsilon^2}$
14: return $\hat{S}_k$

4.1 SSA Algorithm

This subsection details the main procedure of SSA, in which Max-Coverage (described in Alg. 2) and Estimate-Inf (described in Alg. 3) are incorporated. The precision parameters $\epsilon_1, \epsilon_2, \epsilon_3, \delta_1, \delta_2$ are specified in Eq. 18 and discussed together with the parameter settings in Section 5. An illustration of those precision parameters in SSA (and also later for D-SSA) is provided in Fig. 3. This figure also demonstrates the bounding technique in our framework (Fig. 2): the overall error is accumulated from the sampling errors in estimating $I(S^*_k)$ and $I(\hat{S}_k)$ and the approximation error of Max-Coverage in finding the set $\hat{S}_k$.

Figure 3: Illustration of precision parameters $\epsilon_1, \epsilon_2, \epsilon_3, \delta_1, \delta_2$ ($I(.)$ denotes the true influence function; $\hat{I}(.)$ and $I_c(.)$ are the estimates of $I(.)$ by the collection $\mathcal{R}$ of RR sets in the main algorithm and by another independent collection of RR sets in Estimate-Inf, respectively; $S^*_k$ is an optimal solution). The figure labels the overall error as $\epsilon = \epsilon_1 + \epsilon_2 + \epsilon_1\epsilon_2 + (1 - 1/e)\epsilon_3$ and the confidence as $1 - \delta_1 - \delta_2$.

Algorithm 2 Max-Coverage procedure
Input: RR sets ($\mathcal{R}$), $k$ and number of nodes ($n$).
Output: An $(1 - 1/e)$-optimal solution, $\hat{S}_k$, and its estimated influence, $\hat{I}(\hat{S}_k)$.
1: $\hat{S}_k = \emptyset$
2: for $i = 1$ to $k$ do
3:     $\hat{v} \leftarrow \arg\max_{v \in V} (\mathrm{Cov}_{\mathcal{R}}(\hat{S}_k \cup \{v\}) - \mathrm{Cov}_{\mathcal{R}}(\hat{S}_k))$
4:     Add $\hat{v}$ to $\hat{S}_k$
5: end for
6: return $\langle \hat{S}_k, \mathrm{Cov}_{\mathcal{R}}(\hat{S}_k) \cdot n/|\mathcal{R}| \rangle$

The algorithm is presented in Alg. 1. SSA starts by initializing the values of three variables $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$, which are critical in deciding the stopping conditions and will be specified in our analysis (Eq. 18), where we prove the type-1 minimum threshold of SSA. Having obtained those values, the algorithm then computes $\Lambda_1$ (Line 2), which determines the lower bound for the degree (i.e., the coverage) of the selected seed set, $\hat{S}_k$. The central part of our algorithm iterates round by round: at round 0, SSA generates $\Lambda_1$ random RR sets (Line 3), simply because that is the lowest number with which we can expect to meet the first condition at Line 6; at round $i \ge 1$, it doubles the number of random RR sets by introducing $|\mathcal{R}|$ more sets (Line 12). In each round, a candidate seed set $\hat{S}_k$ is selected by Max-Coverage together with its estimated influence. SSA obtains another estimated influence, which is returned by Estimate-Inf (Line 7) and denoted by $I_c(\hat{S}_k)$. The two influences are used in the second condition (Line 8). Thus, SSA contains two stopping conditions:

(C1) The first condition, $\mathrm{Cov}_{\mathcal{R}}(\hat{S}_k) \ge \Lambda_1$ (Line 6), ensures that the coverage of the returned solution is at least $\Lambda_1$, which is important to guarantee the approximability of $\hat{S}_k$ and $S^*_k$, as shown in Lems. 5 and 6. This condition remains true after the first time it is established.

(C2) The second condition, $\hat{I}(\hat{S}_k) \le (1 + \epsilon_1) I_c(\hat{S}_k)$ (Line 8), compares the estimates of $I(\hat{S}_k)$ obtained by Max-Coverage and by the Estimate-Inf procedure.
This comparison is only active after the first stopping condition is met and the number of RR sets is large enough to trigger Estimate-Inf. If these two estimates are close enough (within a multiplicative factor of $1 + \epsilon_1$), we confirm the approximability of $\hat{S}_k$ (Lem. 5).

As we will prove in Sec. 5, the two stopping conditions are sufficient to guarantee the $(1 - 1/e - \epsilon)$-approximation of $\hat{S}_k$. Note that we use a threshold $(8 + 2\epsilon)\, n\, \frac{\ln(2/\delta) + \ln\binom{n}{k}}{\epsilon^2}$, which is taken from TIM+ [8] when considering $\mathrm{OPT}_k = 1$, to stop in cases of bad events.

4.2 Finding Max-Coverage

We describe the Max-Coverage procedure to find a $(1 - 1/e)$-coverage set. This algorithm plays the role of the $\alpha$-approximation algorithm in Fig. 2, where $\alpha = 1 - 1/e$. Alg. 2 illustrates the greedy Max-Coverage algorithm to select a size-$k$ seed set. The whole procedure goes through $k$ iterations in which, at each step, a node with maximum marginal coverage, with respect to the previously chosen ones, is added to the seed set $\hat{S}_k$. As a well-known result [31], we have the following lemma.

Lemma 4. The greedy Max-Coverage returns an $(1 - 1/e)$-approximate seed set $\hat{S}_k$.

This algorithm can be implemented in linear time in terms of the total size of all the RR sets, as in [17, 8, 16]; as a result, its complexity is upper-bounded by that of generating the RR sets. Thus, the complexity of the whole algorithm actually depends only on that of generating RR sets.

4.3 Influence Estimation

Estimating the influence of a given seed set $S$ is a key component of our method and is used wherever we have a candidate seed set and need to check whether the estimate of that set in the main algorithm is good enough. Our goal is to obtain a good approximation (within a certain multiplicative error) of the influence of the given seed set.

Algorithm 3 Estimate-Inf procedure
Input: Graph $G$, a seed set $S$, $0 \le \epsilon_2, \delta_2 \le 1$ and maximum number of samples, $T_{max}$.
Output: $I_c(S)$ such that $I_c(S) \le (1 + \epsilon_2) I(S)$ with at least $(1 - \delta_2)$-probability, or exceeding $T_{max}$.
1: $\Lambda_2 = 1 + c(1 + \epsilon_2) \ln\frac{1}{\delta_2} \cdot \frac{1}{\epsilon_2^2}$
2: Cov $= 0$
3: for $T$ from 1 to $T_{max}$ do
4:     Generate $R_j \leftarrow$ RIS($G$)
5:     Cov $=$ Cov $+ \min\{|R_j \cap S|, 1\}$
6:     if Cov $\ge \Lambda_2$ then
7:         return $n \cdot$ Cov$/T$ {$n$: number of nodes}
8:     end if
9: end for
10: return $-1$ {Exceeding $T_{max}$ RR sets}

The estimating procedure is called Estimate-Inf and detailed in Alg. 3. Given a node set $S$ and two parameters, $\epsilon_2$ and $\delta_2$, Estimate-Inf returns an estimate $I_c(S)$ of $I(S)$ such that $I_c(S) \le (1 + \epsilon_2) I(S)$ with at least $(1 - \delta_2)$-probability. The most crucial point in the procedure is determining $\Lambda_2$ (Line 1) and the related condition (Line 6), which compares the coverage of $S$ with $\Lambda_2$. In our analysis, we show that the condition is sufficient to guarantee $I_c(S) \le (1 + \epsilon_2) I(S)$ with at least $(1 - \delta_2)$-probability.

The main step in this procedure is the for loop, in which random RR sets are drawn one at a time until the condition is satisfied. The loop runs for at most $T_{max}$ iterations where, as described in Alg. 1, $T_{max}$ is a constant multiple of the total number of RR sets generated in the main algorithm at that point. Choosing $T_{max}$ is a crucial point of this algorithm since we run this procedure at every iteration in SSA. At the beginning, when the solution $\hat{S}_k$ is not [...]

We define a random variable $X = \min\{|R_j \cap S|, 1\}$, where $R_j$ is a random RR set, so that $\mu_X = I(S)/n$ (Lem. 1). From Lem. 3 of the $(\epsilon, \delta)$-approximation criteria, we obtain a direct corollary as stated below.

Corollary 1. The Estimate-Inf procedure of SSA returns an estimate, $I_c(S)$, of $I(S)$ such that

$$\Pr[I_c(S) \le (1 + \epsilon_2) I(S)] \ge 1 - \delta_2 = 1 - \delta/3. \quad (10)$$

5. GUARANTEE AND PERFORMANCE ANALYSIS

In this section, we will prove that SSA returns a $(1 - 1/e - \epsilon)$-approximate solution with at least $(1 - \delta)$-probability in Subsec. 5.1. Subsequently, SSA is shown to require no more than a constant factor of a type-1 minimum threshold of RR sets with the same probability in Subsec. 5.2.

5.1 Approximation Guarantee

In this subsection, we will prove that SSA achieves the approximation factor of $(1 - 1/e - \epsilon)$ with at least $(1 - \delta)$-probability. The proof essentially contains two core components, which correspond to the two conditions in our RIS framework (Subsection 3.2): 1) prove that SSA at termination achieves a good approximation of the selected solution, $\hat{S}_k$ (Lem. 5), and 2) the hidden optimal solution, $S^*_k$, is also well-estimated (Lem. 6). Thus, combining these with Theo. 1 (Subsection 3.2) gives us the approximation factor $(1 - 1/e - \epsilon)$ stated in Theo. 2.

The first component, the quality of the estimated influence $\hat{I}(\hat{S}_k)$ of the returned solution that SSA has at termination, is shown in Lem. 5.

Lemma 5. SSA returns a seed set, $\hat{S}_k$, with

$$\Pr[\hat{I}(\hat{S}_k) \le (1 + \epsilon'_1) I(\hat{S}_k)] \ge 1 - \delta/3 \quad (11)$$

where $\epsilon'_1 = \epsilon_1 + \epsilon_2 + \epsilon_1\epsilon_2$.

The proof is presented in the appendix. Based on Lem. 5, we prove the second component, which also contains the influence estimation of the optimal solution, $S^*_k$.

Lemma 6. SSA terminates with

$$\Pr[\hat{I}(S^*_k) < (1 - \epsilon_3)\mathrm{OPT}_k \text{ or } \hat{I}(\hat{S}_k) > (1 + \epsilon'_1) I(\hat{S}_k)] \le 2\delta/3.$$

Thus,

$$\Pr[\hat{I}(S^*_k) \ge (1 - \epsilon_3)\mathrm{OPT}_k] \ge 1 - 2\delta/3. \quad (12)$$

Lems. 5 and 6 are sufficient to prove the approximation quality of SSA, as stated by the following theorem.

Theorem 2. Given $0 \le \epsilon, \delta \le 1$ and $\epsilon_1, \epsilon_2, \epsilon_3$ satisfying $\epsilon_1 + \epsilon_2 + \epsilon_1\epsilon_2 + (1 - 1/e)\epsilon_3 \le \epsilon$, SSA returns a seed set, $\hat{S}_k$, such that

$$\Pr[I(\hat{S}_k) \ge (1 - 1/e - \epsilon)\mathrm{OPT}_k] \ge 1 - \delta. \quad (13)$$

Proof.
To prove the theorem, we show that $|R|$ is an RIS threshold and then apply Theo. 1 to obtain the $(1 - 1/e - \epsilon)$-approximation property of SSA. In fact, we will later prove in Sec. 5.2 that $|R|$ is not just an RIS threshold but, to within a constant factor, a type-1 minimum threshold. The first condition for an RIS threshold is taken from Lem. 5,

$\Pr[\hat{I}(\hat{S}_k) \le (1 + \epsilon_1)(1 + \epsilon_2) I(\hat{S}_k)] \ge 1 - \delta/3$   (14)

The second condition is obtained from Lem. 6,

$\Pr[\hat{I}(S^*_k) \ge (1 - \epsilon_3)\, OPT_k] \ge 1 - 2\delta/3$   (15)

From Eq. 14 and Eq. 15, we conclude that $|R|$ is an RIS threshold with $\epsilon_a = (1 + \epsilon_1)(1 + \epsilon_2) - 1$, $\epsilon_b = \epsilon_3$, $\delta_a = \delta/3$ and $\delta_b = 2\delta/3$. Notice that $\epsilon_a + (1 - 1/e)\epsilon_b = \epsilon$ and $\delta_a + \delta_b = \delta$. By Theo. 1, we have

$\Pr[I(\hat{S}_k) \ge (1 - 1/e - \epsilon)\, OPT_k] \ge 1 - \delta$   (16)

which completes the proof of Theo. 2.

5.2 Achieving Type-1 Minimum Threshold

We analyze the number of RR sets generated in both the main and the Estimate-Inf procedures and show that SSA requires no more than a constant times a type-1 minimum number of RR sets. That makes SSA the first method to achieve a type-1 minimum threshold.

5.2.1 Parameter Settings

In Theo. 2, we rely on the assumption that $\epsilon_1, \epsilon_2, \epsilon_3$ are given such that

$\epsilon_1 + \epsilon_2 + \epsilon_1 \epsilon_2 + (1 - 1/e)\epsilon_3 \le \epsilon$.   (17)

Determining their values plays an important role in the algorithm. Ideally, we want to generate just enough RR sets in the main algorithm to have a good estimate of $\hat{S}_k$ and then check the influence with the Estimate-Inf procedure. The reason is that we discard all the RR sets generated in Estimate-Inf after it runs; thus, if we start checking too early, we waste a large number of RR sets. On the other hand, because of the doubling scheme in the main algorithm, if we start checking much later, we generate many unnecessary RR sets in the main procedure. In SSA, we determine the values of $\epsilon_1, \epsilon_2, \epsilon_3$ based on experiments and prove in the next subsection that SSA generates only a constant times the type-1 threshold w.r.t. these settings. In Sect. 6, we propose a dynamic algorithm, D-SSA, that tunes and finds the best values during the execution; it requires, to within a constant factor, the strongest type-2 minimum threshold of RR sets.
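Constraint (17) is easy to check numerically for any candidate split of $\epsilon$. The following is a minimal sketch, not part of the original algorithm: the function names are hypothetical, and `example_split` mirrors the experimentally chosen values reported in Eq. 18 ($\epsilon/6$, $\epsilon/2$, $\epsilon/(4(1 - 1/e))$), for which the left-hand side of Eq. 17 works out to $11\epsilon/12 + \epsilon^2/12$ and therefore stays within $\epsilon$ whenever $\epsilon \le 1$.

```python
import math


def lhs_eq17(eps1: float, eps2: float, eps3: float) -> float:
    """Left-hand side of Eq. 17: eps1 + eps2 + eps1*eps2 + (1 - 1/e)*eps3."""
    return eps1 + eps2 + eps1 * eps2 + (1.0 - 1.0 / math.e) * eps3


def satisfies_eq17(eps: float, eps1: float, eps2: float, eps3: float,
                   tol: float = 1e-12) -> bool:
    """True if (eps1, eps2, eps3) is an admissible split for precision eps.

    `tol` is a small floating-point slack (an assumption of this sketch).
    """
    return lhs_eq17(eps1, eps2, eps3) <= eps + tol


def example_split(eps: float) -> tuple:
    """Hypothetical helper: the split eps/6, eps/2, eps/(4(1 - 1/e))."""
    return eps / 6.0, eps / 2.0, eps / (4.0 * (1.0 - 1.0 / math.e))


if __name__ == "__main__":
    e1, e2, e3 = example_split(0.1)
    print(lhs_eq17(e1, e2, e3), satisfies_eq17(0.1, e1, e2, e3))
```

Any other split can be screened the same way before it is tried experimentally; the check says nothing about how many RR sets a split will cost, only that Theo. 2 applies to it.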
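We carried out experiments on various values of $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$ which satisfy Eq. 17 and found that the following values robustly give a low number of RR sets:

$\epsilon_1 = \epsilon/6, \quad \epsilon_2 = \epsilon/2, \quad \epsilon_3 = \frac{\epsilon}{4(1 - 1/e)}$   (18)

5.2.2 Number of RR sets in the main algorithm

Here, we analyze the number of RR sets generated in the main algorithm and show that this number is at most a constant times $N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$, where $\epsilon_a = \epsilon'_1 = \epsilon_1 + \epsilon_2 + \epsilon_1 \epsilon_2$, $\epsilon_b = \epsilon_3$ and $\delta_a = \delta/3$, $\delta_b = 2\delta/3$. From Subsec. 5.1,

$\Pr[\hat{I}(\hat{S}_k) \le (1 + \epsilon_a) I(\hat{S}_k)] \ge 1 - \delta_a$   (19)
$\Pr[\hat{I}(S^*_k) \ge (1 - \epsilon_b)\, OPT_k] \ge 1 - \delta_b$   (20)

are obtained by enforcing two stopping conditions:

$Cov_R(\hat{S}_k) \ge \Lambda_1$   (C1)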

$\hat{I}(\hat{S}_k) \le (1 + \epsilon_1) I_c(\hat{S}_k)$   (C2)

Now, to determine the number of RR sets generated in SSA, we start from $N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets, where Eq. 19 and Eq. 20 are satisfied, and determine how many more RR sets are needed to meet (C1) and (C2). More specifically, we prove that in the cases where the inequalities in Eq. 19 and Eq. 20 hold with $N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets, SSA needs at most $\frac{c_1}{1 - 1/e - \epsilon} N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets, where $c_1$ is a constant defined in Lem. 2. Thus, SSA will need no more than $\frac{c_1}{1 - 1/e - \epsilon} N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets with the same probability that the two inequalities in Eq. 19 and Eq. 20 hold, which is at least $1 - \delta$. The satisfaction of (C1) is stated as follows.

Lemma 7. Suppose that with $N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets, we have

$\hat{I}(\hat{S}_k) \le (1 + \epsilon_a) I(\hat{S}_k)$   (21)
$\hat{I}(S^*_k) \ge (1 - \epsilon_b)\, OPT_k$   (22)

Then SSA needs at most $\frac{c_1}{1 - 1/e - \epsilon} N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets to satisfy condition (C1).

The proof is presented in the appendix. The following lemma states that condition (C2) is also satisfied with the same number of RR sets.

Lemma 8. Given all the assumptions as in Lem. 7, SSA needs at most $\frac{c_1}{1 - 1/e - \epsilon} N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets to satisfy condition (C2) with at least $(1 - \delta/3)$-probability.

From Lem. 7, Lem. 8 and the fact that SSA stops when the two stopping conditions, i.e., (C1) and (C2), are satisfied, we have the following lemma.

Lemma 9. The main algorithm of SSA generates, to within a constant factor, $N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets with at least $(1 - \delta)$-probability.

Proof. From Lems. 7 and 8, we obtain that a number of RR sets $|R| \ge \frac{c_1}{1 - 1/e - \epsilon} N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ suffices to satisfy both (C1) and (C2) with the probability accumulated from Eq. 19, Eq. 20 and Lem. 8, that is, $1 - \delta_b - \delta/3 = 1 - \delta$ (it follows from Lem. 6 that the inequalities in Eqs. 19 and 20 are both satisfied with probability $1 - \delta_b$). Therefore, due to the doubling scheme, SSA will generate at most $\frac{2 c_1}{1 - 1/e - \epsilon} N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets, which remains a constant times $N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$.

5.2.3 Number of RR sets in Estimate-Inf procedure

As presented in Alg. 3, the number of RR sets generated in the Estimate-Inf procedure is always smaller than $\frac{1 + \epsilon_2}{1 - \epsilon_2} \cdot \frac{\epsilon_3^2}{\epsilon_2^2}$ times the number of RR sets currently in the main algorithm. Therefore, the total number of RR sets generated by Estimate-Inf during the running of SSA is smaller than $\frac{1 + \epsilon_2}{1 - \epsilon_2} \cdot \frac{\epsilon_3^2}{\epsilon_2^2}$ times the sum of the numbers of RR sets present at each iteration of the main algorithm. In turn, due to the doubling behavior, that sum is smaller than twice the number at the last iteration. Thus, based on Lem. 9, we have the following lemma.

Lemma 10. The Estimate-Inf procedure of SSA generates, to within a constant factor, $N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets with at least $(1 - \delta)$-probability.

The constant in the lemma is $\frac{1 + \epsilon_2}{1 - \epsilon_2} \cdot \frac{\epsilon_3^2}{\epsilon_2^2}$ times that in Lem. 9, which makes it

$\frac{1 + \epsilon_2}{1 - \epsilon_2} \cdot \frac{\epsilon_3^2}{\epsilon_2^2} \cdot \frac{2 c_1}{1 - 1/e - \epsilon}\, N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$   (23)

Combining Lems. 9 and 10, we conclude the overall number of RR sets with the following theorem.

Theorem 3. SSA generates, to within a constant factor, $N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$ RR sets with at least $(1 - \delta)$-probability.

The total number in this theorem is the sum of those in Lems. 9 and 10,

$\left(1 + \frac{1 + \epsilon_2}{1 - \epsilon_2} \cdot \frac{\epsilon_3^2}{\epsilon_2^2}\right) \frac{2 c_1}{1 - 1/e - \epsilon}\, N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$   (24)

6. DYNAMIC ALGORITHM

In this section, we propose the dynamic algorithm, named D-SSA, that automatically selects $\epsilon_1, \epsilon_2, \epsilon_3, \delta_1, \delta_2$ during its execution. While maintaining the $(1 - 1/e - \epsilon)$-approximate solution as in SSA, D-SSA requires, to within a constant factor, the type-2 minimum threshold. This is the strongest result over all IM methods following the RIS framework.

6.1 D-SSA Algorithm

As described in Section 4, SSA can be seen as generating two independent collections of RR sets: one in the main algorithm to find the maximum seed set, and the other for estimating the influence of the seed set found in the main procedure. Recall from Section 5 that we want to start checking the influence of $\hat{S}_k$ at the moment of having generated just enough RR sets, so that the RR sets used for checking are not wasted.
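However, detecting that moment is challenging since it depends not only on the network but also on the particular execution of generating RR sets.

Algorithm 4 D-SSA Algorithm
Input: Graph $G$, $0 \le \epsilon, \delta \le 1$, and $k$.
Output: A $(1 - 1/e - \epsilon)$-optimal solution, $\hat{S}_k$.
 1: $\Lambda \leftarrow c(1 + \epsilon) \ln\frac{2}{\delta} \cdot \frac{1}{\epsilon^2}$
 2: $R \leftarrow$ Generate $\Lambda$ random RR sets by RIS
 3: $\langle \hat{S}_k, \hat{I}(\hat{S}_k) \rangle \leftarrow$ Max-Coverage($R$, $k$)
 4: repeat
 5:   $R' \leftarrow$ Generate $|R|$ random RR sets by RIS
 6:   $I_c(\hat{S}_k) \leftarrow Cov_{R'}(\hat{S}_k) \cdot n / |R'|$
 7:   $\epsilon_1 \leftarrow \hat{I}(\hat{S}_k) / I_c(\hat{S}_k) - 1$
 8:   if ($\epsilon_1 \le \epsilon$) then
 9:     $\epsilon_2 \leftarrow \frac{\epsilon - \epsilon_1}{2(1 + \epsilon_1)}$, $\epsilon_3 \leftarrow \frac{\epsilon - \epsilon_1}{2(1 - 1/e)}$
10:     $\delta_1 \leftarrow \exp\left(-\frac{Cov_R(\hat{S}_k)\, \epsilon_3^2}{c(1 + \epsilon_1)(1 + \epsilon_2)}\right)$
11:     $\delta_2 \leftarrow \exp\left(-\frac{(Cov_{R'}(\hat{S}_k) - 1)\, \epsilon_2^2}{c(1 + \epsilon_2)}\right)$
12:     if $\delta_1 + \delta_2 \le \delta$ then
13:       return $\hat{S}_k$
14:     end if
15:   end if
16:   $R \leftarrow R \cup R'$
17:   $\langle \hat{S}_k, \hat{I}(\hat{S}_k) \rangle \leftarrow$ Max-Coverage($R$, $k$)
18: until $|R| \ge (8 + 2\epsilon)\, n\, \frac{\ln(2/\delta) + \ln\binom{n}{k}}{\epsilon^2}$
19: return $\hat{S}_k$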
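The "stare" step of Alg. 4 (Lines 7 to 12) is pure arithmetic on four observed quantities, so it can be sketched in isolation. The following is a minimal sketch under stated assumptions rather than the authors' implementation: the function and field names are hypothetical, the constant $c$ is passed in by the caller because its value is defined elsewhere in the paper, and the exact constants inside the two exponents are assumptions obtained by inverting the termination conditions in Eqs. 26 and 27.

```python
import math
from typing import NamedTuple, Optional


class StareResult(NamedTuple):
    eps1: float
    eps2: Optional[float]
    eps3: Optional[float]
    delta1: Optional[float]
    delta2: Optional[float]
    stop: bool


def stare(inf_find: float, inf_check: float,
          cov_find: float, cov_check: float,
          eps: float, delta: float, c: float) -> StareResult:
    """One stare step in the spirit of Alg. 4, Lines 7-12.

    inf_find  : influence estimate of the seed set on the finding sets R
    inf_check : influence estimate of the same seed set on the fresh sets R'
    cov_find  : number of RR sets in R covered by the seed set
    cov_check : number of RR sets in R' covered by the seed set
    c         : the paper's constant, supplied by the caller (assumption)
    """
    if inf_find <= 0 or inf_check <= 0:
        raise ValueError("influence estimates must be positive")

    # Line 7: relative gap between the two independent estimates.
    eps1 = inf_find / inf_check - 1.0
    if eps1 > eps:  # Line 8 fails: keep sampling
        return StareResult(eps1, None, None, None, None, False)

    # Line 9: split the remaining budget eps - eps1 evenly between the
    # two terms eps2*(1 + eps1) and eps3*(1 - 1/e) of Eq. 25.
    eps2 = (eps - eps1) / (2.0 * (1.0 + eps1))
    eps3 = (eps - eps1) / (2.0 * (1.0 - 1.0 / math.e))

    # Lines 10-11: failure probabilities implied by the observed coverage
    # (constants in the exponents are an assumption of this sketch).
    delta1 = math.exp(-cov_find * eps3 ** 2 / (c * (1.0 + eps1) * (1.0 + eps2)))
    delta2 = math.exp(-(cov_check - 1.0) * eps2 ** 2 / (c * (1.0 + eps2)))

    # Line 12: stop once the total failure probability is within delta.
    return StareResult(eps1, eps2, eps3, delta1, delta2, delta1 + delta2 <= delta)
```

By construction the returned values satisfy $\epsilon_1 + \epsilon_2 + \epsilon_1\epsilon_2 + (1 - 1/e)\epsilon_3 = \epsilon$ whenever the step reaches Line 9, so low coverage shows up only as large $\delta_1, \delta_2$ and a `stop` value of `False`.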

The dynamic algorithm D-SSA, described in Alg. 4, thoroughly addresses this issue by dynamically computing the values of $\epsilon_1, \epsilon_2, \epsilon_3, \delta_1, \delta_2$ during its execution, and it stops whenever the success probability meets the requirement (Line 12). D-SSA also reuses the checking RR sets for finding the seed set without affecting the independence of the RR sets in subsequent iterations (Line 16). More specifically, D-SSA uses the newly generated RR sets to estimate the influence of the seed set found in the previous iteration and obtains the current value of $\epsilon_1$ (Line 7). From the value of $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$ can be computed accordingly (Line 9). The formulas for computing $\epsilon_2$ and $\epsilon_3$ are based on the condition that

$\epsilon_1 + \epsilon_2 + \epsilon_1 \epsilon_2 + \epsilon_3 (1 - 1/e) \le \epsilon$   (25)

and on treating $\epsilon_2 + \epsilon_1 \epsilon_2$ and $\epsilon_3 (1 - 1/e)$ as having similar roles. After that, D-SSA relies on the two stopping conditions mentioned in SSA to calculate $\delta_1$ and $\delta_2$. D-SSA stops when the sum of $\delta_1$ and $\delta_2$ is less than or equal to $\delta$, which signifies that the success probability meets the requirement.

6.2 D-SSA Analysis

We will sequentially show that D-SSA achieves the $(1 - 1/e - \epsilon)$-approximation factor in Subsec. 6.2.1 and requires only, to within a constant factor, the strongest type-2 minimum threshold of RR sets in Subsec. 6.2.2.

6.2.1 Approximation Guarantee

We show that D-SSA preserves the $(1 - 1/e - \epsilon)$-approximation factor of SSA with the following theorem.

Theorem 4. Given a graph $G$, $0 \le \epsilon \le 1 - 2/e$ and $0 \le \delta \le 1$ as the inputs, D-SSA returns a $(1 - 1/e - \epsilon)$-approximate solution.

Proof. We prove that the two stopping conditions in SSA still hold in D-SSA, and thus D-SSA has the same approximation factor as SSA does. Directly from Alg. 4, when D-SSA terminates, the following conditions are satisfied,

$Cov_R(\hat{S}_k) = c(1 + \epsilon_1)(1 + \epsilon_2) \ln\frac{1}{\delta_1} \cdot \frac{1}{\epsilon_3^2}$   (26)
$Cov_{R'}(\hat{S}_k) = 1 + c(1 + \epsilon_2) \ln\frac{1}{\delta_2} \cdot \frac{1}{\epsilon_2^2}$   (27)

with $\epsilon_1 + \epsilon_2 + \epsilon_1 \epsilon_2 + (1 - 1/e)\epsilon_3 = \epsilon$ and $\delta_1 + \delta_2 = \delta$. Eq. 26 is the first stopping condition in SSA. Eq. 27 is the checking condition in the Estimate-Inf procedure in SSA; together with the setting of $\epsilon_1$, we obtain the second stopping condition in SSA,
$\hat{I}(\hat{S}_k) \le (1 + \epsilon_1)(1 + \epsilon_2) I(\hat{S}_k)$   (28)

Thus, the $(1 - 1/e - \epsilon)$-approximation factor follows from that of SSA, which completes the proof.

6.2.2 Achieving the Type-2 Minimum Threshold

Here, we prove a much stronger result than that of SSA: D-SSA requires only, to within a constant factor, the type-2 minimum number of RR sets $N^2_{\min}(\epsilon, \delta)$. Since $\epsilon_a \le \epsilon$ and $\delta_b \le \delta$, let $M$ denote the constant

M = ln δ ln 1 δ 4 (1 + ɛ a) ( 1 1 e ɛ )3 ln δ ln 1 δ 4 (1 + ɛ) ( 1 1 e ɛ )3   (29)

In other words, for a given graph $G$, we will show that D-SSA terminates when the number of RR sets in $R$ is larger than or equal to $M c_1 N^2_{\min}(\epsilon, \delta)$, where $c_1$ is a constant defined in Lem. 2, with at least $(1 - \delta)$-probability. Due to the doubling scheme, D-SSA will generate no more than twice that number. More specifically, we will prove that the stopping condition (Line 12) is satisfied. Recall that $N^2_{\min}(\epsilon, \delta)$ RR sets implies

$\Pr[\hat{I}(\hat{S}_k) \le (1 + \epsilon_a) I(\hat{S}_k)] \ge 1 - \delta_a$   (30)
$\Pr[\hat{I}(S^*_k) \ge (1 - \epsilon_b)\, OPT_k] \ge 1 - \delta_b$   (31)

where

$N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b) = \min_{\epsilon_a, \epsilon_b, \delta_a, \delta_b} N^1_{\min}(\epsilon_a, \epsilon_b, \delta_a, \delta_b)$

Applying Theo. 1, we have

$\Pr[I(\hat{S}_k) \ge (1 - 1/e - \epsilon)\, OPT_k] \ge 1 - \delta$   (32)

First, we show that the value of $\epsilon_1$ is upper-bounded by the following lemma.

Lemma 11. Given $0 \le \epsilon \le 1 - 2/e$ and $0 \le \delta \le 1$, if $|R| \ge M c_1 N^2_{\min}(\epsilon, \delta)$, then

$\Pr\left[\epsilon_1 \le \frac{\epsilon_a + \epsilon_b/2}{1 - \epsilon_b/2}\right] \ge 1 - \delta$   (33)

where $c_1$ is a constant defined in Lem. 2.

The proof is presented in the appendix. Based on the result of Lem. 11, we next prove that D-SSA will terminate when $|R| \ge M c_1 N^2_{\min}(\epsilon, \delta)$ with at least $(1 - \delta)$-probability.

Lemma 12. If $|R| \ge M c_1 N^2_{\min}(\epsilon, \delta)$ and $\epsilon_1 \le \frac{\epsilon_a + \epsilon_b/2}{1 - \epsilon_b/2}$, then

$\delta_1 + \delta_2 \le \delta$   (34)

Therefore, D-SSA needs at most a constant times the type-2 minimum with probability $\Pr\left[\epsilon_1 \le \frac{\epsilon_a + \epsilon_b/2}{1 - \epsilon_b/2}\right]$, which is at least $1 - \delta$. Since we double the RR sets every round, the actual number of RR sets is at most twice the necessary one. Thus, we conclude that D-SSA generates at most a constant times the type-2 minimum with probability of at least $1 - \delta$.

Theorem 5. Given a graph $G$, $0 \le \epsilon \le 1 - 2/e$ and $0 \le \delta \le 1$, let $N^2_{\min}(\epsilon, \delta)$ be the type-2 minimum threshold of RR sets.
D-SSA generates no more than, to within a constant factor, $N^2_{\min}(\epsilon, \delta)$ RR sets with probability of at least $(1 - \delta)$.

Theo. 5 follows directly from Lems. 11 and 12.

7. EXPERIMENTS

Backed by the strong theoretical results, we experimentally show that SSA and D-SSA outperform the existing state-of-the-art IM methods by a large margin. Specifically, SSA and D-SSA are several orders of magnitude faster than IMM and TIM+, the best existing IM methods with approximation guarantees, while having the same level of solution quality. SSA and D-SSA also require several times less memory than the other algorithms. To demonstrate the applicability of the proposed algorithms, we apply our methods to a critical application of IM, i.e., Targeted Viral Marketing (TVM) introduced in [9], and show significant improvements in performance over the existing methods.

[Figure 4: Expected Influence under LT model. Panels: (a) NetHEPT, (b) NetPHY, (c) DBLP, (d) Twitter. Vertical axis: Expected Influence.]

[Figure 5: Expected Influence under IC model. Panels: (a) NetHEPT, (b) NetPHY, (c) DBLP, (d) Twitter. Vertical axis: Expected Influence.]

Table 2: Datasets' Statistics

Dataset        #Nodes   #Edges   Avg. degree
NetHEPT        15K      59K      4.1
NetPHY         37K      181K     13.4
Enron          37K      184K     5.0
Epinions       132K     841K     13.4
DBLP           655K     2M       6.1
Orkut          3M       117M     78
Twitter [3]    41.7M    1.5G     70.5
Friendster     M        1.8G

7.1 Experimental Settings

All the experiments are run on a Linux machine with a 2.2GHz Xeon 8-core processor and 100GB of RAM. We carry out experiments under both the LT and IC models on the following algorithms and datasets.

Algorithms compared. On IM experiments, we compare SSA and D-SSA with the group of top algorithms that provide the same $(1 - 1/e - \epsilon)$-approximation guarantee. More specifically, CELF++ [33], one of the fastest greedy algorithms, and IMM [16] and TIM/TIM+ [8], the best current RIS-based algorithms, are selected. For the TVM problem, we apply our Stop-and-Stare algorithms in this context and compare with the most efficient method for the problem, termed KB-TIM, in [9].

Datasets. For experimental purposes, we choose a set of 8 datasets from various disciplines: NetHEPT, NetPHY and DBLP are citation networks, Email-Enron is a communication network, and Epinions, Orkut, Twitter and Friendster are online social networks. A summary of those datasets is given in Table 2. On the Twitter network, we also have the actual tweet/retweet dataset, and we use these data to extract the target users whose tweets/retweets are relevant to a certain set of keywords. The experiments on TVM are run on the Twitter network with the extracted targeted groups of users.

Parameter Settings. For computing the edge weights, we follow the conventional computation as in [8, 13, 4, 6]: the weight of the edge $(u, v)$ is calculated as $w(u, v) = \frac{1}{d_{in}(v)}$, where $d_{in}(v)$ denotes the in-degree of node $v$. In all the experiments, we keep $\epsilon = 0.1$ and $\delta = 1/n$ as a general setting unless explicitly stated otherwise. For the other parameters defined for particular algorithms, we take the recommended values in the corresponding papers if available. We also limit the running time of each algorithm in a run to be within 24 hours.

7.2 Experiments with IM problem

To show the superior performance of the proposed algorithms on the IM task, we ran the first set of experiments on four real-world networks, i.e., NetHEPT, NetPHY, DBLP and Twitter. We also test on a wide spectrum of values of $k$, typically from 1 to 20000, except on the NetHEPT network since it has only 15233 nodes. The solution quality, running time and memory usage are reported sequentially in the following. We also present the actual number of RR sets generated by SSA, D-SSA and IMM when testing on four other datasets, i.e., Enron, Epinions, Orkut and Friendster.

7.2.1 Solution Quality

We first compare the quality of the solutions returned by all the algorithms on the LT and IC models. The results are presented in Fig. 4 and Fig. 5, respectively. The CELF++ algorithm is only able to run on NetHEPT due to the time limit. In those figures, all the methods return seed sets of comparable quality with no significant difference.

The results also give a better view of the basic network property that a small fraction of nodes can influence a very large portion of the network. Most previous studies only find up to 50 seed nodes and thus provide a limited view of this phenomenon. Here, we see that after around 2000 nodes have been selected, the influence gain from selecting more seeds becomes very slim.

7.2.2 Running time

We next examine the performance of the tested algorithms in terms of running time. The results are shown in Fig. 6 and Fig. 7.

[Figure 6: Running time under LT model. Panels: (a) NetHEPT, (b) NetPHY, (c) DBLP, (d) Twitter.]

[Figure 7: Running time under IC model. Panels: (a) NetHEPT, (b) NetPHY, (c) DBLP, (d) Twitter.]

[Figure 8: Memory usage under LT model. Panels: (a) NetHEPT, (b) NetPHY, (c) DBLP, (d) Twitter. Vertical axis: Memory used (MB).]

[Figure 9: Memory usage under IC model. Panels: (a) NetHEPT, (b) NetPHY, (c) DBLP, (d) Twitter. Vertical axis: Memory used (MB).]

[Table 3: Across-dataset view of the performance of SSA, D-SSA and IMM on various datasets under the LT model. Columns: running time (s) and number of RR sets (thousands), each for k = 1, k = 500 and k = 1000 and for SSA, D-SSA and IMM. Rows: Enron, Epinions, Orkut, Friendster.]

Both SSA and D-SSA significantly outperform the other competitors by a huge margin. Compared to IMM, the best known algorithm, SSA and D-SSA run up to several orders of magnitude faster. TIM+ and IMM show similar running times since they operate on the same philosophy of estimating the optimal influence first and then calculating the number of samples necessary to guarantee the approximation for all possible seed sets. However, each of the two steps displays its own weakness. In contrast, SSA and D-SSA follow the Stop-and-Stare mechanism to thoroughly address those weaknesses and thus exhibit remarkable improvements. Comparing SSA and D-SSA, since D-SSA attains the type-2 minimum threshold, compared to the weaker type-1 threshold of SSA with the same precision settings $\epsilon, \delta$, D-SSA achieves considerably better performance, with up to an order of magnitude speedup.
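The edge-weight rule from the parameter settings, $w(u, v) = 1/d_{in}(v)$, can be sketched in a few lines. This is an illustrative sketch only: the function name and the edge-list input format are assumptions, not part of the experimental code used in the paper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Edge = Tuple[int, int]


def in_degree_weights(edges: Iterable[Edge]) -> Dict[Edge, float]:
    """Assign w(u, v) = 1 / d_in(v) to every directed edge (u, v).

    Duplicate edges are collapsed first, so d_in(v) counts distinct
    in-neighbors of v (an assumption of this sketch).
    """
    unique_edges = set(edges)
    in_degree = Counter(v for _, v in unique_edges)
    return {(u, v): 1.0 / in_degree[v] for (u, v) in unique_edges}


if __name__ == "__main__":
    toy_edges = [(1, 3), (2, 3), (1, 2)]
    for edge, weight in sorted(in_degree_weights(toy_edges).items()):
        print(edge, weight)
```

With this rule the incoming weights of every node sum to exactly 1, which satisfies the LT model's requirement that incoming weights sum to at most 1.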
