CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University



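2 Influence maximization: find the most influential set S of size k, i.e. the set with the largest expected cascade size f(S) when S is activated.
Setting: a network in which each edge activates with probability p. Edges are activated by coin flipping, and each outcome of the coin flips is one realization i out of a set I of realizations.
Want to solve: $\max_{|S|=k} f(S)$, where $f(S) = \frac{1}{|I|} \sum_{i \in I} f_i(S)$ and $f_i(S)$ is the cascade size from S in realization i.
Example: consider S = {a, d} and three realizations with $f_1(S) = 5$, $f_2(S) = 4$, $f_3(S) = 3$. Then $f(S) = 4$.
[Figure: network on nodes a to f, drawn for three realizations with the influence set of S highlighted.]
10/26/2011, Jure Leskovec, Stanford CS224W: Social and Information Network Analysis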
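The example value on this slide can be checked by hand. With the three realizations shown, $|I| = 3$, and the expected cascade size is the plain average:

```latex
f(S) = \frac{1}{|I|} \sum_{i \in I} f_i(S)
     = \frac{f_1(S) + f_2(S) + f_3(S)}{3}
     = \frac{5 + 4 + 3}{3}
     = 4
```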

3 Algorithm: Hill Climbing
At step i, pick the node u that maximizes $f(S_{i-1} \cup \{u\})$.
Theorem: hill climbing produces a solution S with $f(S) \ge (1 - 1/e) \cdot OPT$.
The claim holds for functions f with two properties:
f is monotone: if $S \subseteq T$ then $f(S) \le f(T)$, and $f(\{\}) = 0$.
f is submodular: for all $S \subseteq T$, $f(S \cup \{u\}) - f(S) \ge f(T \cup \{u\}) - f(T)$.
Our f(S) is submodular! Why?
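A minimal Python sketch of the hill-climbing step described on this slide. The set function `coverage` and the data in `AREAS` are hypothetical toy inputs chosen for illustration, not anything from the lecture; any monotone submodular f could be passed in instead.

```python
def hill_climbing(candidates, f, k):
    """Greedy selection: k times, add the element with the largest marginal gain."""
    chosen = []
    for _ in range(k):
        current = f(frozenset(chosen))
        best_u, best_gain = None, float("-inf")
        for u in candidates:
            if u in chosen:
                continue
            gain = f(frozenset(chosen) | {u}) - current
            if gain > best_gain:  # strict: ties go to the earlier candidate
                best_u, best_gain = u, gain
        if best_u is None:  # fewer than k candidates available
            break
        chosen.append(best_u)
    return chosen


# Hypothetical toy data: each node "covers" a set of items.
AREAS = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1, 2}}


def coverage(S):
    """f(S) = number of items covered by the nodes in S."""
    return len(set().union(*(AREAS[u] for u in S)))


picks = hill_climbing(list(AREAS), coverage, 2)
print(picks, coverage(picks))  # ['a', 'c'] 6
```

Each round re-evaluates the gain of every remaining candidate, which is what makes the naive version expensive on large inputs.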

4 Submodularity: $f(S \cup \{u\}) - f(S) \ge f(T \cup \{u\}) - f(T)$ for $S \subseteq T$.
Basic fact 1: if $f_1(x), \dots, f_k(x)$ are submodular and $c_1, \dots, c_k \ge 0$, then $F(x) = \sum_i c_i f_i(x)$ is also submodular.
Basic fact 2: a simple submodular function. Given sets $A_1, \dots, A_m$, let $f(S) = \left| \bigcup_{i \in S} A_i \right|$ (the size of the union of the sets $A_i$, $i \in S$).
The more sets you have already chosen, the less new area a new set u will cover.
[Figure: overlapping sets illustrating S, T with $S \subseteq T$, and a new set u.]
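Basic fact 2 can be checked by brute force on a small instance. The sketch below tests the submodularity inequality for every pair $S \subseteq T$ and every u outside T; the sets in `SETS` are made-up illustrative data, and the exhaustive check is only feasible for tiny ground sets.

```python
from itertools import combinations

# Hypothetical sets A_1, A_2, A_3.
SETS = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"z", "w", "v"}}


def union_size(S):
    """f(S) = size of the union of A_i for i in S."""
    return len(set().union(*(SETS[i] for i in S)))


def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(f, ground):
    """Check f(S+u) - f(S) >= f(T+u) - f(T) for all S <= T and u not in T."""
    for T in subsets(ground):
        for S in subsets(T):
            for u in ground - T:
                if f(S | {u}) - f(S) < f(T | {u}) - f(T):
                    return False
    return True


ground = frozenset(SETS)
print(is_submodular(union_size, ground))             # True
print(is_submodular(lambda S: len(S) ** 2, ground))  # False: gains grow with |S|
```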

5 Activate edges by coin flipping, and fix an outcome i of the coin flips.
$f_i(u)$ = influence set of u: the set of nodes reachable from u on live-edge paths.
$f_i(S)$ = size of the cascade from S under coin flips i: $f_i(S) = \left| \bigcup_{u \in S} f_i(u) \right|$.
$f_i(S)$ is submodular: the $f_i(v)$ are sets, and $f_i(S)$ is the size of their union.
Expected influence set size: $f(S) = \frac{1}{|I|} \sum_{i \in I} f_i(S)$.
f(S) is submodular: it is a linear combination of submodular functions with nonnegative coefficients.
[Figure: network on nodes a to f with the influence sets $f_i(a)$ and $f_i(d)$ marked.]
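The live-edge view on this slide translates into a short simulation: flip a coin for each edge, keep the live ones, and count the nodes reachable from S. This is a sketch under stated assumptions: the directed edge list `EDGES` is a hypothetical toy graph (not the one drawn on the slide), and `expected_influence` is a plain Monte Carlo average over sampled realizations.

```python
import random
from collections import deque

# Hypothetical directed toy graph.
EDGES = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f")]


def sample_live_edges(edges, p, rng):
    """One realization i: each edge is live independently with probability p."""
    return [(u, v) for (u, v) in edges if rng.random() < p]


def reachable(sources, live_edges):
    """Nodes reachable from any source along live edges (breadth-first search)."""
    adj = {}
    for u, v in live_edges:
        adj.setdefault(u, []).append(v)
    seen = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def expected_influence(S, edges, p, runs, seed=0):
    """Estimate f(S) as the average of f_i(S) over sampled realizations."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        total += len(reachable(S, sample_live_edges(edges, p, rng)))
    return total / runs


# Sanity checks at the extremes: all edges live, and no edges live.
print(expected_influence({"a", "d"}, EDGES, 1.0, runs=5))  # 6.0
print(expected_influence({"a", "d"}, EDGES, 0.0, runs=5))  # 2.0
```

For a fixed realization, `reachable(S, live)` equals the union of `reachable({u}, live)` over u in S, which is the union structure the slide uses to argue submodularity.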

7 Claim: if f(S) is monotone and submodular, hill climbing produces a solution S with $f(S) \ge (1 - 1/e) \cdot OPT$, i.e. $f(S) > 0.63 \cdot OPT$.
Setting: keep adding the node that gives the largest gain.
Start with $S_0 = \{\}$ and produce sets $S_1, S_2, \dots, S_k$, adding elements one by one.
Marginal gain: $\delta_i = f(S_i) - f(S_{i-1})$.
Let $T = \{t_1, \dots, t_k\}$ be the optimal set of size k.
We need to show: $f(S) \ge (1 - 1/e) f(T)$.

8 For $B = \{b_1, \dots, b_k\}$ and submodular f:
$f(A \cup B) - f(A) \le \sum_{j=1}^{k} [f(A \cup \{b_j\}) - f(A)]$
Proof: let $B_i = \{b_1, \dots, b_i\}$, so we have $B_0 = \{\}, B_1, B_2, \dots, B_k = B$. Then
$f(A \cup B) - f(A) = \sum_{i=1}^{k} [f(A \cup B_i) - f(A \cup B_{i-1})]$
$= \sum_{i=1}^{k} [f(A \cup B_{i-1} \cup \{b_i\}) - f(A \cup B_{i-1})]$
$\le \sum_{i=1}^{k} [f(A \cup \{b_i\}) - f(A)]$
The inequality holds by submodularity, since $A \cup X \cup \{b\} \supseteq A \cup \{b\}$.
The first equality is a telescoping sum. Work out the sum and everything but the first and last term cancels:
$[f(A \cup B_1) - f(A \cup B_0)] + [f(A \cup B_2) - f(A \cup B_1)] + \dots + [f(A \cup B_k) - f(A \cup B_{k-1})]$

9 Remember: $\delta_i = f(S_i) - f(S_{i-1})$ and $T = \{t_1, \dots, t_k\}$.
$f(T) \le f(S_i \cup T)$ (by monotonicity)
$= f(S_i \cup T) - f(S_i) + f(S_i)$
$\le \sum_{j=1}^{k} [f(S_i \cup \{t_j\}) - f(S_i)] + f(S_i)$ (by the previous slide)
$\le \sum_{j=1}^{k} \delta_{i+1} + f(S_i)$
$= f(S_i) + k \, \delta_{i+1}$
The last inequality: each $t_j$ is one choice of a next element. We greedily choose the best one, for a gain of $\delta_{i+1}$. This is the hill-climbing assumption.
Thus: $f(T) \le f(S_i) + k \, \delta_{i+1}$, which gives $\delta_{i+1} \ge \frac{1}{k} [f(T) - f(S_i)]$.

10 We just showed: $\delta_{i+1} \ge \frac{1}{k} [f(T) - f(S_i)]$.
What is $f(S_{i+1})$?
$f(S_{i+1}) = f(S_i) + \delta_{i+1} \ge f(S_i) + \frac{1}{k} [f(T) - f(S_i)] = \left(1 - \frac{1}{k}\right) f(S_i) + \frac{1}{k} f(T)$
What is $f(S_k)$?

11 Claim: $f(S_i) \ge \left[1 - \left(1 - \frac{1}{k}\right)^{i}\right] f(T)$
Proof by induction. Base case i = 0:
$f(S_0) = f(\{\}) = 0$ and $\left[1 - \left(1 - \frac{1}{k}\right)^{0}\right] f(T) = 0$.

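12 Claim: $f(S_i) \ge \left[1 - \left(1 - \frac{1}{k}\right)^{i}\right] f(T)$
Proof by induction. At i + 1:
$f(S_{i+1}) \ge \left(1 - \frac{1}{k}\right) f(S_i) + \frac{1}{k} f(T)$
$\ge \left(1 - \frac{1}{k}\right) \left[1 - \left(1 - \frac{1}{k}\right)^{i}\right] f(T) + \frac{1}{k} f(T)$
$= \left[1 - \left(1 - \frac{1}{k}\right)^{i+1}\right] f(T)$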

13 Thus: $f(S) = f(S_k) \ge \left[1 - \left(1 - \frac{1}{k}\right)^{k}\right] f(T)$.
Since $\left(1 - \frac{1}{k}\right)^{k} \le \frac{1}{e}$, it follows that $f(S_k) \ge \left(1 - \frac{1}{e}\right) f(T)$. qed.
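The final step relies on $(1 - 1/k)^k \le 1/e$. One standard way to see it, added here as a supporting step rather than taken from the slides, uses the inequality $1 - x \le e^{-x}$ with $x = 1/k$ (valid for $k \ge 1$, so that $1 - 1/k \ge 0$):

```latex
1 - x \le e^{-x} \ \text{for all real } x
\;\Longrightarrow\;
\left(1 - \tfrac{1}{k}\right)^{k} \le \left(e^{-1/k}\right)^{k} = e^{-1}
\;\Longrightarrow\;
f(S_k) \ge \left[1 - \left(1 - \tfrac{1}{k}\right)^{k}\right] f(T)
       \ge \left(1 - \tfrac{1}{e}\right) f(T)
```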

14 We just proved: hill climbing finds a solution S with $f(S) \ge (1 - 1/e) \cdot OPT$, i.e. $f(S) \ge 0.63 \cdot OPT$.
This is a data-independent bound. It is a worst-case bound: no matter what the input data (influence sets) is, we know that hill climbing won't do worse than $0.63 \cdot OPT$.
Data-dependent bound: the value of the bound depends on the input data. On easy data, hill climbing may do better than 63%.

15 Suppose S is some solution to $\max f(S)$ s.t. $|S| \le k$, where f(S) is monotone and submodular.
Let $T = \{t_1, \dots, t_k\}$ be the OPT solution.
For each $u \notin S$ let $\delta_u = f(S \cup \{u\}) - f(S)$.
Order the $\delta_u$ so that $\delta_1 \ge \delta_2 \ge \dots \ge \delta_n$.
Then: $f(T) \le f(S) + \sum_{i=1}^{k} \delta_i$
Note:
This is a data-dependent bound (the $\delta_u$ depend on the input data).
The bound holds for any algorithm: it makes no assumption about how S is computed.
For some inputs the bound can be loose (worse than 63%).

16 For each $u \notin S$ let $\delta_u = f(S \cup \{u\}) - f(S)$, and order the $\delta_u$ so that $\delta_1 \ge \delta_2 \ge \dots \ge \delta_n$.
Then: $f(T) \le f(S) + \sum_{i=1}^{k} \delta_i$
Proof:
$f(T) \le f(T \cup S)$
$= f(S) + \sum_{i=1}^{k} [f(S \cup \{t_1, \dots, t_i\}) - f(S \cup \{t_1, \dots, t_{i-1}\})]$
$\le f(S) + \sum_{i=1}^{k} [f(S \cup \{t_i\}) - f(S)]$
$= f(S) + \sum_{i=1}^{k} \delta_{t_i}$
$\le f(S) + \sum_{i=1}^{k} \delta_i$
Last step: instead of adding $t_i \in T$ (of benefit $\delta_{t_i}$), we add the best possible element ($\delta_i$).
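The data-dependent bound is cheap to compute once a solution S is in hand: evaluate the marginal gain of every element outside S, sort, and add the top k gains to f(S). The sketch below assumes a hypothetical toy coverage function (`AREAS`, `coverage`) as the monotone submodular f; the helper name `data_dependent_bound` is illustrative.

```python
# Hypothetical toy data: each node "covers" a set of items.
AREAS = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1, 2}}


def coverage(S):
    """f(S) = number of items covered by the nodes in S."""
    return len(set().union(*(AREAS[u] for u in S)))


def data_dependent_bound(S, ground, f, k):
    """Upper bound on OPT: f(S) plus the k largest marginal gains delta_u, u not in S."""
    S = frozenset(S)
    base = f(S)
    gains = sorted((f(S | {u}) - base for u in ground if u not in S), reverse=True)
    return base + sum(gains[:k])


# A weak solution gives a loose bound; a good one gives a tight bound.
print(data_dependent_bound({"b"}, AREAS, coverage, k=2))       # 7
print(data_dependent_bound({"a", "c"}, AREAS, coverage, k=2))  # 6
```

In this toy instance the best pair covers all 6 items, so the second bound certifies that {a, c} is optimal, while the first only says OPT is at most 7.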


18 What do we know about optimizing submodular functions?
Hill climbing (add the node with the highest marginal gain) is near optimal: $(1 - 1/e)$ (~63%) of OPT.
But the hill-climbing algorithm is slow: at each iteration we need to re-evaluate the marginal gains. It scales as $O(n \cdot k)$.
[Figure: bar chart of the reward (marginal gain) of nodes a, b, c, d, e.]

19 [Leskovec et al., KDD 07] In round i+1:
So far we have picked $S_i = \{s_1, \dots, s_i\}$.
Now pick $s_{i+1} = \arg\max_u [f(S_i \cup \{u\}) - f(S_i)]$, i.e. maximize the marginal benefit $\delta_u(S_i) = f(S_i \cup \{u\}) - f(S_i)$.
By the submodularity property: $f(S_i \cup \{u\}) - f(S_i) \ge f(S_j \cup \{u\}) - f(S_j)$ for $i < j$.
Observation: by submodularity, for any u, $\delta_u(S_i) \ge \delta_u(S_j)$ for $i \le j$, since $S_i \subseteq S_j$.
Marginal benefits $\delta_u$ only shrink! Activating node u in step i helps more than activating it at step j (j > i).

20 [Leskovec et al., KDD 07] Idea: use $\delta_i$ as an upper bound on $\delta_j$ (j > i).
Lazy hill-climbing:
Keep an ordered list of the marginal benefits $\delta_i$ from the previous iteration.
Re-evaluate $\delta_i$ only for the top node.
Re-sort and prune.
This works because $f(S \cup \{u\}) - f(S) \ge f(T \cup \{u\}) - f(T)$ for $S \subseteq T$.
[Figure: marginal gains of nodes a, b, c, d, e, with $S_1 = \{a\}$.]


22 [Leskovec et al., KDD 07] Lazy hill-climbing, next step: keep the ordered list of marginal benefits $\delta_i$ from the previous iteration, re-evaluate $\delta_i$ only for the top node, then re-sort and prune.
[Figure: marginal gains re-sorted in the order a, d, b, e, c, with $S_1 = \{a\}$ and $S_2 = \{a, b\}$.]
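The lazy evaluation idea can be sketched with a priority queue of possibly stale marginal gains. This is an illustrative Python sketch, not the implementation from the cited paper: the toy coverage function and the `evaluations` counter are assumptions added to make the saving visible. A popped entry whose gain was computed against the current solution is a true maximum, because every other entry is an upper bound on its current gain.

```python
import heapq

# Hypothetical toy data: each node "covers" a set of items.
AREAS = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {5, 6}, "d": {1, 2}}


def coverage(S):
    """f(S) = number of items covered by the nodes in S."""
    return len(set().union(*(AREAS[u] for u in S)))


def lazy_hill_climbing(candidates, f, k):
    """Greedy selection that re-evaluates a gain only when it reaches the top."""
    chosen = []
    current = f(frozenset())
    evaluations = 0
    # Heap entries: (-gain, tie_breaker, node, |chosen| when the gain was computed).
    heap = []
    for idx, u in enumerate(candidates):
        heap.append((-(f(frozenset({u})) - current), idx, u, 0))
        evaluations += 1
    heapq.heapify(heap)
    while heap and len(chosen) < k:
        neg_gain, idx, u, stamp = heapq.heappop(heap)
        if stamp == len(chosen):
            # Gain is up to date and beats all upper bounds below it: take u.
            chosen.append(u)
            current += -neg_gain
        else:
            # Stale gain: recompute against the current solution and re-insert.
            gain = f(frozenset(chosen) | {u}) - current
            evaluations += 1
            heapq.heappush(heap, (-gain, idx, u, len(chosen)))
    return chosen, evaluations


picks, evals = lazy_hill_climbing(list(AREAS), coverage, 2)
print(picks, evals)  # ['a', 'c'] 6
```

On this toy input the lazy version makes 6 marginal-gain evaluations where re-evaluating every remaining node in each round would make 7; node d is never re-evaluated because its stale upper bound never reaches the top alone.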

24 [Leskovec et al., KDD 07] Given a real city water distribution network, and data on how contaminants spread in the network, detect the contaminant as quickly as possible.
Problem posed by the US Environmental Protection Agency.


25 Utility of placing sensors: given water flow dynamics, demands of households, etc., compute the utility F(A) for each subset $A \subseteq V$, where V is the set of all network junctions.
A sensor reduces impact through early detection!
[Figure: two sensor placements on the network, with high, medium and low impact outbreak locations marked. One placement has high sensing quality, F(A) = 0.9; the other has low sensing quality, F(A) = 0.01.]

26 [Leskovec et al., KDD 07] Given a graph G(V, E) and data on how outbreaks spread over the network: for each outbreak i we know the time T(i, u) when outbreak i contaminated node u.
Goal: select a subset of nodes S that maximizes the expected reward:
$\max_{S \subseteq V} f(S) = \sum_i P(i) \, f_i(S)$
where P(i) is the probability of outbreak i and $f_i(S)$ is the expected reward for detecting outbreak i.
Reward: raise the alarm and save the most people.
[Figure: outbreak i spreading over the network. Monitoring the blue node saves more people than monitoring the green node.]

27 [Leskovec et al., KDD 07] Observation: diminishing returns.
Placement $S = \{s_1, s_2\}$: adding a new sensor s' helps a lot.
Placement $S = \{s_1, s_2, s_3, s_4\}$: adding s' helps very little.
[Figure: the two placements with the new sensor s' added to each.]

28 [Leskovec et al., KDD 07] Claim: the reward function is submodular.
Consider outbreak i:
$f_i(u_k)$ = set of nodes saved from $u_k$.
$f_i(S)$ = size of the union of the $f_i(u_k)$, $u_k \in S$.
So $f_i$ is submodular.
Global optimization: $f(S) = \sum_i P(i) \, f_i(S)$, so f(S) is submodular.
[Figure: outbreak i with sensors $u_1$, $u_2$ and the regions $R_i(u_1)$, $R_i(u_2)$ they save.]
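A small Python sketch of this reward function, under loudly stated assumptions: the detection-time table `T`, the outbreak probabilities `P`, and the rule that a sensor at u "saves" every node contaminated strictly after time T(i, u) are all hypothetical choices made for illustration. The slides do not specify how the saved set is computed.

```python
# Hypothetical data: T[i][u] = time at which outbreak i contaminates node u.
# A node missing from T[i] is never reached by outbreak i.
T = {
    "outbreak_1": {"u1": 1, "u2": 2, "u3": 3, "u4": 4},
    "outbreak_2": {"u3": 1, "u4": 2},
}
# Hypothetical outbreak probabilities P(i).
P = {"outbreak_1": 0.5, "outbreak_2": 0.5}


def saved_nodes(i, u):
    """Assumed rule: a sensor at u saves nodes contaminated strictly after T(i, u)."""
    if u not in T[i]:
        return set()
    t_detect = T[i][u]
    return {v for v, t in T[i].items() if t > t_detect}


def reward(S):
    """f(S) = sum_i P(i) * |union of saved_nodes(i, u) for u in S|."""
    total = 0.0
    for i, prob in P.items():
        saved = set().union(*(saved_nodes(i, u) for u in S))
        total += prob * len(saved)
    return total


print(reward({"u1"}))        # 1.5
print(reward({"u1", "u3"}))  # 2.0
# Diminishing returns: u2 is worth 1.0 alone but adds nothing once u1 is placed.
print(reward({"u2"}) - reward(set()))          # 1.0
print(reward({"u1", "u2"}) - reward({"u1"}))   # 0.0
```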

29 [Leskovec et al., KDD 07] Real metropolitan area water network:
V = 21,000 nodes
E = 25,000 pipes
Use a cluster of 50 machines for a month.
Simulate 3.6 million epidemic scenarios (152 GB of epidemic data).
By exploiting sparsity we fit it into main memory (16 GB).

30 [Figure: solution quality F(A) (higher is better) versus number of sensors placed. Curves: offline (1-1/e) bound, data-dependent bound, hill climbing.]
The data-dependent bound is much tighter (it gives a more accurate estimate of the algorithm's performance).

31 [Leskovec et al., KDD 07] Placement heuristics perform much worse.

32 Two equivalent questions:
I have 10 minutes. Which blogs should I read to be most up to date?
Who are the most influential bloggers?

33 Want to read things before others do.
One choice: detect the blue and yellow stories soon, but miss the red one.
Another choice: detect all stories, but late.

34 [Leskovec et al., KDD 07] The online (data-dependent) bound is much tighter: 87% instead of 63%.
[Figure: curves for the (1-1/e) bound, the data-dependent bound, and hill climbing.]

35 [Leskovec et al., KDD 07] Heuristics perform much worse.

36 [Leskovec et al., KDD 07] Lazy evaluation runs 700 times faster than the naïve hill-climbing algorithm.
[Figure: running time of naïve hill climbing versus lazy hill climbing.]


CS224W: Social and Information Network Analysis, Jure Leskovec, Stanford University, http://cs224w.stanford.edu