CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Size: px

Start display at page:

Download "CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University"

William Boone
6 years ago
Views:

1 CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

2 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 2 (1) New problem: Outbreak detection (2) Develop an approximation algorithm It is a submodular opt. problem! (3) Speed-up greedy hill-climbing Valid for optimizing general submodular functions (i.e., also works for influence maximization) (4) Prove a new data dependent bound on the solution quality Valid for optimizing general submodular functions (i.e., also works for influence maximization)

3 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 3 Given a real city water distribution network And data on how contaminants spread in the network Detect the contaminant as quickly as possible Problem posed by the US Environmental Protection Agency S

4 10/24/2012 [Leskovec et al., KDD 07] Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, Part 2-4 Posts Blogs Information cascade Time ordered hyperlinks Which blogs should one read to detect cascades as effectively as possible?

5 [Leskovec et al., KDD 07] Want to read things before others do. Detect blue & yellow soon but miss red. Detect all stories but late. 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 5

6 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 6 Both of these two are an instance of the same underlying problem! Given a dynamic process spreading over a network We want to select a set of nodes to detect the process effectively Many other applications: Epidemics Influence propagation Network security

Utility of placing sensors: Water flow dynamics, demands of households, For each subset S V compute utility f(s) High impact Contamination outbreak S3 S1 S2 Low impact outbreak Medium impact outbreak

7 Utility of placing sensors: Water flow dynamics, demands of households, For each subset S V compute utility f(s) High impact Contamination outbreak S3 S1 S2 Low impact outbreak Medium impact outbreak S3 S1 S4 Set V of all network junctions Sensor reduces impact through early detection! S1 S4 S2 High sensing quality f(s) = 0.9 Low sensing quality f(s)= /24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 7

8 [Leskovec et al., KDD 07] Given: Graph G(V, E) Data on how outbreaks spread over the G: For each outbreak i we know the time T(i, u) when outbreak i contaminates node u Water distribution network (physical pipes and junctions) Simulator of water consumption&flow (built by Mech Eng. people) We simulate the contamination spread for every possible location. 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 8

9 [Leskovec et al., KDD 07] Given: Graph G(V, E) Data on how outbreaks spread over the G: For each outbreak i we know the time T(i, u) when outbreak i contaminates node u a b c b a c The network of the blogosphere Traces of the information flow Collect lots of blogs posts and trace hyperlinks to obtain data about information flow from a given blog. 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 9

10 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 10 Given: Graph G(V, E) Data on how outbreaks spread over the G: For each outbreak i we know the time T(i, u) when outbreak i contaminates node u Goal: Select a subset of nodes S that maximize the expected reward: max S V f S = P i f i S subject to: cost(s) < B i Expected reward for detecting outbreak i f i S is penalty reduction: f i S = π i π i (S)

11 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 11 Reward (1) Minimize time to detection (2) Maximize number of detected propagations (3) Minimize number of infected people Cost (node/location dependent): Reading big blogs is more time consuming Placing a sensor in a remote location is expensive outbreak i Monitoring blue node saves more people than monitoring the green node f(s)

12 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 12 Objective functions: 1) Time to detection (DT) How long does it take to detect a contamination? Penalty: π i (t) = min {t, T mmm } 2) Detection likelihood (DL) How many contaminations do we detect? We incur penalty if we don t detect: π i (t) = 0, π i ( ) = 1 3) Population affected (PA) How many people drank contaminated water? π i (t) = {# of blogs in cascade i at time t}. Observation: In all cases detecting sooner does not hurt!

13 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 13 Observation: Diminishing returns S 1 New sensor: S s S 1 S 3 S 2 S 2 S 4 Placement S={s 1, s 2 } Adding s helps a lot Placement S ={s 1, s 2, s 3, s 4 } Adding s helps very little

14 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 14 Claim: For all A B V and sensors s V\B f A s f A f B s f B Proof: Fix cascade i Show f i A = π i π i (T(A, i)) is submodular Consider A B V and sensor s V\B When does node s detect cascade i? 3 Cases: (1) T s, i T(A, i) then f i A s = f i A, f i B s = f i B and so f i A s f i A = 0 = f i B s f i B Since s detects too late, nobody benefits

15 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 15 Proof (contd.): 3 Cases: (2) T B, i T s, i < T(A, i) then f i A b f i A 0 = f i B s f i B s detects sooner than any node in A but after all in B. So u only helps improve the solution A. (3) T s, i < T(B, i) then f i A s f i A = π i π i T s, i f i (A) π i π i T s, i f i (B) = f i B s f i B Ineqaulity is due to non-decreasingness of f i ( ), i.e., f i A f i (B) So, f i ( ) is submodular! So, f( ) is also submodular f S = P i f i S i

16 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, Part 2-16 a b c d e Hill-climbing reward b Add sensor with highest marginal gain c d a e What do we know about optimizing submodular functions? A hill-climbing (i.e., greedy) is near optimal (1 1 e OOO) But: (1) This only works for unit cost case! (each sensor costs the same) For use each sensor s has cost c(s) (2) Hill-climbing algorithm is slow At each iteration we need to re-evaluate marginal gains of all nodes Runtime O( V K) for placing K sensors

17 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 17 Consider: Hill-climbing that ignores cost Ignore sensor cost Repeatedly select sensor with highest marginal gain Do this until the budget is exhausted How well does this work?

18 [Leskovec et al., KDD 07] Bad example: n sensors, budget B s 1 : reward r, cost B s 2 s n : reward r ε, cost 1 Hill-climbing always prefers more expensive sensor s 1 with reward r (and exhausts the budget) It never selects cheaper sensors with reward r ε For variable cost it can fail arbitrarily badly! Idea: What if we optimize benefit-cost ratio? s i = arg max s V f A i 1 {s} f(a i 1 ) c s Greedily pick sensor s i that maximizes benefit to cost ratio. 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 18

19 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 19 Benefit-cost ratio can also fail arbitrarily badly! Consider: budget B: 2 sensors s 1 and s 2 : Costs: c(s 1 ) = ε, c(s 2 ) = B Only 1 cascade: f(s 1 ) = 2ε, f(s 2 ) = B Then benefit-cost ratio is b c(s 1 ) = 2 and b c(s 2 ) = 1 So, we first select s 1 and then can not afford s 2 We get reward 2ε instead of B. Now send ε 0 and we get arbitrarily bad solution!

20 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 20 CELF (cost-effective lazy forward-selection) A two pass greedy algorithm: Set (solution) SS: Use benefit-cost greedy Set (solution) SSS: Use unit-cost greedy Final solution: S = arg max (f(s ), f(s )) How far is CELF from (unknown) optimal solution? Theorem: CELF is near optimal CELF achieves ½(1-1/e) factor approximation!

21 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, Part 2-21 a b c d e Hill-climbing reward b Add sensor with highest marginal gain c d a e What do we know about optimizing submodular functions? A hill-climbing (i.e., greedy) is near optimal (1 1 OOO) e But: (2) Hill-climbing algorithm is slow! At each iteration we need to reevaluate marginal gains of all nodes Runtime O( V K) for placing K sensors

22 [Leskovec et al., KDD 07] In round i + 1: So far we picked S i = {s 1,, s i } Now pick s i = arg max u f(s i {u}) f(s i ) This our old friend greedy hill-climbing algorithm. It maximizes the marginal benefit δ u (S i ) = f(s i {u}) f(s i ) By submodularity property: f S i u f S i f S j u f S j for i < j Observation: By submodularity: For every u δ u (S i ) δ u (S j ) for i j since S S i j δ u (S i ) δ u (S j ) Marginal benefits δ u only shrink! u (as S grows) Activating node u in step i helps more than activating it at step j (j>i) 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 22

23 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 23 Idea: Use δ i as upper-bound on δ j (j>i) Lazy hill-climbing: Keep an ordered list of marginal benefits δ i from previous iteration Re-evaluate δ i only for top node Re-sort and prune Marginal gain a b c d e S 1 ={a} f(s {u}) f(s) f(t {u}) f(t) S T

24 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 24 Idea: Use δ i as upper-bound on δ j (j>i) Lazy hill-climbing: Keep an ordered list of marginal benefits δ i from previous iteration Re-evaluate δ i only for top node Re-sort and prune Marginal gain a b c d e S 1 ={a} f(s {u}) f(s) f(t {u}) f(t) S T

25 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 25 Idea: Use δ i as upper-bound on δ j (j>i) Lazy hill-climbing: Keep an ordered list of marginal benefits δ i from previous iteration Re-evaluate δ i only for top node Re-sort and prune Marginal gain a d b e c S 1 ={a} S 2 ={a,b} f(s {u}) f(s) f(t {u}) f(t) S T

26 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 26 Back to the solution quality! The (1-1/e) bound for submodular functions is the worst case bound (worst over all possible inputs) Data dependent bound: Value of the bound depends on the input data On easy data, hill climbing may do better than 63% Can we say something about the solution quality when we know the input data?

27 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 27 Suppose S is some solution to f(s) s.t. S k f(s) is monotone & submodular Let T = {t 1,, t k } be the OPT solution For each u S let δ u = f(s {u}) f(s) Order δ u so that δ δ 1 2 δ n S k i=1 δ i Note: This is a data dependent bound (δ u depend on input data) Bound holds for any algorithm Makes no assumption about how S is computed For some inputs it can be very loose (worse than 63%) Then: f T f S +

28 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 28 Claim: For each u S let δ u = f S u f S Order δ u so that δ 1 δ 2 δ n S Then: f T f S + Proof: k i=1 δ i f T f T S = f S + k f S t 1 t i f S t 1 t i 1 i=1 k i=1 k i=1 f S + f S t i f S = f S + k δ ti Instead of taking t i T (of benefit δ ti ), we take the best possible element (δ i ) f S + i=1 δ i f T f S + i=1 δ i k (we proved this last time)

29 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 29 Real metropolitan area network V = 21,000 nodes E = 25,000 pipes water Use a cluster of 50 machines for a month Simulate 3.6 million epidemic scenarios (152 GB of epidemic data) By exploiting sparsity we fit it into main memory (16GB)

30 Solution quality F(A) Higher is better Offline the (1-1/e) bound Data-dependent bound Hill Climbing Number of sensors placed Data-dependent bound is much tighter (gives more accurate estimate of alg. performance) 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 30

31 [w/ Ostfeld et al., J. of Water Resource Planning] Author Score Placement heuristics perform much worse CELF 26 Sandia 21 U Exter 20 Bentley systems 19 Technion (1) 14 Bordeaux 12 U Cyprus 11 U Guelph 7 U Michigan 4 Michigan Tech U 3 Malcolm 2 Proteo 2 Technion (2) 1 Battle of Water Sensor Networks competition 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 31

32 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 32 Different objective functions give different sensor placements Population affected Detection likelihood

33 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 33 CELF is 10 times faster than greedy hill-climbing!

34 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 34 = I have 10 minutes. Which blogs should I read to be most up to date?? = Who are the most influential bloggers?

35 Want to read things before others do. Detect blue & yellow soon but miss red. Detect all stories but late. 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 35

36 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 36 Online bound is much tighter! 13% instead of 37% Old bound Our bound CELF

37 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 37 Heuristics perform much worse! One really needs to perform the optimization

38 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 38 CELF has 2 sub-algorithms. Which wins? Unit cost: CELF picks large popular blogs: instapundit.com, michellemalkin.com Cost-benefit: Cost proportional to the number of posts We can do much better when considering costs

39 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 39 Problem: Then CEF picks lots of small blogs that participate in few cascades We pick best solution that interpolates between the costs f(s)=0.3 Score f(s)=0.4 We can get good solutions with few blogs and few posts f(s)=0.2 Each curve represents a set of solutions S with the same final reward f(s)

40 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, Part 2-40 We want to generalize well to future (unknown) cascades Limiting selection to bigger blogs improves generalization!

41 [Leskovec et al., KDD 07] 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 41 CELF runs 700 times faster than simple hillclimbing algorithm

10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.

Blogosphere, Memetracking Scale-Free Densification power law, Shrinking diameters Strength of weak ties, Core-periphery

Preferential attachment, Copying model Microscopic model of evolving networks Kronecker Graphs Decentralized search Models for

42 10/24/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, 42 Observations Models Algorithms Small diameter, Edge clustering Patterns of signed edge creation Viral Marketing, Blogosphere, Memetracking Scale-Free Densification power law, Shrinking diameters Strength of weak ties, Core-periphery Erdös-Renyi model, Small-world model Structural balance, Theory of status Independent cascade model, Game theoretic model Preferential attachment, Copying model Microscopic model of evolving networks Kronecker Graphs Decentralized search Models for predicting edge signs Influence maximization, Outbreak detection, LIM PageRank, Hubs and authorities Link prediction, Supervised random walks Community detection: Girvan-Newman, Modularity

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu Find most influential set S of size k: largest expected cascade size f(s) if set S is activated