An Efficient reconciliation algorithm for social networks

1 An Efficient reconciliation algorithm for social networks Silvio Lattanzi (Google Research NY) Joint work with: Nitish Korula (Google Research NY) ICERM Stochastic Graph Models

2 Outline Graph reconciliation Model and theoretical results. Experimental results From theory to practice. Open problems and future directions

3 Graph reconciliation

4 Real world motivations

5 Real world motivations Intra-language network

6 Real world motivations Intra-language network Inter-language network

7 Real world motivations Can we use intra-language information to improve the inter-language graph?

14 Graph reconciliation problem Given two networks, identify as many users as possible across them. Applications: social networks, ontology reconciliation.

15 Previous work Problem of reconciliation introduced by Novak et al.

16 Previous work Problem of reconciliation introduced by Novak et al. Two main approaches: - ML on user profile features (name, location, image)

17 Previous work Problem of reconciliation introduced by Novak et al. Two main approaches: - ML on user profile features (name, location, image) - ML on neighborhood topology

18 Previous work Problem of reconciliation introduced by Novak et al. Two main approaches: - ML on user profile features (name, location, image) - ML on neighborhood topology Limitations:

19 Previous work Very rich literature on de-anonymization. Two relevant works: - Backstrom et al. propose an active and a passive attack

25 Previous work Very rich literature on de-anonymization. Two relevant works: - Backstrom et al. propose an active and a passive attack - Narayanan and Shmatikov present a successful de-anonymization attack

26 Narayanan and Shmatikov experiment Ground truth matching across the two social networks

27 Narayanan and Shmatikov experiment Ground truth matching across the two social networks 80 me-links

28 Narayanan and Shmatikov experiment Ground truth matching across the two social networks 80 me-links They could re-identify 30.8% of the mappings.

29 Narayanan and Shmatikov experiment Algorithm:

35 Narayanan and Shmatikov experiment Algorithm: Why? Is it necessary to have high degree me-links?

36 Abstraction Input: two graphs and a set of trusted matches. We want to maximize the number of final matches.

37 Is the problem tractable? The problem is similar to graph isomorphism.

38 Is the problem tractable? The problem is similar to graph isomorphism. It seems even harder, because we want to detect similar structure.

40 Abstraction Formalization of the problem: Underlying social network

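41 Abstraction Formalization of the problem: Underlying social network. Delete the edges independently to obtain two copies, each edge surviving in copy $i$ with probability $p_i$ ($i = 1, 2$).

42 Abstraction Formalization of the problem: Underlying social network. Delete the edges independently to obtain two copies, each edge surviving in copy $i$ with probability $p_i$ ($i = 1, 2$). Initial matchings.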
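The formalization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `make_copies`, the parameter name `seed_prob` (the probability that a node comes with a trusted initial match), and the reading of $p_1$, $p_2$ as edge survival probabilities are not taken from the slides.

```python
import random


def make_copies(edges, p1, p2, seed_prob, rng):
    """Build two noisy copies of an undirected graph plus a seed set.

    edges:     iterable of (u, v) pairs of the underlying network.
    p1, p2:    probability that an edge survives in copy 1 / copy 2
               (assumed reading of the slide's p_1, p_2).
    seed_prob: probability that a node is an initial trusted match
               (hypothetical parameter name).
    """
    g1, g2, nodes = set(), set(), set()
    for u, v in edges:
        nodes.update((u, v))
        e = (min(u, v), max(u, v))      # canonical undirected edge
        if rng.random() < p1:           # independent coin for copy 1
            g1.add(e)
        if rng.random() < p2:           # independent coin for copy 2
            g2.add(e)
    seeds = {v for v in sorted(nodes) if rng.random() < seed_prob}
    return g1, g2, seeds


if __name__ == "__main__":
    rng = random.Random(0)
    triangle = [(0, 1), (1, 2), (0, 2)]
    print(make_copies(triangle, 0.8, 0.8, 0.5, rng))
```

Passing an explicit `random.Random` instance keeps the experiment reproducible.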

43 Questions Given a constant fraction of me-links, can we reconcile the entire network? If we have k me-links, what fraction of the network can we reconcile?

44 Underlying social network Without additional assumptions on the underlying network, the problem still seems very hard.

45 Underlying social network Without additional assumptions on the underlying network, the problem still seems very hard. We study two different models for social networks: - G(n,p) - Preferential attachment

46 Our algorithm Algorithm: Narayanan Shmatikov + degree bucketing + acceptance threshold
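A minimal sketch of how the three ingredients on this slide could fit together, assuming the following reading: a candidate pair is scored by the number of already matched neighbor pairs it shares, nodes are processed from high-degree buckets down to low-degree ones, and a pair is accepted only if the score reaches a threshold and each node is the other's best candidate. The names `reconcile`, `to_adj`, `threshold`, and `rounds` are illustrative, ties are broken arbitrarily, and this is not the authors' implementation.

```python
from collections import defaultdict


def to_adj(edges):
    """Turn an undirected edge list into a dict node -> set of neighbours."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def reconcile(adj1, adj2, seeds, threshold=2, rounds=2):
    """Expand a seed matching (dict: node of graph 1 -> node of graph 2)."""
    match = dict(seeds)
    matched2 = set(match.values())
    degs = [len(n) for n in adj1.values()] + [len(n) for n in adj2.values()]
    max_deg = max(degs + [1])
    for _ in range(rounds):
        bucket = 1 << (max_deg.bit_length() - 1)   # largest power of 2 <= max_deg
        while bucket >= 1:
            # Score = number of matched neighbour pairs shared by (u, v).
            scores = defaultdict(int)
            for a, b in match.items():
                for u in adj1.get(a, ()):
                    if u in match or len(adj1.get(u, ())) < bucket:
                        continue
                    for v in adj2.get(b, ()):
                        if v in matched2 or len(adj2.get(v, ())) < bucket:
                            continue
                        scores[(u, v)] += 1
            # Keep the best candidate on each side, above the threshold.
            best1, best2 = {}, {}
            for (u, v), s in scores.items():
                if s < threshold:
                    continue
                if s > best1.get(u, (0, None))[0]:
                    best1[u] = (s, v)
                if s > best2.get(v, (0, None))[0]:
                    best2[v] = (s, u)
            # Accept only mutually best pairs.
            for u, (s, v) in best1.items():
                if best2[v][1] == u:
                    match[u] = v
                    matched2.add(v)
            bucket //= 2                           # move to lower degrees
    return match
```

Processing high-degree buckets first means that low-degree nodes are only considered once many of their neighbors may already be matched, and the threshold trades recall for precision.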

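47 G(n,p) Does the technique work if the underlying graph is random? [Figure labels: $p$, $p_1$, $p_2$]

48 G(n,p) Does the technique work if the underlying graph is random? For the same node $v$ in both copies, $\mathbb{E}[|N_{G_1}(v) \cap N_{G_2}(v)|] = (n-1)\,p\,p_1 p_2$; for two different nodes $u \ne v$, $\mathbb{E}[|N_{G_1}(u) \cap N_{G_2}(v)|] = (n-2)\,p^2 p_1 p_2$.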
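The two expectations can be checked empirically with a small simulation. The sketch below assumes that $p_1$ and $p_2$ are edge survival probabilities and ignores seed links; the function name `simulate` and the choice of comparing each node $u$ with node $u+1$ as the "different node" are illustrative.

```python
import random


def simulate(n, p, p1, p2, rng):
    """Average common-neighbour counts across two noisy copies of G(n, p).

    Returns (same, diff): the average of |N_G1(u) & N_G2(u)| over all u,
    and the average of |N_G1(u) & N_G2(u+1)| as a stand-in for u != v.
    """
    n1 = [set() for _ in range(n)]
    n2 = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() >= p:       # edge absent from the underlying graph
                continue
            if rng.random() < p1:       # edge survives in copy 1
                n1[u].add(v)
                n1[v].add(u)
            if rng.random() < p2:       # edge survives in copy 2
                n2[u].add(v)
                n2[v].add(u)
    same = sum(len(n1[u] & n2[u]) for u in range(n)) / n
    diff = sum(len(n1[u] & n2[(u + 1) % n]) for u in range(n)) / n
    return same, diff


if __name__ == "__main__":
    n, p, p1, p2 = 300, 0.2, 0.8, 0.8
    same, diff = simulate(n, p, p1, p2, random.Random(0))
    print("same node:", same, "expected", (n - 1) * p * p1 * p2)
    print("different nodes:", diff, "expected", (n - 2) * p * p * p1 * p2)
```

The gap between the two averages is what lets a score threshold separate correct pairs from incorrect ones.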

49 Concentration We assume $\frac{c \log n}{n} \le p \le \frac{1}{6}$ and $l, p_1, p_2 \in O(1)$. Two cases: - $n p p_1 p_2 l \ge 24 \log n$: a Chernoff bound is enough - $n p p_1 p_2 l \le 24 \log n$: we never make an error. With $x = (n-2)\,p^2 p_1 p_2$: $\Pr\left[\sum_{i=1}^{n} B_i \le 2\right] = (1-x)^n + n x (1-x)^{n-1} + \binom{n}{2} x^2 (1-x)^{n-2} = 1 - n^3 x^3 - o(n^3 x^3)$

50 More realistic model Preferential attachment: - $G_m^1$ is a single node with $m$ self-loops - $G_m^n$ is obtained from $G_m^{n-1}$ by adding a node and $m$ edges, attached with probability proportional to the current degrees
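A minimal generator for this process, as a sketch under one simplifying assumption: all $m$ edges of a new node are drawn from the degree distribution as it stood before the node arrived, so new nodes never create self-loops. The function name `preferential_attachment` is illustrative.

```python
import random


def preferential_attachment(n, m, rng):
    """Return the edge list of a preferential attachment graph on n nodes.

    Node 0 starts with m self-loops. Every later node adds m edges whose
    targets are chosen with probability proportional to current degree.
    Assumption: targets are drawn from the degrees before the new node
    arrives, so only node 0 has self-loops.
    """
    edges = [(0, 0)] * m
    # Each node appears in `endpoints` once per unit of degree, so a
    # uniform choice from the list is a degree-proportional choice.
    endpoints = [0] * (2 * m)
    for new in range(1, n):
        targets = [rng.choice(endpoints) for _ in range(m)]
        for t in targets:
            edges.append((new, t))
            endpoints.extend((new, t))
    return edges


if __name__ == "__main__":
    from collections import Counter

    degree = Counter()
    for u, v in preferential_attachment(1000, 3, random.Random(0)):
        degree[u] += 1
        degree[v] += 1
    print(degree.most_common(5))    # the highest-degree nodes
```

The endpoint list is a standard trick that avoids recomputing the degree distribution at every step.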

51 Preferential attachment A bit harder: - Several nodes have constant degree, so we need a cascade - The objective is to reconcile a constant fraction of the network

52 Sketch of the proof For high-degree nodes we can use concentration results.

53 Sketch of the proof For high-degree nodes we can use concentration results. Different nodes of intermediate degree do not share many neighbors.

54 Sketch of the proof For high-degree nodes we can use concentration results. Different nodes of intermediate degree do not share many neighbors. High-degree nodes help to detect intermediate-degree nodes, which in turn help to detect small-degree nodes.

55 PA structural lemmas High degree nodes are early birds. Nodes inserted after time $\epsilon n$, for constant $\epsilon$, have degree in $o(\log^2 n)$.

56 PA structural lemmas High degree nodes are early birds. Nodes inserted after time $\epsilon n$, for constant $\epsilon$, have degree in $o(\log^2 n)$. The rich get richer. For nodes of degree greater than $\log^2 n$, a constant fraction of their neighbors has been inserted after time $\epsilon n$, for constant $\epsilon$.

57 PA structural lemmas High degree nodes are early birds. Nodes inserted after time $\epsilon n$, for constant $\epsilon$, have degree in $o(\log^2 n)$. The rich get richer. For nodes of degree greater than $\log^2 n$, a constant fraction of their neighbors has been inserted after time $\epsilon n$, for constant $\epsilon$. First-mover advantage. All nodes inserted before time $n^{0.3}$ have degree at least $\log^3 n$.

58 High degree nodes are early birds [Figure: timeline of the process from $G_m^1$ to $G_m^n$, with time $\epsilon n$ marked]

61 High degree nodes are early birds Let $d_i$ be the degree at the beginning of a phase. The probability that a node increases its degree is dominated by the probability of a head in a toss of a biased coin that gives heads with probability $\frac{3 d_i}{n}$.

62 The rich get richer If at time $\epsilon n$ the node has degree less than $\frac{1}{2} d$, we are done. [Figure: timeline from $G_m^1$ to $G_m^n$ with time $\epsilon n$ marked]

63 The rich get richer If at time $\epsilon n$ the node has degree less than $\frac{1}{2} d$, we are done. Otherwise, the probability that the node increases its degree is dominated by the probability of a head in a toss of a biased coin that gives heads with probability $\frac{d}{2nm}$.

64 First-mover advantage From the Cooper and Frieze result on the cover time of PA graphs, with $D_k = d_{nm}(v_1) + d_{nm}(v_2) + \dots + d_{nm}(v_k)$: $\Pr\left(\left|D_k - 2\sqrt{2kn}\right| \ge 3\sqrt{mn \log mn}\right) \le (mn)^{-2}$ and $\Pr\left(d_n(v_{k+1}) = d+1 \mid D_k - 2k = s\right) \le \frac{s+d}{2N - 2k - s - d}$. Playing a bit with algebra we can get the final result.

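66 Matching high degree nodes $\mathbb{E}[|N_{G_1}(v) \cap N_{G_2}(v)|] = d(v)\,p_1 p_2 l$. By Chernoff, $|N_{G_1}(v) \cap N_{G_2}(v)| \ge \frac{7}{8}\,d(v)\,p_1 p_2 l$ w.h.p.

67 Matching high degree nodes $\mathbb{E}[|N_{G_1}(v) \cap N_{G_2}(v)|] = d(v)\,p_1 p_2 l$. By Chernoff, $|N_{G_1}(v) \cap N_{G_2}(v)| \ge \frac{7}{8}\,d(v)\,p_1 p_2 l$ w.h.p. [Figure: timeline from $G_m^1$ to $G_m^n$ with time $\epsilon n$ marked] $|N_{G_1}(u) \cap N_{G_2}(v)| \le d(v)\,p_1 p_2 l + o(d(v))$

68 Matching high degree nodes $\mathbb{E}[|N_{G_1}(v) \cap N_{G_2}(v)|] = d(v)\,p_1 p_2 l$. By Chernoff, $|N_{G_1}(v) \cap N_{G_2}(v)| \ge \frac{7}{8}\,d(v)\,p_1 p_2 l$ w.h.p. [Figure: timeline from $G_m^1$ to $G_m^n$ with time $\epsilon n$ marked] $|N_{G_1}(u) \cap N_{G_2}(v)| \le d(v)\,p_1 p_2 l + o(d(v))$: $u$ has degree at most $\tilde{O}(\sqrt{n})$, and so the probability of connecting to it is $o(1)$.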

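71 Bound the mismatch score [Figure: timeline from $G_m^1$ to $G_m^n$ with time $n^{0.3}$ marked]

72 Bound the mismatch score [Figure: the same timeline with further marks after $n^{0.3}$, whose exponents involve the factor $\frac{4}{3}$; $n^a = n^{0.3}$ and $n^b$ delimit one interval]

73 Bound the mismatch score With $n^a = n^{0.3}$ and $n^b$ delimiting one interval of the timeline, the probability that 3 nodes arriving between $n^a$ and $n^b$ point to the same two nodes is bounded by a triple sum over arrival times $i, j, k$ from $n^a$ to $n^b$, with polylogarithmic numerators and denominators in $(i-1)$, $(j-1)$, $(k-1)$. The resulting bound is of order $n^{2b-3a} \in o(1)$.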

75 Cascade [Figure: timeline from $G_m^1$ to $G_m^n$ with time $n^{0.3}$ marked]

76 Cascade [Figure: the same timeline after one phase, with time $n^{0.25}$ marked]

77 Cascade In each phase we fail to identify a small fraction of the nodes; in total we lose a small constant fraction.

80 Results Theorem 1: If the underlying network is a G(n,p) graph, it is possible to reconcile it completely. Theorem 2: If the underlying network is a PA graph, it is possible to reconcile a large fraction of it.

81 Experimental results

82 Experiments Experiments on different graphs:

83 PA experiment Are our theoretical results robust?

84 Scalability How does the algorithm scale with the size of the graph?

85 Facebook experiment How does the algorithm perform if the underlying graph is a social network?

86 Facebook experiment How does the algorithm perform if the underlying graph is a social network? 80% recall!! Can we explain it in theory?

87 Facebook cascade experiment What happens if we generate the underlying network using a cascade process? We recover almost all of the graph in the intersection. Can we explain it in theory?

88 Affiliation network model What happens if we delete all the edges inside a subset of the communities? More than 80% recall. Can we explain it in theory?

89 Reconcile different graphs DBLP: we generate two co-authorship graphs, one considering only publications in even years and the other only publications in odd years.

90 Reconcile different graphs DBLP: we generate two co-authorship graphs, one considering only publications in even years and the other only publications in odd years. Gowalla: we generate two co-checkin graphs, one considering only checkins in even years and the other only checkins in odd years.

91 Reconcile different graphs DBLP: we generate two co-authorship graphs, one considering only publications in even years and the other only publications in odd years. Gowalla: we generate two co-checkin graphs, one considering only checkins in even years and the other only checkins in odd years. German/French Wikipedia: we crawl the inter-language links, use a few of them as seeds, and check how many links we can recover.
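The even/odd split used for DBLP can be sketched as follows. The input format (a list of `(year, authors)` records) and the function name `split_by_year_parity` are assumptions made for illustration; the slides do not describe the actual data pipeline.

```python
from itertools import combinations


def split_by_year_parity(papers):
    """Build two co-authorship edge sets from (year, authors) records.

    Returns (even_edges, odd_edges). Each edge is a sorted pair of author
    names that share at least one paper in a year of that parity.
    """
    even, odd = set(), set()
    for year, authors in papers:
        target = even if year % 2 == 0 else odd
        # Every pair of distinct co-authors on a paper becomes an edge.
        for a, b in combinations(sorted(set(authors)), 2):
            target.add((a, b))
    return even, odd


if __name__ == "__main__":
    papers = [
        (2010, ["A", "B", "C"]),
        (2011, ["A", "B"]),
        (2012, ["C", "D"]),
    ]
    even_edges, odd_edges = split_by_year_parity(papers)
    print(sorted(even_edges))
    print(sorted(odd_edges))
```

Authors who appear in both graphs form the ground truth against which recall can be measured.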

92 Reconcile different graphs Recall for Wikipedia ~30%

93 Reconcile different graphs We have really good performance for high degree nodes

94 Open problems and future directions

95 Extensions Other models of underlying graphs. Other models of network generation. Adversarial underlying network, errors in seed links.

96 Limitation of the current model A user's degree varies across different social networks. How can we model this more general setting?

97 Better algorithm Currently exploring only the direct neighborhood. Can we design better algorithms?

98 Thanks!
