Basic Assumptions. Chapter 4. Link Analysis & Authority Ranking






ISWeb - Information Systems & Semantic Web, University of Koblenz-Landau
Chapter 4: Link Analysis & Authority Ranking
Sergej Sizov, Web Mining, Summer term 2007

Link Analysis: Early Approaches

Basic Assumptions
- Hyperlinks contain information about the human judgment of a site.
- The more incoming links to a site, the more it is judged important (Bray 1996):
  - The visibility of a site is measured by the number of other sites pointing to it.
  - The luminosity of a site is measured by the number of other sites to which it points.
- Limitation: failure to capture the relative importance of different parent (child) sites.

Information Retrieval, Summer Term 2008, Sergej Sizov, Chapter 4: Link Analysis and Authority Ranking

Early Approaches (1)
(Mark 1988) To calculate the score S of a document at vertex v:
  $S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$
- v: a vertex in the hypertext graph G = (V, E)
- S(v): the global score
- s(v): the score if the document is isolated
- ch[v]: children of the document at vertex v
Limitations:
- Requires G to be a directed acyclic graph (DAG).
- If v has a single link to w, then S(v) > S(w).
- If v has a long path to w and s(v) < s(w), then still S(v) > S(w), which is unreasonable.

Early Approaches (2)
(Marchiori 1997) Hyper information should complement textual information to obtain the overall information:
  $S(v) = s(v) + h(v)$
- S(v): overall information
- s(v): textual information
- h(v): hyper information
  $h(v) = \sum_{w \in ch[v]} F^{r(v,w)} S(w)$
- F: a fading constant, $F \in (0, 1)$
- r(v, w): the rank of w after sorting the children of v by S(w)
This is a remedy for the previous approach (Mark 1988).

Improving Precision by Authority Scores
Goal: higher ranking of URLs with high authority regarding volume, significance, freshness, and authenticity of content, in order to improve the precision of search results.
Approaches (all interpreting the Web as a directed graph G):
- citation or impact rank(q) proportional to indegree(q)
- PageRank (by Lawrence Page)
- HITS algorithm (by Jon Kleinberg)
Combining relevance and authority ranking:
- by weighted sum with appropriate coefficients (Google)
- by initial relevance ranking and iterative improvement via authority ranking (HITS)

Web Structure: Small Diameter
- Small World Phenomenon (Milgram 1967)
- Studies on Internet Connectivity (1999)
[Figures: Internet connectivity maps. Sources: Bill Cheswick and Hal Burch; KC Claffy]
- These suggested a small world phenomenon: the Web is a low-diameter graph (diameter = max {shortest path (x,y) | nodes x and y}).
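The hyper-information score of Marchiori (1997) described earlier, S(v) = s(v) + sum over children w of F^r(v,w) S(w), can be sketched in a few lines. This is only an illustration under stated assumptions: the graph is assumed to be a DAG, children are sorted by descending S(w), and ranks start at 1. The names `overall_score`, `children` and `text_score` are hypothetical and do not come from the slides.

```python
def overall_score(v, children, text_score, fading=0.5, memo=None):
    """Return S(v) = s(v) + h(v) on a DAG (assumption: no cycles).

    children:   dict mapping a vertex to the list of its children
    text_score: dict mapping a vertex to its textual score s(v)
    fading:     the fading constant F in (0, 1)
    """
    if memo is None:
        memo = {}
    if v not in memo:
        # Overall scores of the children, best child first (assumed order).
        child_scores = sorted(
            (overall_score(w, children, text_score, fading, memo)
             for w in children.get(v, [])),
            reverse=True,
        )
        # h(v): each child is damped by F raised to its rank (ranks start at 1).
        hyper = sum(fading ** r * s for r, s in enumerate(child_scores, start=1))
        memo[v] = text_score[v] + hyper
    return memo[v]


children = {"a": ["b", "c"]}
text_score = {"a": 1.0, "b": 4.0, "c": 2.0}
score_a = overall_score("a", children, text_score)  # 1 + 0.5*4 + 0.25*2 = 3.5
```

Because lower-ranked children are damped more strongly, a page no longer outscores a child merely by linking to it, which is the weakness the Mark (1988) scheme had.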

Web Structure: Connected Components
Study of Web Graph (Broder et al. 2000)
[Figure: connected components of the Web graph; SCC = strongly connected component. Source: A.Z. Broder et al., WWW 2000]
- The strongly connected core tends to have small diameter.

Web Structure: Power-Law Degrees
Study of Web Graph (Broder et al. 2000)
- Power-law distributed degrees: $P[\text{degree} = k] \sim (1/k)^{\alpha}$ with $\alpha \approx 2.1$ for indegrees and $\alpha \approx 2.7$ for outdegrees.

The Random Surfer Model
Idea: incoming links are endorsements and increase page authority; authority is higher if links come from high-authority pages.
- Authority(page q) = stationary probability of visiting q
- Random walk: uniformly random choice of links + random jumps

PageRank as Eigenvector of Stochastic Matrix
- A stochastic matrix is an n x n matrix M with row sum $\sum_{j=1..n} M_{ij} = 1$ for each row i.
- The random surfer follows a stochastic matrix.
- Theorem: For every stochastic matrix M, all eigenvalues λ have the property $|\lambda| \le 1$, and there is an eigenvector x with eigenvalue 1 such that $x \ge 0$ and $\|x\| = 1$.
- This suggests the power iteration $x^{(i+1)} = M^T x^{(i)}$.
- But: the real Web graph has sinks, may be periodic, and is not strongly connected.

Random Surfer: Rank Sink Problem
- In general, many Web pages have no inlinks or no outlinks.
- This results in dangling edges in the graph. E.g.:
  - no parent: rank 0; the powers of $M^T$ converge to a matrix whose last column is all zero
  - no children: no solution; the powers of $M^T$ converge to the zero matrix

Google's PageRank
The same idea as in the random surfer model, with random jumps performed with probability ε: Authority(page q) = stationary probability of visiting q in a random walk with uniformly random choice of links + random jumps.
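The random walk with random jumps can be sketched as a power iteration. This is a minimal illustration, not Google's implementation: the function name `pagerank`, the default eps = 0.2, uniform jumps to all n pages, and the uniform treatment of pages without outlinks are assumptions made here.

```python
def pagerank(links, eps=0.2, iters=100):
    """Power iteration for the random surfer with random jumps.

    links: dict mapping every page to the list of pages it links to
    eps:   probability of a random jump (assumed uniform over all pages)
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Every page receives the random-jump mass eps / n.
        new_rank = {v: eps / n for v in nodes}
        for v in nodes:
            successors = links[v]
            if successors:
                # Follow one of the outgoing links uniformly at random.
                share = (1.0 - eps) * rank[v] / len(successors)
                for w in successors:
                    new_rank[w] += share
            else:
                # Assumption: a page without outlinks jumps uniformly,
                # which avoids the rank sink problem.
                for w in nodes:
                    new_rank[w] += (1.0 - eps) * rank[v] / n
        rank = new_rank
    return rank


ranks = pagerank({1: [2, 3], 2: [3], 3: [1]})
```

The random jumps make the chain irreducible and aperiodic, so the iteration converges regardless of the starting vector.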

Markov Chains
[Example: weather chain with states 0: sunny, 1: cloudy, 2: rainy; its balance equations together with $p_0 + p_1 + p_2 = 1$ determine the stationary probabilities $p_0, p_1, p_2$]
- state set: finite or infinite; time: discrete or continuous
- state transition prob's: $p_{ij}$
- state prob's in step t: $p_i^{(t)} = P[S(t) = i]$
- Markov property: $P[S(t) = i \mid S(0), \ldots, S(t-1)] = P[S(t) = i \mid S(t-1)]$
- interested in stationary state probabilities:
  $p_j := \lim_{t \to \infty} p_j^{(t)} = \lim_{t \to \infty} \sum_k p_k^{(t-1)} p_{kj}$, so that $p_j = \sum_k p_k p_{kj}$ and $\sum_j p_j = 1$
  These are guaranteed to exist for irreducible, aperiodic, finite Markov chains.

Markov Chains: Formal Definition
A stochastic process is a family of random variables $\{X(t) \mid t \in T\}$. T is called the parameter space, and the domain M of X(t) is called the state space. T and M can be discrete or continuous.
A stochastic process is called a Markov process if for every choice of $t_1, \ldots, t_{n+1}$ from the parameter space and every choice of $x_1, \ldots, x_{n+1}$ from the state space the following holds:
  $P[X(t_{n+1}) = x_{n+1} \mid X(t_1) = x_1 \wedge X(t_2) = x_2 \wedge \ldots \wedge X(t_n) = x_n] = P[X(t_{n+1}) = x_{n+1} \mid X(t_n) = x_n]$
A Markov process with discrete state space is called a Markov chain. A canonical choice of the state space is the natural numbers.
Notation for Markov chains with discrete parameter space: $X_n$ rather than $X(t_n)$, with n = 0, 1, 2, ...

Properties of Markov Chains with Discrete Parameter Space (1)
The Markov chain $X_n$ with discrete parameter space is
- homogeneous if the transition probabilities $p_{ij} := P[X_{n+1} = j \mid X_n = i]$ are independent of n
- irreducible if every state is reachable from every other state with positive probability:
  $\sum_{n=1}^{\infty} P[X_n = j \mid X_0 = i] > 0$ for all i, j
- aperiodic if every state i has period 1, where the period of i is the greatest common divisor of all (recurrence) values n for which
  $P[X_n = i \wedge X_k \ne i \text{ for } k = 1, \ldots, n-1 \mid X_0 = i] > 0$

Properties of Markov Chains with Discrete Parameter Space (2)
The Markov chain $X_n$ with discrete parameter space is
- positive recurrent if for every state i the recurrence probability is 1 and the mean recurrence time is finite:
  $\sum_{n=1}^{\infty} P[X_n = i \wedge X_k \ne i \text{ for } k = 1, \ldots, n-1 \mid X_0 = i] = 1$
  $\sum_{n=1}^{\infty} n \cdot P[X_n = i \wedge X_k \ne i \text{ for } k = 1, \ldots, n-1 \mid X_0 = i] < \infty$
- ergodic if it is homogeneous, irreducible, aperiodic, and positive recurrent.

Results on Markov Chains with Discrete Parameter Space (1)
For the n-step transition probabilities $p_{ij}^{(n)} := P[X_n = j \mid X_0 = i]$ the following holds:
  $p_{ij}^{(n)} = \sum_k p_{ik}^{(n-1)} p_{kj}$ with $p_{ij}^{(1)} := p_{ij}$
  $p_{ij}^{(n)} = \sum_k p_{ik}^{(n-l)} p_{kj}^{(l)}$ for $1 \le l \le n-1$
  in matrix notation: $P^{(n)} = P^n$
For the state probabilities after n steps $\pi_j^{(n)} := P[X_n = j]$ the following holds (Chapman-Kolmogorov equation):
  $\pi_j^{(n)} = \sum_i \pi_i^{(0)} p_{ij}^{(n)}$ with initial state probabilities $\pi_i^{(0)}$
  in matrix notation: $\Pi^{(n)} = \Pi^{(0)} P^{(n)}$

Results on Markov Chains with Discrete Parameter Space (2)
Every homogeneous, irreducible, aperiodic Markov chain with a finite number of states is positive recurrent and ergodic.
For every ergodic Markov chain there exist stationary state probabilities $\pi_j := \lim_{n \to \infty} \pi_j^{(n)}$.
These are independent of $\Pi^{(0)}$ and are the solutions of the following system of linear equations (balance equations):
  $\pi_j = \sum_i \pi_i p_{ij}$ for all j
  $\sum_j \pi_j = 1$
  in matrix notation (with the 1 x n row vector Π): $\Pi = \Pi P$, $\Pi \mathbf{1} = 1$
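The stationary probabilities of a small ergodic chain can be approximated by repeatedly applying Π := Π P. The sketch below uses a three-state weather chain whose transition probabilities are assumed here purely for illustration (they are not taken from the slides); `stationary` is a hypothetical helper name.

```python
def stationary(P, iters=500):
    """Approximate the stationary row vector Pi with Pi = Pi P.

    P: row-stochastic transition matrix as a list of rows, P[i][j] = p_ij.
    Assumes the chain is ergodic, so the limit exists and is unique.
    """
    n = len(P)
    pi = [1.0 / n] * n  # any initial distribution works for an ergodic chain
    for _ in range(iters):
        # One step of the chain: pi_j <- sum_i pi_i * p_ij
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Assumed example chain: states 0 = sunny, 1 = cloudy, 2 = rainy.
P = [
    [0.8, 0.2, 0.0],
    [0.5, 0.0, 0.5],
    [0.4, 0.3, 0.3],
]
pi = stationary(P)  # exact solution of the balance equations: 55/79, 14/79, 10/79
```

For this matrix the balance equations can also be solved by hand, which gives the exact fractions quoted in the comment and provides a check on the iteration.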

Page Rank as a Markov Chain Model
Model a random walk of a Web surfer as follows:
- follow outgoing hyperlinks with uniform probabilities
- perform a random jump with probability ε
This yields an ergodic Markov chain.
The PageRank of a URL is the stationary visiting probability of the URL in the above Markov chain.
Further generalizations have been studied (e.g. random walk with back button etc.).
Drawback of the Page Rank method: Page Rank is query-independent and orthogonal to relevance.

Example: Page Rank Computation
[Figure: three-page example graph with jump probability ε, transition matrix P and power-iteration steps $\Pi^{(0)}, \Pi^{(1)}, \ldots$]
Balance equations:
  $\pi_1 = 0.1 \pi_2 + 0.9 \pi_3$
  $\pi_2 = 0.5 \pi_1 + 0.1 \pi_3$
  $\pi_3 = 0.5 \pi_1 + 0.9 \pi_2$
  $\pi_1 + \pi_2 + \pi_3 = 1$
Solution: $\pi_1 \approx 0.3776$, $\pi_2 \approx 0.2282$, $\pi_3 \approx 0.3942$

HITS Algorithm: Hyperlink-Induced Topic Search (1)
Idea:
Determine
- good content sources: Authorities (high indegree)
- good link sources: Hubs (high outdegree)
Find
- better authorities that have good hubs as predecessors
- better hubs that have good authorities as successors
For the Web graph G = (V, E) define for nodes p, q ∈ V the
- authority score $x_q = \sum_{(p,q) \in E} y_p$
- hub score $y_p = \sum_{(p,q) \in E} x_q$

HITS Algorithm (2)
Authority and hub scores in matrix notation, with adjacency matrix A:
  $x = A^T y$ and $y = A x$
Iteration:
  $x := A^T y := A^T A x$
  $y := A x := A A^T y$
Using linear algebra, we can prove:
- x and y converge
- x and y are eigenvectors of $A^T A$ and $A A^T$, respectively
Intuitive interpretation:
- $M^{(auth)} := A^T A$ is the cocitation matrix: $M^{(auth)}_{ij}$ is the number of nodes that point to both i and j
- $M^{(hub)} := A A^T$ is the bibliographic-coupling matrix: $M^{(hub)}_{ij}$ is the number of nodes to which both i and j point

Implementation of the HITS Algorithm
1) Determine a sufficient number of root pages via relevance ranking (e.g. using tf*idf ranking).
2) Add all successors of root pages.
3) For each root page add up to d predecessors.
4) Compute iteratively the authority and hub scores of this base set, with initialization $x_q := y_p := 1 / |\text{base set}|$ and normalization after each iteration. This converges to the principal eigenvector (the eigenvector with the largest eigenvalue, in the case of multiplicity 1).
5) Return pages in descending order of authority scores (e.g. the largest elements of vector x).
Drawback of the HITS algorithm: relevance ranking within the root set is not considered.

Example: HITS Algorithm
[Figure: example graph with root set and base set]
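The iteration x := A^T y, y := A x with normalization after each step can be sketched directly on an edge list. This is a minimal illustration assuming Euclidean normalization and a fixed number of iterations; the function name `hits` and the tiny example graph are hypothetical, and the root-set and base-set construction is omitted.

```python
import math


def hits(edges, iters=50):
    """Iterate authority and hub scores over a list of (p, q) links."""
    nodes = sorted({v for edge in edges for v in edge})
    auth = {v: 1.0 / len(nodes) for v in nodes}
    hub = dict(auth)
    for _ in range(iters):
        # x_q = sum of hub scores y_p over all links (p, q)
        new_auth = {v: 0.0 for v in nodes}
        for p, q in edges:
            new_auth[q] += hub[p]
        # y_p = sum of authority scores x_q over all links (p, q)
        new_hub = {v: 0.0 for v in nodes}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        # Normalize both vectors to unit Euclidean length (assumed norm).
        for scores in (new_auth, new_hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for v in scores:
                scores[v] /= norm
        auth, hub = new_auth, new_hub
    return auth, hub


# Page "c" is linked from two hubs, page "d" from one of them.
auth, hub = hits([("a", "c"), ("b", "c"), ("a", "d")])
```

For this graph the cocitation matrix restricted to c and d is [[2, 1], [1, 1]], so the authority scores converge to its principal eigenvector, in which d's score is about 0.618 times c's score.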

HITS Improvements (Bharat and Henzinger, 1998)
HITS problems:
1) A document can contain many identical links to the same document on another host.
2) Links are generated automatically (e.g. messages posted on newsgroups).
Solutions:
1) Assign weights to identical multiple edges that are inversely proportional to their multiplicity.
2) Prune irrelevant nodes, or regulate the influence of a node with a relevance weight.

Improved HITS Algorithm
Potential weaknesses of the HITS algorithm:
- irritating links (automatically generated links, spam, etc.)
- topic drift (e.g. from "Jaguar car" to "car" in general)
Improvement:
- Introduce edge weights:
  - 0 for links within the same host
  - 1/k with k links from k URLs of the same host to 1 URL (xweight)
  - 1/m with m links from 1 URL to m URLs on the same host (yweight)
- Consider relevance weights w.r.t. the query topic (e.g. tf*idf)
Iterative computation of
- authority score $x_q = \sum_{(p,q) \in E} y_p \cdot \text{topic score}(p) \cdot \text{xweight}(p,q)$
- hub score $y_p = \sum_{(p,q) \in E} x_q \cdot \text{topic score}(q) \cdot \text{yweight}(p,q)$

SALSA
SALSA (Lempel, Moran 2001)
- Probabilistic extension of the HITS algorithm
- The random walk is carried out by following hyperlinks both in the forward and in the backward direction
- Two separate random walks:
  - Hub walk
  - Authority walk

Forming a Bipartite Graph in SALSA
[Figure: link graph and the bipartite hub/authority graph derived from it]

Random Walks
- Hub walk: follow a Web link from a page $u_h$ to a page $w_a$ (a forward link) and then immediately traverse a backlink going from $w_a$ to $v_h$, where (u,w) ∈ E and (v,w) ∈ E.
- Authority walk: follow a Web link from a page $w_a$ back to a page $u_h$ (a backward link) and then immediately traverse a forward link going from $u_h$ to a page $v_a$, where (u,w) ∈ E and (u,v) ∈ E.

Lempel and Moran (2001) showed theoretically that SALSA weights are more robust than HITS weights in the presence of the Tightly Knit Community (TKC) effect.
- This effect occurs when a small collection of pages (related to a given topic) is connected so that every hub links to every authority; it includes the mutual reinforcement effect as a special case.
- The pages in a community connected in this way can be ranked highly by HITS, higher than pages in a much larger collection where only some hubs link to some authorities.

SALSA: Random Walk on Hubs and Authorities
View each node v of the link graph as two nodes $v_h$ and $v_a$.
Construct the bipartite undirected graph G'(V',E') from the link graph G(V,E):
  $V' = \{v_h \mid v \in V \text{ and outdegree}(v) > 0\} \cup \{v_a \mid v \in V \text{ and indegree}(v) > 0\}$
  $E' = \{(v_h, w_a) \mid (v,w) \in E\}$
Stochastic hub matrix H:
  $h_{ij} = \sum_k \frac{1}{\text{degree}(i_h)} \cdot \frac{1}{\text{degree}(k_a)}$ for hubs i, j and k ranging over all nodes with $(i_h, k_a), (k_a, j_h) \in E'$
Stochastic authority matrix A:
  $a_{ij} = \sum_k \frac{1}{\text{degree}(i_a)} \cdot \frac{1}{\text{degree}(k_h)}$ for authorities i, j and k ranging over all nodes with $(i_a, k_h), (k_h, j_a) \in E'$
The corresponding Markov chains are ergodic on each connected component.
The stationary solutions for these Markov chains are: $\pi[v_h] \sim \text{outdegree}(v)$ for H and $\pi[v_a] \sim \text{indegree}(v)$ for A.
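The claim that the SALSA authority chain has stationary probabilities proportional to indegree can be checked numerically on a small graph. The sketch below builds the authority transition probabilities as described above (one backward step, then one forward step) and iterates the chain; it assumes a link graph without duplicate edges whose bipartite graph is connected. The helper names `authority_chain` and `iterate_chain` are hypothetical.

```python
from collections import defaultdict


def authority_chain(edges):
    """Transition probabilities a_ij of the SALSA authority walk."""
    preds = defaultdict(list)  # authority -> hubs pointing to it
    succs = defaultdict(list)  # hub -> authorities it points to
    for u, w in edges:
        preds[w].append(u)
        succs[u].append(w)
    chain = {i: defaultdict(float) for i in preds}
    for i in preds:
        for k in preds[i]:       # backward step to a hub k of i
            for j in succs[k]:   # forward step from k to an authority j
                chain[i][j] += 1.0 / len(preds[i]) / len(succs[k])
    return chain


def iterate_chain(chain, iters=200):
    """Approximate the stationary distribution by repeated transitions."""
    pi = {i: 1.0 / len(chain) for i in chain}
    for _ in range(iters):
        new_pi = {i: 0.0 for i in chain}
        for i, row in chain.items():
            for j, p in row.items():
                new_pi[j] += pi[i] * p
        pi = new_pi
    return pi


edges = [("a", "c"), ("b", "c"), ("a", "d")]
pi = iterate_chain(authority_chain(edges))  # indegree(c) = 2, indegree(d) = 1
```

With indegrees 2 and 1 out of 3 edges, the stationary probabilities come out as 2/3 for c and 1/3 for d, in line with π[v_a] ~ indegree(v).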

Comparison and Extensions
The literature contains a plethora of variations on PageRank and HITS. Key points are:
- mutual reinforcement between hubs and authorities
- re-scaling of edge weights (normalization)
Unified notation (for a link graph with n nodes):
- L: n x n link matrix, $L_{ij} = 1$ if there is an edge (i,j), 0 else
- din: n x 1 vector with $din_i = \text{indegree}(i)$, $Din_{n \times n} = \text{diag}(din)$
- dout: n x 1 vector with $dout_i = \text{outdegree}(i)$, $Dout_{n \times n} = \text{diag}(dout)$
- x: n x 1 authority vector
- y: n x 1 hub vector
- Iop: operation applied to incoming links
- Oop: operation applied to outgoing links

Unified Framework for HITS, PageRank & Co
- HITS: x = Iop(y), y = Oop(x) with $Iop(y) = L^T y$, $Oop(x) = L x$
- PageRank: x = Iop(x) with $Iop(x) = P^T x$, where $P^T = L^T Dout^{-1}$ or $P^T = \alpha L^T Dout^{-1} + (1 - \alpha)(1/n) e e^T$
- SALSA (PageRank-style computation with mutual reinforcement):
  x = Iop(y) with $Iop(y) = P^T y$, where $P^T = L^T Dout^{-1}$
  y = Oop(x) with $Oop(x) = Q x$, where $Q = L Din^{-1}$
- Other models of link analysis can be cast into this framework, too.

Link Analysis: Stability
- Are the link analysis algorithms based on eigenvectors stable, in the sense that results don't change significantly?
- The connectivity of a portion of the graph is changed arbitrarily. How will this affect the results of the algorithms?

Stability of HITS (Ng et al. 2001)
- A bound on the number of hyperlinks k that can be added to or deleted from one page without affecting the authority or hubness weights.
- It is possible to perturb a symmetric matrix by a quantity that grows as δ and that produces a constant perturbation of the dominant eigenvector.
- δ: eigengap $\lambda_1 - \lambda_2$
- d: maximum outdegree of G

Stability of PageRank
Ng et al. (2001):
- V*: the set of vertices touched by the perturbation
- The parameter ε of the mixture model has a stabilization role.
- If the set of pages affected by the perturbation has a small rank, the overall change will also be small.
A tighter bound was given by Bianchini et al., with a factor δ(j) that depends on the edges incident on j.

Link Analysis: Advanced Issues
- Topic-specific authority ranking
- Personalized PageRank
  - Personalization vectors
  - Query logs
  - Click streams
- Efficiency issues
- Online Page Importance (OPIC)
- Time-aware link analysis

Web Models: Understanding Large Graphs
- What are the statistics of real life networks?
- Can we explain how the networks were generated?

Small World Experiment (Six degrees of separation)
Stanley Milgram, 1967:
- Letters were handed out to people in Nebraska to be sent to a target (a stock broker) in Boston.
- People were instructed to pass on the letters to someone they knew on a first-name basis.
- The letters that reached the destination followed paths of average length 5.5 (i.e. around 6).
Duncan Watts, 2001:
- Milgram's experiment was recreated on the Internet using an e-mail message as the "package" that needed to be delivered, with 48,000 senders and 19 targets (in 157 countries).
- The average number of intermediaries was also around 6.
See also: the Kevin Bacon game, the Erdös number, etc.

Kevin Bacon Experiment
Craig Fass, Brian Turtle and Mike Ginelli, 1994: motivated by Bacon's most recent movie "The Air Up There" and a discussion of his career.
- Vertices: actors and actresses
- Edge between u and v if they appeared in a movie together
[Table: Kevin Bacon's number of movies, number of co-actors, and average separation]
Is Kevin Bacon the most connected actor?
[Table: actors ranked by average distance, with columns rank, name, average distance, # of movies, # of links; listed names: Rod Steiger, Donald Pleasence, Martin Sheen, Christopher Lee, Robert Mitchum, Charlton Heston, Eddie Albert, Robert Vaughn, Donald Sutherland, John Gielgud, Anthony Quinn, James Earl Jones, ..., Kevin Bacon]

Quantities of Interest
- Connected components: how many, and how large?
- Network diameter: maximum (worst-case) or average? Exclude infinite distances (disconnected components)? The small-world phenomenon.
- Clustering: to what extent do links tend to cluster locally? What is the balance between local and long-distance connections? What roles do the two types of links play?
- Degree distribution: what is the typical degree in the network? What is the overall distribution?

Graph theory: undirected graph notation
- Graph G = (V,E)
- V = set of vertices
- E = set of edges
[Example: undirected graph on five nodes with five edges]

Graph theory: directed graph notation
- Graph G = (V,E)
- V = set of vertices
- E = set of edges
[Example: directed graph on the same five nodes]

Undirected graph: degree distribution
- degree d(i) of node i: number of edges incident on node i
- degree sequence: [d(1), d(2), d(3), d(4), d(5)]
- degree distribution: list of (degree, number of nodes with that degree) pairs
[Example: degree sequence and degree distribution of the five-node example graph]

Directed graph: in/outdegrees
- in-degree $d_{in}(i)$ of node i: number of edges pointing to node i
- out-degree $d_{out}(i)$ of node i: number of edges leaving node i
[Example: in-degree sequence and out-degree sequence of the five-node example graph]

Degree distributions
- frequency $f_k$ = fraction of nodes with degree k = probability of a randomly selected node to have degree k
- Problem: find the probability distribution that best fits the observed data
[Plot: frequency $f_k$ versus degree k]
- A distribution that is highly concentrated around the mean makes the probability of very high degree nodes exponentially small.

Power-law distributions
- The degree distributions of most real-life networks follow a power law: $p(k) = C k^{-\alpha}$
- Right-skewed / heavy-tail distribution: there is a non-negligible fraction of nodes that has very high degree (hubs)
- Scale-free: no characteristic scale, the average is not informative

Power-law signature
- A power-law distribution $p(k) = C k^{-\alpha}$ gives a line in the log-log plot: $\log p(k) = -\alpha \log k + \log C$
[Plots: frequency versus degree, and log frequency versus log degree with slope $-\alpha$]
- α: power-law exponent (typically $2 \le \alpha \le 3$)

Power-law: Examples
[Figure: degree distributions of real networks, taken from Newman 2003]
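The power-law signature, a straight line with slope -α in the log-log plot, can be illustrated by fitting a line to log p(k) against log k. The sketch below uses exact synthetic values of p(k) = C k^(-α) with assumed parameters α = 2.1 and C = 0.5; `loglog_fit` is a hypothetical helper. A least-squares fit on log-log data is shown only to make the relation visible and is not meant as a robust estimator for noisy empirical degree counts.

```python
import math


def loglog_fit(ks, ps):
    """Least-squares line through the points (log k, log p(k)).

    Returns (slope, intercept); for a power law the slope equals -alpha
    and the intercept equals log C.
    """
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in ps]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


alpha, C = 2.1, 0.5                      # assumed example parameters
ks = list(range(1, 101))
ps = [C * k ** -alpha for k in ks]       # synthetic power-law values
slope, intercept = loglog_fit(ks, ps)    # slope is -alpha, intercept is log C
```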


Collective Statistics (M. Newman 2003)
[Table: collective statistics of real networks]

Web Structure: Power-Law Degrees
Study of Web Graph (Broder et al. 2000)
- Power-law distributed degrees: $P[\text{degree} = k] \sim (1/k)^{\alpha}$ with $\alpha \approx 2.1$ for indegrees and $\alpha \approx 2.7$ for outdegrees.

Clustering (Transitivity) coefficient
Measures the density of triangles (local clusters) in the graph. There are two different ways to measure it.
The ratio of the means:
  $C^{(1)} = \frac{\sum_i \text{triangles centered at node } i}{\sum_i \text{triples centered at node } i}$
Example: $C^{(1)} = \frac{3}{1 + 1 + 6} = \frac{3}{8}$

Clustering (Transitivity) coefficient
Clustering coefficient for node i:
  $C_i = \frac{\text{triangles centered at node } i}{\text{triples centered at node } i}$
The mean of the ratios:
  $C^{(2)} = \frac{1}{n} \sum_i C_i$
Example: $C^{(2)} = \frac{1}{5} \left(1 + 1 + \frac{1}{6}\right) = \frac{13}{30}$
The two clustering coefficients give different measures: $C^{(2)}$ increases with nodes with low degree.
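Both clustering coefficients, the ratio of the means and the mean of the ratios, can be computed directly from adjacency sets. The five-node graph below (a triangle 1-2-3 with two extra neighbors 4 and 5 attached to node 3) is an assumption chosen here because it yields the values 3/8 and 13/30; nodes with fewer than two neighbors are assigned C_i = 0. The function name `clustering_coefficients` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def clustering_coefficients(adj):
    """Return (C1, C2) for an undirected graph given as adjacency sets.

    C1: sum of triangles centered at each node / sum of triples centered there
    C2: mean over all nodes of the per-node ratio C_i
    """
    triangles, triples = {}, {}
    for v, nbrs in adj.items():
        pairs = list(combinations(sorted(nbrs), 2))
        triples[v] = len(pairs)  # connected triples centered at v
        triangles[v] = sum(1 for a, b in pairs if b in adj[a])
    c1 = Fraction(sum(triangles.values()), sum(triples.values()))
    c2 = sum(
        Fraction(triangles[v], triples[v]) if triples[v] else Fraction(0)
        for v in adj
    ) / len(adj)
    return c1, c2


adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3}, 5: {3}}
c1, c2 = clustering_coefficients(adj)  # Fraction(3, 8), Fraction(13, 30)
```

The low-degree nodes 1 and 2 each contribute C_i = 1 to the mean of the ratios, which is why the second coefficient is larger than the first here.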


Collective Statistics (M. Newman 2003)
[Table: collective statistics of real networks]

Clustering coefficient for random graphs
- The probability of two of your neighbors also being neighbors is p, independent of local structure.
- Clustering coefficient: C = p
- When z is fixed: C = z/n = O(1/n)

Graphs: paths
- Path from node i to node j: a sequence of edges (directed or undirected) from node i to node j
- Path length: number of edges on the path
- Nodes i and j are connected if there is a path between them
- Cycle: a path that starts and ends at the same node

Graphs: shortest paths
- Shortest path from node i to node j
- Also known as BFS path, or geodesic path

Measuring the small world phenomenon
- $d_{ij}$ = length of the shortest path between i and j
- Diameter: $d = \max_{i,j} d_{ij}$
- Characteristic path length: $\ell = \frac{1}{n(n-1)/2} \sum_{i > j} d_{ij}$
- Harmonic mean: $\ell^{-1} = \frac{1}{n(n-1)/2} \sum_{i > j} d_{ij}^{-1}$
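The small-world measures above (diameter, characteristic path length, harmonic mean) all derive from the pairwise shortest-path lengths, which breadth-first search provides for unweighted graphs. The sketch below assumes a connected undirected graph given as adjacency sets; the function names and the five-node example graph are assumptions for illustration.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path lengths (number of edges) from source to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def path_statistics(adj):
    """Return (diameter, characteristic path length, harmonic mean length).

    Assumes the graph is connected, so every d_ij is finite.
    """
    all_dist = {v: bfs_distances(adj, v) for v in adj}
    d = [all_dist[i][j] for i, j in combinations(sorted(adj), 2)]
    diameter = max(d)
    characteristic = sum(d) / len(d)
    inverse_mean = sum(1.0 / x for x in d) / len(d)  # this is l^{-1}
    return diameter, characteristic, 1.0 / inverse_mean


adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3}, 5: {3}}
diameter, characteristic, harmonic = path_statistics(adj)  # 2, 1.5, 4/3
```

The harmonic form is the one that stays finite on disconnected graphs if unreachable pairs are counted with 1/d = 0; that extension is left out here.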

Degree correlations
- Do high degree nodes tend to link to high degree nodes?
- Newman computes the correlation coefficient of the degrees $x_i$, $y_i$ of the two endpoints of an edge i.

Collective Statistics (M. Newman 2003)
[Table: collective statistics of real networks]

Graphs: fully connected cliques
- Clique $K_n$: a graph that has all possible n(n-1)/2 edges

Directed Graph
- Strongly connected graph: there exists a path from every i to every j
- Weakly connected graph: if edges are made undirected, the graph is connected

Web Structure: Connected Components
Study of Web Graph (Broder et al. 2000)
[Figure: connected components of the Web graph; SCC = strongly connected component. Source: A.Z. Broder et al., WWW 2000]
- The strongly connected core tends to have small diameter.

Summary: real network properties
- Most nodes have only a small number of neighbors (degree), but there are some nodes with very high degree (power-law degree distribution): scale-free networks
- If a node x is connected to y and z, then y and z are likely to be connected: high clustering coefficient
- Most nodes are just a few edges away on average: small world networks

A Canonical Natural Network has
- Few connected components: often only 1 or a small number, independent of network size
- Small diameter: often a constant independent of network size (like 6), or perhaps growing only logarithmically with network size, or even shrinking? Typically exclude infinite distances.
- A high degree of clustering: considerably more so than for a random network; in tension with small diameter
- A heavy-tailed degree distribution: a small but reliable number of high-degree vertices; often of power law form

Measuring and modelling network properties
Around 1999:
- Watts and Strogatz, Dynamics and small-world phenomenon
- Faloutsos, On power-law relationships of the Internet Topology
- Kleinberg et al., The Web as a graph
- Barabasi and Albert, The emergence of scaling in real networks
Question: Is it possible that there is a unifying underlying generative process?

Basics: the Erdös-Rényi Model: G(N,p)
- All edges are equally probable and appear independently
- NW size N > 1 and probability p: distribution G(N,p)
  - each edge (u,v) chosen to appear with probability p
  - N(N-1)/2 trials of a biased coin flip
- The usual regime of interest is when p ~ 1/N and N is large
  - e.g. p = c/N for a small constant c, or p = log(N)/N
  - in expectation, each vertex will have a small number of neighbors
  - will then examine what happens when N → infinity
  - can thus study properties of large networks with bounded degree
- Degree distribution of a typical G drawn from G(N,p):
  - draw G according to G(N,p); look at a random vertex u in G
  - what is Pr[deg(u) = k] for any fixed k?
  - Poisson distribution with mean λ = p(N-1) ~ pN
  - sharply concentrated; not heavy-tailed
- Especially easy to generate NWs from G(N,p)

A Closely Related Model: G(N,m)
- For any fixed m <= N(N-1)/2, define distribution G(N,m):
  - choose uniformly at random from all graphs with exactly m edges
  - G(N,m) is like G(N,p) with p = m/(N(N-1)/2) ~ 2m/N^2
  - this intuition can be made precise, and is correct
  - if m = cN then p = 2c/(N-1) ~ 2c/N
  - mathematically trickier than G(N,p)

Evolution of a Random Network
- We have a large number n of vertices
- We start randomly adding edges one at a time
- At what time t will the network:
  - have at least one large connected component?
  - have a single connected component?
  - have small diameter?
  - have a large clique?
- How gradually or suddenly do these properties appear?

Random graph models G(N,p) and G(N,m): Recap
- Model G(N,p):
  - select each of the possible edges independently with prob. p
  - expected total number of edges is pN(N-1)/2
  - expected degree of a vertex is p(N-1)
  - degree will obey a Poisson distribution (not heavy-tailed)
- Model G(N,m):
  - select exactly m of the N(N-1)/2 edges to appear
  - all sets of m edges equally likely
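Generating a network from G(N,p) amounts to one biased coin flip per possible edge, which the following sketch makes explicit. The function name `sample_gnp`, the seed and the parameters N = 200, p = 0.05 are assumptions for illustration; the quadratic loop is fine for small N but is not how one would sample very large sparse graphs.

```python
import random


def sample_gnp(n, p, rng):
    """Draw an undirected graph from G(N,p) as a list of edges (u, v), u < v."""
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            # One of the N(N-1)/2 independent biased coin flips.
            if rng.random() < p:
                edges.append((u, v))
    return edges


n, p = 200, 0.05
edges = sample_gnp(n, p, random.Random(0))
mean_degree = 2 * len(edges) / n   # should be close to p * (n - 1) = 9.95
```

Each edge contributes to the degree of two vertices, hence the factor 2 in the mean degree; its expected value p(N-1) matches the mean of the Poisson approximation.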


Random Graphs: Degree Distributions
The degree distribution of G(N,p) is binomial:
$p(k) = B(n; k; p) = \binom{n}{k} p^k (1-p)^{n-k}$
where n is the number of potential neighbours of a vertex (n = N - 1, approximately N for large networks).
Assuming $z = np$ is fixed, as $n \to \infty$ the binomial $B(n; k; p)$ is approximated by a Poisson distribution:
$p(k) = P(k; z) = \frac{z^k}{k!} e^{-z}$
The distribution is highly concentrated around the mean, with a tail that drops exponentially.

Monotone Network Properties
Threshold or tipping for (say) connectivity:
- fewer than m(N) edges: graph almost certainly not connected
- more than m(N) edges: graph almost certainly connected
- made formal by examining the limit as N → infinity
Often we are interested in monotone graph properties:
- let G have the property
- add edges to G to obtain G'
- then G' must have the property also
Examples:
- G is connected
- G has diameter <= d (not exactly d)
- G has a clique of size >= k (not exactly k)
- d and k may depend on the network size N (How?)
It is difficult to study the emergence of non-monotone properties as the number of edges is increased: what would it mean?

Random Graphs: Thresholds for Monotone Properties
Consider the Erdos-Renyi G(N,m) model: select m edges at random to include in G.
Let P be some monotone property of graphs:
- P(G) = 1: G has the property
- P(G) = 0: G does not have the property
Let m(N) be some function of the network size N; it formalizes the idea that property P appears suddenly at m(N) edges.
Say that m(N) is a threshold function for P if:
- let m'(N) be any function of N
- look at the ratio r(N) = m'(N)/m(N) as N → infinity
- if r(N) → 0: the probability that P(G) = 1 in G(N,m'(N)) → 0
- if r(N) → infinity: the probability that P(G) = 1 in G(N,m'(N)) → 1
This is a purely structural definition of tipping: tipping results from an incremental increase in connectivity.

Random Graphs: Which Properties Tip?
Connected component of size > N/2:
- threshold function is m(N) = N/2 (or p ~ 1/N)
- note: full connectivity impossible
Fully connected:
- threshold function is m(N) = (N/2) log(N) (or p ~ log(N)/N)
- the network remains extremely sparse: only ~ log(N) edges per vertex
Small diameter:
- threshold is m(N) ~ N^(3/2) for diameter 2 (or p ~ 2/sqrt(N))
- fraction of possible edges still ~ 2/sqrt(N) → 0
- generates very small worlds

Erdos-Renyi Summary
A model in which all connections are equally likely: each of the N(N-1)/2 edges is chosen randomly and independently.
As we add edges, a precise sequence of events unfolds:
- graph acquires a giant component
- graph becomes connected
- graph acquires small diameter
Many properties appear very suddenly (tipping, thresholds).
All statements are mathematically precise.
But is this how natural networks form? If not, which aspects are unrealistic? Maybe not all edges are equally likely!

Erdos-Renyi: Clustering Coefficient
Generate a network G according to G(N,p).
Examine a typical vertex u in G:
- choose u at random among all vertices in G
- what do we expect c(u) to be?
Answer: exactly p!
In G(N,m), expect c(u) to be $2m / (N(N-1))$.
In both cases, c(u) is entirely determined by the overall density.
This is the baseline for comparison with more clustered models: Erdos-Renyi has no bias towards clustered or local edges.
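The Erdos-Renyi statements above (binomial versus Poisson degree distribution, the giant-component threshold near p ~ 1/N, and c(u) close to p) can be checked empirically. The following is a minimal sketch using only the Python standard library; the function names (`gnp_graph`, `binomial_pmf`, `poisson_pmf`, `clustering`, `largest_component`) and the chosen sizes are illustrative assumptions, not part of the lecture.

```python
import math
import random
from itertools import combinations


def gnp_graph(n, p, rng):
    """Sample G(n, p): each of the n(n-1)/2 possible edges is included
    independently with probability p. Returns an adjacency dict of sets."""
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def binomial_pmf(k, n, p):
    """B(n; k; p) as written on the slide."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, z):
    """P(k; z) as written on the slide, with z = n * p."""
    return z**k * math.exp(-z) / math.factorial(k)


def clustering(adj, u):
    """c(u): fraction of pairs of neighbours of u that are linked to each other.
    Vertices with fewer than two neighbours get 0.0 (a convention assumed here)."""
    nbrs = sorted(adj[u])
    if len(nbrs) < 2:
        return 0.0
    linked = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return linked / math.comb(len(nbrs), 2)


def largest_component(adj):
    """Size of the largest connected component (iterative depth-first search)."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            x = stack.pop()
            size += 1
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        best = max(best, size)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    n = 1000
    for c in (0.2, 1.0, 4.0):  # p = c / n: below, at and above the 1/N threshold
        g = gnp_graph(n, c / n, rng)
        print(f"p = {c}/N: largest component has {largest_component(g)} of {n} vertices")
    g = gnp_graph(300, 0.1, rng)
    mean_c = sum(clustering(g, u) for u in g) / len(g)
    print(f"mean clustering coefficient {mean_c:.3f} for p = 0.1")
```

Running the demo for a few values of c = p·N gives a feel for how abruptly the largest component grows once p passes 1/N, and how closely the mean clustering coefficient tracks the edge density p.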


Watts' Models: Caveman and Solaria
Erdos-Renyi:
- sharing a common neighbor makes two vertices no more likely to be directly connected than two very distant vertices
- every edge appears entirely independently of existing structure
But in many settings, the opposite is true:
- you tend to meet new friends through your old friends
- two web pages pointing to a third might share a topic
- two companies selling goods to a third are in related industries
Watts' Caveman world: the overall density of edges is low, but two vertices with a common neighbor are likely connected.
Watts' Solaria world: the overall density of edges is low; there is no special bias towards local edges, like Erdos-Renyi.

Making it (Somewhat) Precise: the α-model
The α-model has the following parameters or "knobs":
- N: size of the network to be generated
- k: the average degree of a vertex in the network to be generated
- p: the default probability that two vertices are connected
- α: adjustable parameter dictating the bias towards local connections
For any vertices u and v, define m(u,v) to be the number of common neighbors (so far).
Key quantity: the propensity R(u,v) of u to connect to v:
- if m(u,v) >= k, R(u,v) = 1 (share too many friends not to connect)
- if m(u,v) = 0, R(u,v) = p (no mutual friends, no bias to connect)
- else, $R(u,v) = p + (m(u,v)/k)^{\alpha} (1-p)$
Generate the network incrementally using R(u,v) as the edge probability; details omitted.
Note: α = infinity is like Erdos-Renyi (but not exactly).

Small Worlds and Occam's Razor
For small α, the model should generate large clustering coefficients:
- we "programmed" the model to do so
- Watts claims that proving precise statements is hard
But we do not want a new model for every little property:
- Erdos-Renyi: small diameter
- α-model: high clustering coefficient
In the interests of Occam's Razor, we would like to find a single, simple model of network generation that simultaneously captures many properties.
Watts' small world: small diameter and high clustering.

Watts-Strogatz Model
[Figure: C(p), the clustering coefficient, and L(p), the average path length (Watts and Strogatz, Nature 393, 440 (1998))]

Meanwhile, Back in the Real World
Watts examines three real networks as case studies:
- the Kevin Bacon graph
- the Western states power grid
- the nervous system
For each of these networks, he:
- computes its size, diameter, and clustering coefficient
- compares diameter and clustering to the best Erdos-Renyi approximation
- shows that the best α-model approximation is better
- it is important to be fair to each model by finding its best fit
Overall moral: if we care only about diameter and clustering, α is better than p.

World Wide Web
Expected result:
- <k> ~ 6
- P(k=500) ~ 10^-99
- N_WWW ~ 10^9, so N(k=500) ~ 10^-90
Real result:
- γ_out = 2.45, γ_in = 2.1
- P_out(k) ~ k^(-γ_out), P(k=500) ~ 10^-6
- P_in(k) ~ k^(-γ_in)
- N_WWW ~ 10^9, so N(k=500) ~ 10^3
J. Kleinberg, et al., Proceedings of the ICCC (1999)
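The propensity R(u,v) of the α-model is simple enough to write down directly. The sketch below implements the three cases from the slide. Since the slides omit the details of the incremental generation, `grow_alpha_graph` is a hypothetical, simplified stand-in (random pairs accepted with probability R until the target average degree is reached) and should not be read as Watts' actual procedure. All function and parameter names are assumptions made for illustration.

```python
import math
import random


def propensity(m_uv, k, p, alpha):
    """R(u, v) from the slide, where m_uv = m(u, v) is the current number of
    common neighbours, k the average degree and p the default probability."""
    if m_uv >= k:
        return 1.0  # share too many friends not to connect
    if m_uv == 0:
        return p  # no mutual friends, no bias to connect
    return p + (m_uv / k) ** alpha * (1.0 - p)


def grow_alpha_graph(n, k, p, alpha, rng):
    """Hypothetical generator (an assumption, NOT Watts' procedure): draw random
    vertex pairs and accept each as an edge with probability R(u, v), until the
    graph has n * k // 2 edges, i.e. average degree about k."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1] so that the loop can terminate")
    if not 0 < k <= n - 1:
        raise ValueError("k must be between 1 and n - 1")
    adj = {v: set() for v in range(n)}
    target_edges = n * k // 2
    edges = 0
    while edges < target_edges:
        u, v = rng.sample(range(n), 2)  # two distinct vertices
        if v in adj[u]:
            continue  # edge already present
        common = len(adj[u] & adj[v])  # m(u, v) "so far"
        if rng.random() < propensity(common, k, p, alpha):
            adj[u].add(v)
            adj[v].add(u)
            edges += 1
    return adj


if __name__ == "__main__":
    # How alpha shapes the propensity for k = 10, p = 0.01
    for alpha in (0.1, 1.0, 10.0, math.inf):
        row = [round(propensity(m, 10, 0.01, alpha), 3) for m in (0, 1, 5, 9, 10)]
        print(f"alpha = {alpha}: R for m = 0, 1, 5, 9, 10 -> {row}")
```

For small α a single mutual friend already pushes R(u,v) close to 1 (Caveman-like behaviour), while for large α the propensity stays near p until m(u,v) approaches k (Solaria-like behaviour). Even for α = infinity the rule R = 1 for m(u,v) >= k remains, which is one way to read the remark "like Erdos-Renyi (but not exactly)".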


What does that mean?
[Figure: Poisson degree distribution (exponential network) versus power-law degree distribution (scale-free network)]

Scale-free Networks
The number of nodes (N) is not fixed. Networks continuously expand by the addition of new nodes:
- WWW: addition of new nodes
- Citation: publication of new papers
The attachment is not uniform. A node is linked with higher probability to a node that already has a large number of links:
- WWW: new documents link to well-known sites (CNN, Yahoo, Google)
- Citation: well-cited papers are more likely to be cited again

Scale-Free Networks: the Barabasi-Albert Model
Growth: starting with a small number (m_0) of nodes, at every time step we add a new node with m (≤ m_0) edges that link the new node to m different nodes already present in the system.
Preferential attachment: the probability Π that a new node will be connected to node i depends on the degree k_i of node i, such that
$\Pi(k_i) = \frac{k_i}{\sum_j k_j}$
After t time steps the network has N = t + m_0 nodes and m·t edges.
Vertices j with high degree are likely to get more links ("rich get richer").
The network evolves into a stationary scale-free state with $P(k) \sim k^{-\gamma}$, $\gamma = 3$.

Scale-Free Networks: Properties
Preferential attachment explains:
- heavy-tailed degree distributions
- small diameter (~ log(N), via "hubs")
Small average path length.
Node degree correlations: the dynamical process that creates a scale-free network builds up nontrivial correlations between the degrees of connected nodes.
The clustering coefficient is ~5 times larger than that of a random graph, but the model will not generate a clustering coefficient as high as the Watts-Strogatz model.
The clustering coefficient decreases with network size, following approximately a power law, i.e. there is no bias towards local connectivity, but towards hubs.

Network Robustness
The accidental failure of a number of nodes in a random network can fracture the system into noncommunicating islands.
Scale-free networks are more robust in the face of such failures.
Scale-free networks are highly vulnerable to a coordinated attack against their hubs.
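The growth and preferential-attachment rules of the Barabasi-Albert model can be sketched in a few lines. The code below is an illustrative implementation under stated assumptions: the function name `barabasi_albert` and its signature are made up for this example, and because the m_0 seed nodes start without edges (so all degrees are zero), the first new node picks its m targets uniformly at random, which is a choice of this sketch rather than something the slides specify.

```python
import random
from collections import Counter


def barabasi_albert(m0, m, t, rng):
    """Grow a network for t time steps from m0 isolated seed nodes.
    Each step adds one node with m edges to m different existing nodes,
    chosen with probability proportional to their current degree."""
    if not 1 <= m <= m0:
        raise ValueError("need 1 <= m <= m0")
    adj = {v: set() for v in range(m0)}
    # 'stubs' lists node i exactly k_i times, so a uniform draw from it
    # selects node i with probability k_i / sum_j k_j.
    stubs = []
    for step in range(t):
        new = m0 + step
        if stubs:
            targets = set()
            while len(targets) < m:  # redraw until m different nodes are found
                targets.add(rng.choice(stubs))
        else:
            # Assumption: all degrees are zero at the first step, so choose uniformly.
            targets = set(rng.sample(range(m0), m))
        adj[new] = set()
        for v in targets:
            adj[new].add(v)
            adj[v].add(new)
            stubs.extend((new, v))
    return adj


if __name__ == "__main__":
    g = barabasi_albert(m0=5, m=3, t=5000, rng=random.Random(0))
    degrees = [len(nbrs) for nbrs in g.values()]
    histogram = Counter(degrees)
    print("nodes:", len(g), "edges:", sum(degrees) // 2, "max degree:", max(degrees))
    for k in (3, 6, 12, 24, 48):
        print(f"fraction of nodes with degree {k}: {histogram[k] / len(g):.4f}")
```

After t steps the graph has t + m_0 nodes and m·t edges, as stated on the slide. With m < m_0, seed nodes that are not chosen in the first step keep degree zero forever, because their attachment probability is zero; this is a property of the rule as written, not of the sketch.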
