Basic Assumptions. Chapter 4. Link Analysis & Authority Ranking


ISWeb - Information Systems & Semantic Web, University of Koblenz-Landau
Chapter 4: Link Analysis & Authority Ranking
Sergej Sizov, Web Mining, Summer Term 2007
Slide footer: Information Retrieval, Summer Term 2008, Sergej Sizov, Chapter 4: Link Analysis and Authority Ranking

Basic Assumptions
- Hyperlinks contain information about the human judgment of a site.
- The more incoming links a site has, the more important it is judged to be (Bray 1996).
- The visibility of a site is measured by the number of other sites pointing to it.
- The luminosity of a site is measured by the number of other sites to which it points.
- Limitation: these counts fail to capture the relative importance of different parent (child) sites.

Link Analysis: Early Approaches (1)
(Mark 1988) To calculate the score S of a document at vertex v:
$S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$
- v: a vertex in the hypertext graph G = (V, E)
- S(v): the global score
- s(v): the score if the document is isolated
- ch[v]: the children of the document at vertex v
Limitations:
- Requires G to be a directed acyclic graph (DAG).
- If v has a single link to w, then S(v) > S(w).
- If v has a long path to w and s(v) < s(w), then S(v) > S(w), which is unreasonable.

Link Analysis: Early Approaches (2)
(Marchiori 1997) Hyper information should complement textual information to obtain the overall information:
$S(v) = s(v) + h(v)$
- S(v): overall information
- s(v): textual information
- h(v): hyper information
$h(v) = \sum_{w \in ch[v]} F^{r(v,w)} S(w)$
- F: a fading constant, $F \in (0, 1)$
- r(v, w): the rank of w after sorting the children of v by S(w)
This is a remedy for the previous approach (Mark 1988).

Improving Precision by Authority Scores
Goal: higher ranking of URLs with high authority regarding volume, significance, freshness, and authenticity of content, in order to improve the precision of search results.
Approaches (all interpreting the Web as a directed graph G):
- citation or impact rank (q), based on indegree (q)
- PageRank (by Lawrence Page)
- HITS algorithm (by Jon Kleinberg)
Combining relevance and authority ranking:
- by weighted sum with appropriate coefficients (Google)
- by initial relevance ranking and iterative improvement via authority ranking (HITS)

Web Structure: Small Diameter
- Small World Phenomenon (Milgram 1967)
- Studies on Internet Connectivity (1999); figure sources: Bill Cheswick and Hal Burch; KC Claffy
- These studies suggested a small world phenomenon: the Web is a low-diameter graph, where diameter = max {shortest path (x,y) | nodes x and y}.
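The Mark 1988 score defined above can be evaluated by a memoized recursion over the children of each vertex. The following is only a minimal sketch: the function name `mark_scores` and the three-document graph with its local scores are hypothetical, and the graph is assumed to be a DAG, as the approach requires.

```python
from functools import lru_cache

def mark_scores(children, local_score):
    """Global score S(v) = s(v) + (1/|ch[v]|) * sum of S(w) over children w.

    children: dict vertex -> list of child vertices (must describe a DAG)
    local_score: dict vertex -> s(v), the score of the isolated document
    """
    @lru_cache(maxsize=None)
    def S(v):
        kids = children.get(v, [])
        if not kids:                      # leaf: global score equals local score
            return local_score[v]
        return local_score[v] + sum(S(w) for w in kids) / len(kids)

    return {v: S(v) for v in local_score}

# Hypothetical example: a -> b, a -> c, b -> c
children = {"a": ["b", "c"], "b": ["c"], "c": []}
local_score = {"a": 1.0, "b": 2.0, "c": 4.0}
scores = mark_scores(children, local_score)
```

In this toy graph b has a single link to c, so S(b) = s(b) + S(c) exceeds S(c), which is exactly the limitation listed for this approach.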

Web Structure: Connected Components
Study of the Web graph (Broder et al. 2000). SCC = strongly connected component. The strongly connected core tends to have a small diameter.
Source: A.Z. Broder et al., WWW 2000.

Web Structure: Power-Law Degrees
Study of the Web graph (Broder et al. 2000): degrees are power-law distributed,
$P[\text{degree} = k] \sim (1/k)^{\alpha}$ with $\alpha \approx 2.1$ for indegrees and $\alpha \approx 2.7$ for outdegrees.

The Random Surfer Model
Idea: incoming links are endorsements and increase page authority; authority is higher if links come from high-authority pages.
- Authority (page q) = stationary probability of visiting q
- Random walk: uniformly random choice of links + random jumps

PageRank as Eigenvector of a Stochastic Matrix
- A stochastic matrix is an $n \times n$ matrix M with row sum $\sum_{j=1..n} M_{ij} = 1$ for each row i.
- The random surfer follows a stochastic matrix.
- Theorem: for every stochastic matrix M, all eigenvalues $\lambda$ satisfy $|\lambda| \le 1$, and there is an eigenvector x with eigenvalue 1 such that $x \ge 0$ and $\|x\| = 1$.
- This suggests the power iteration $x^{(i+1)} = M^T x^{(i)}$.
- But: the real Web graph has sinks, may be periodic, and is not strongly connected.

Random Surfer: Rank Sink Problem
- In general, many Web pages have no inlinks or no outlinks, which results in dangling edges in the graph.
- No parent: rank 0; iterating with $M^T$ converges to a matrix whose last column is all zero.
- No children: no solution; iterating with $M^T$ converges to the zero matrix.

Google's PageRank
Same idea as the random surfer model: Authority (page q) = stationary probability of visiting q in a random walk with uniformly random choice of links plus random jumps (taken with probability $\varepsilon$).
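The power iteration with random jumps described above can be sketched in a few lines. This is an illustrative sketch, not Google's implementation: the function name `pagerank`, the default ε = 0.2, the three-page graph, and the choice to spread the rank of pages without outlinks uniformly over all pages are assumptions made here to sidestep the rank sink problem.

```python
def pagerank(links, eps=0.2, iterations=100):
    """Stationary visiting probabilities of a random surfer.

    links: dict page -> list of pages it links to (every target must be a key)
    eps:   probability of a random jump to a uniformly chosen page
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}            # uniform start vector
    for _ in range(iterations):
        nxt = {v: eps / n for v in nodes}         # mass arriving via random jumps
        for v in nodes:
            out = links[v]
            if out:                               # follow outgoing links uniformly
                share = (1.0 - eps) * rank[v] / len(out)
                for w in out:
                    nxt[w] += share
            else:                                 # assumption: sinks jump uniformly
                share = (1.0 - eps) * rank[v] / n
                for w in nodes:
                    nxt[w] += share
        rank = nxt
    return rank

# Hypothetical three-page graph
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(links)
```

Because every step redistributes exactly the mass it takes, the vector stays a probability distribution throughout the iteration.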

Markov Chains
Example (weather): states 0: sunny, 1: cloudy, 2: rainy. The stationary probabilities $p_0, p_1, p_2$ solve the balance equations of the chain (for instance $p_0 = 0.8\,p_0 + \dots$, with the remaining coefficients read off the transition diagram) together with $p_0 + p_1 + p_2 = 1$.
- state set: finite or infinite
- state transition probabilities: $p_{ij}$
- time: discrete or continuous
- state probabilities in step t: $p_i^{(t)} = P[S(t) = i]$
- Markov property: $P[S(t) = i \mid S(0), \dots, S(t-1)] = P[S(t) = i \mid S(t-1)]$
- We are interested in the stationary state probabilities:
  $p_j := \lim_{t \to \infty} p_j^{(t)} = \lim_{t \to \infty} \sum_k p_k^{(t-1)} p_{kj}$, so that $p_j = \sum_k p_k p_{kj}$ and $\sum_j p_j = 1$
- These are guaranteed to exist for irreducible, aperiodic, finite Markov chains.

Markov Chains: Formal Definition
A stochastic process is a family of random variables $\{X(t) \mid t \in T\}$. T is called the parameter space, and the domain M of X(t) is called the state space. T and M can be discrete or continuous.
A stochastic process is called a Markov process if for every choice of $t_1, \dots, t_{n+1}$ from the parameter space and every choice of $x_1, \dots, x_{n+1}$ from the state space the following holds:
$P[X(t_{n+1}) = x_{n+1} \mid X(t_1) = x_1 \wedge X(t_2) = x_2 \wedge \dots \wedge X(t_n) = x_n] = P[X(t_{n+1}) = x_{n+1} \mid X(t_n) = x_n]$
A Markov process with discrete state space is called a Markov chain. A canonical choice of the state space is the natural numbers.
Notation for Markov chains with discrete parameter space: $X_n$ rather than $X(t_n)$, with n = 0, 1, 2, ...

Properties of Markov Chains with Discrete Parameter Space (1)
The Markov chain $X_n$ with discrete parameter space is
- homogeneous if the transition probabilities $p_{ij} := P[X_{n+1} = j \mid X_n = i]$ are independent of n
- irreducible if every state is reachable from every other state with positive probability:
  $\sum_{n=1}^{\infty} P[X_n = j \mid X_0 = i] > 0$ for all i, j
- aperiodic if every state i has period 1, where the period of i is the greatest common divisor of all (recurrence) values n for which
  $P[X_n = i \wedge X_k \ne i \text{ for } k = 1, \dots, n-1 \mid X_0 = i] > 0$

Properties of Markov Chains with Discrete Parameter Space (2)
The Markov chain $X_n$ with discrete parameter space is
- positive recurrent if for every state i the recurrence probability is 1 and the mean recurrence time is finite:
  $\sum_{n=1}^{\infty} P[X_n = i \wedge X_k \ne i \text{ for } k = 1, \dots, n-1 \mid X_0 = i] = 1$
  $\sum_{n=1}^{\infty} n \, P[X_n = i \wedge X_k \ne i \text{ for } k = 1, \dots, n-1 \mid X_0 = i] < \infty$
- ergodic if it is homogeneous, irreducible, aperiodic, and positive recurrent.

Results on Markov Chains with Discrete Parameter Space (1)
For the n-step transition probabilities $p_{ij}^{(n)} := P[X_n = j \mid X_0 = i]$ the following holds:
$p_{ij}^{(n)} = \sum_k p_{ik}^{(n-1)} p_{kj}$ with $p_{ij}^{(1)} := p_{ij}$
$p_{ij}^{(n)} = \sum_k p_{ik}^{(n-l)} p_{kj}^{(l)}$ for $1 \le l \le n-1$
In matrix notation: $P^{(n)} = P^n$
For the state probabilities after n steps, $\pi_j^{(n)} := P[X_n = j]$, the following holds (Chapman-Kolmogorov equation):
$\pi_j^{(n)} = \sum_i \pi_i^{(0)} p_{ij}^{(n)}$ with initial state probabilities $\pi_i^{(0)}$
In matrix notation: $\Pi^{(n)} = \Pi^{(0)} P^{(n)}$

Results on Markov Chains with Discrete Parameter Space (2)
Every homogeneous, irreducible, aperiodic Markov chain with a finite number of states is positive recurrent and ergodic.
For every ergodic Markov chain there exist stationary state probabilities $\pi_j := \lim_{n \to \infty} \pi_j^{(n)}$. These are independent of $\Pi^{(0)}$ and are the solutions of the following system of linear equations (balance equations):
$\pi_j = \sum_i \pi_i p_{ij}$ for all j, and $\sum_j \pi_j = 1$
In matrix notation (with $1 \times n$ row vector $\Pi$): $\Pi = \Pi P$, $\Pi \mathbf{1} = 1$
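The stationary state probabilities of an ergodic chain can be approximated by repeatedly applying $\Pi^{(n+1)} = \Pi^{(n)} P$ until the vector stops changing. The sketch below assumes a hypothetical three-state transition matrix chosen only for illustration (its entries are not taken from the slides); the function name `stationary` is likewise an assumption.

```python
def stationary(P, steps=1000):
    """Approximate the stationary row vector Pi with Pi = Pi * P.

    P: row-stochastic transition matrix as a list of rows, P[i][j] = p_ij
    """
    n = len(P)
    pi = [1.0 / n] * n                     # any initial distribution works
    for _ in range(steps):
        # one step of Pi^(n+1) = Pi^(n) P
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical ergodic chain with three states (illustrative values only)
P = [
    [0.8, 0.2, 0.0],
    [0.5, 0.0, 0.5],
    [0.4, 0.3, 0.3],
]
pi = stationary(P)
```

For this particular matrix the balance equations can also be solved by hand, giving the fractions 55/79, 14/79 and 10/79, which the iteration reproduces independently of the start vector.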

PageRank as a Markov Chain Model
Model a random walk of a Web surfer as follows:
- follow outgoing hyperlinks with uniform probabilities
- perform a random jump with probability $\varepsilon$
This yields an ergodic Markov chain. The PageRank of a URL is the stationary visiting probability of the URL in this Markov chain.
Further generalizations have been studied (e.g. random walk with back button).
Drawback of the PageRank method: PageRank is query-independent and orthogonal to relevance.

Example: PageRank Computation
For a three-page graph with given $\varepsilon$ and transition matrix P, iterate $\Pi^{(k+1)} = \Pi^{(k)} P$ starting from $\Pi^{(0)}$, or solve the balance equations $\pi_j = \sum_i \pi_i p_{ij}$ together with $\pi_1 + \pi_2 + \pi_3 = 1$.

HITS Algorithm: Hyperlink-Induced Topic Search (1)
Idea: determine
- good content sources: authorities (high indegree)
- good link sources: hubs (high outdegree)
Find
- better authorities that have good hubs as predecessors
- better hubs that have good authorities as successors
For the Web graph G = (V, E) define for nodes $p, q \in V$:
- authority score $x_q = \sum_{(p,q) \in E} y_p$
- hub score $y_p = \sum_{(p,q) \in E} x_q$

HITS Algorithm (2)
Authority and hub scores in matrix notation, with adjacency matrix A:
$x = A^T y$ and $y = A x$
Iteration:
$x := A^T y := A^T A x$
$y := A x := A A^T y$
Using linear algebra, we can prove that x and y converge, and that x and y are eigenvectors of $A^T A$ and $A A^T$, respectively.
Intuitive interpretation:
- $M^{(auth)} := A^T A$ is the cocitation matrix: $M^{(auth)}_{ij}$ is the number of nodes that point to both i and j.
- $M^{(hub)} := A A^T$ is the bibliographic-coupling matrix: $M^{(hub)}_{ij}$ is the number of nodes to which both i and j point.

Implementation of the HITS Algorithm
1) Determine a sufficient number of root pages via relevance ranking (e.g. using tf*idf ranking).
2) Add all successors of root pages.
3) For each root page add up to d predecessors.
4) Compute iteratively the authority and hub scores of this base set, with initialization $x_q := y_p := 1 / |\text{base set}|$ and normalization after each iteration. This converges to the principal eigenvector (the eigenvector with the largest eigenvalue, in the case of multiplicity 1).
5) Return pages in descending order of authority scores (e.g. the largest elements of vector x).
Drawback of the HITS algorithm: relevance ranking within the root set is not considered.

Example: HITS Algorithm
Figure: root set and base set.
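The iterative computation in step 4 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `hits` and `normalize`, the Euclidean normalization, the fixed iteration count and the toy edge list are choices made here, not details given on the slides.

```python
import math

def normalize(scores):
    """Scale a score dict to unit Euclidean length (assumed normalization)."""
    norm = math.sqrt(sum(s * s for s in scores.values()))
    return {v: (s / norm if norm > 0 else 0.0) for v, s in scores.items()}

def hits(edges, iterations=50):
    """Authority scores x and hub scores y for a list of directed edges (p, q)."""
    nodes = sorted({v for edge in edges for v in edge})
    auth = {v: 1.0 / len(nodes) for v in nodes}   # x_q := 1/|base set|
    hub = dict(auth)                              # y_p := 1/|base set|
    for _ in range(iterations):
        new_auth = {v: 0.0 for v in nodes}
        for p, q in edges:                        # x_q = sum of y_p over (p,q) in E
            new_auth[q] += hub[p]
        new_hub = {v: 0.0 for v in nodes}
        for p, q in edges:                        # y_p = sum of x_q over (p,q) in E
            new_hub[p] += new_auth[q]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical base set: three hubs pointing to two authorities
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
auth, hub = hits(edges)
```

In the toy graph a1 is pointed to by all three hubs and therefore ends up as the stronger authority, while h1, which points to both authorities, ends up as the strongest hub.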

HITS Improvements (Bharat and Henzinger, 1998)
HITS problems:
1) A document can contain many identical links to the same document on another host.
2) Links are generated automatically (e.g. messages posted on newsgroups).
Solutions:
1) Assign weights to identical multiple edges that are inversely proportional to their multiplicity.
2) Prune irrelevant nodes, or regulate the influence of a node with a relevance weight.

Improved HITS Algorithm
Potential weaknesses of the HITS algorithm:
- irritating links (automatically generated links, spam, etc.)
- topic drift (e.g. from "Jaguar car" to "car" in general)
Improvement:
- Introduce edge weights: 0 for links within the same host; 1/k with k links from k URLs of the same host to 1 URL (xweight); 1/m with m links from 1 URL to m URLs on the same host (yweight).
- Consider relevance weights w.r.t. the query topic (e.g. tf*idf).
Iterative computation of
- authority score $x_q = \sum_{(p,q) \in E} y_p \cdot \text{topic score}(p) \cdot \text{xweight}(p,q)$
- hub score $y_p = \sum_{(p,q) \in E} x_q \cdot \text{topic score}(q) \cdot \text{yweight}(p,q)$

SALSA
SALSA (Lempel, Moran 2001) is a probabilistic extension of the HITS algorithm.
- The random walk is carried out by following hyperlinks both in the forward and in the backward direction.
- Two separate random walks: hub walk and authority walk.

Random Walks
Hub walk:
- Follow a Web link from a page $u_h$ to a page $w_a$ (a forward link), and then
- immediately traverse a backlink going from $w_a$ to $v_h$, where $(u,w) \in E$ and $(v,w) \in E$.
Authority walk:
- Follow a Web link backward from a page $w_a$ to a page $u_h$ (a backward link), and then
- immediately traverse a forward link going from $u_h$ to a page $z_a$, where $(u,w) \in E$ and $(u,z) \in E$.
Lempel and Moran (2001) showed theoretically that SALSA weights are more robust than HITS weights in the presence of the Tightly Knit Community (TKC) effect. This effect occurs when a small collection of pages (related to a given topic) is connected so that every hub links to every authority; it includes the mutual reinforcement effect as a special case. The pages in a community connected in this way can be ranked highly by HITS, higher than pages in a much larger collection where only some hubs link to some authorities.

Forming a Bipartite Graph in SALSA
SALSA: Random Walk on Hubs and Authorities
View each node v of the link graph as two nodes $v_h$ and $v_a$. Construct the bipartite undirected graph G'(V', E') from the link graph G(V, E):
$V' = \{v_h \mid v \in V \text{ and outdegree}(v) > 0\} \cup \{v_a \mid v \in V \text{ and indegree}(v) > 0\}$
$E' = \{(v_h, w_a) \mid (v,w) \in E\}$
Stochastic hub matrix H:
$h_{ij} = \sum_k \frac{1}{\text{degree}(i_h)} \cdot \frac{1}{\text{degree}(k_a)}$ for hubs i, j and k ranging over all nodes with $(i_h, k_a), (k_a, j_h) \in E'$
Stochastic authority matrix A:
$a_{ij} = \sum_k \frac{1}{\text{degree}(i_a)} \cdot \frac{1}{\text{degree}(k_h)}$ for authorities i, j and k ranging over all nodes with $(i_a, k_h), (k_h, j_a) \in E'$
The corresponding Markov chains are ergodic on each connected component. The stationary solutions for these Markov chains are:
$\pi[v_h] \sim \text{outdegree}(v)$ for H and $\pi[v_a] \sim \text{indegree}(v)$ for A
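The claim that the authority chain has a stationary distribution proportional to the indegree can be checked numerically on a small graph. The sketch below builds the stochastic authority matrix from the definition above; the function name `salsa_authority_matrix`, the dict-of-dicts representation and the five-edge example graph are hypothetical, and the edge list is assumed to contain no duplicates.

```python
def salsa_authority_matrix(edges):
    """Authority transition matrix a_ij of SALSA for directed edges (hub, authority)."""
    in_nb, out_nb = {}, {}
    for u, w in edges:
        out_nb.setdefault(u, []).append(w)     # neighbours of hub side u_h
        in_nb.setdefault(w, []).append(u)      # neighbours of authority side w_a
    auths = sorted(in_nb)
    A = {i: {j: 0.0 for j in auths} for i in auths}
    for i in auths:
        for k in in_nb[i]:                     # backward step i_a -> k_h
            for j in out_nb[k]:                # forward step  k_h -> j_a
                A[i][j] += 1.0 / len(in_nb[i]) / len(out_nb[k])
    return auths, A, in_nb

# Hypothetical link graph
edges = [("u", "x"), ("u", "y"), ("v", "x"), ("w", "x"), ("w", "z")]
auths, A, in_nb = salsa_authority_matrix(edges)

# Candidate stationary distribution: indegree divided by the number of edges
pi = {a: len(in_nb[a]) / len(edges) for a in auths}
pi_next = {j: sum(pi[i] * A[i][j] for i in auths) for j in auths}
```

Here `pi_next` coincides with `pi`, so the indegree-proportional vector is indeed a fixed point of the authority walk on this example.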

Comparison and Extensions
The literature contains a plethora of variations on PageRank and HITS. Key points are:
- mutual reinforcement between hubs and authorities
- re-scaling of edge weights (normalization)
Unified notation (for a link graph with n nodes):
- L: $n \times n$ link matrix, $L_{ij} = 1$ if there is an edge (i,j), 0 else
- din: $n \times 1$ vector with $din_i$ = indegree(i); Din ($n \times n$) = diag(din)
- dout: $n \times 1$ vector with $dout_i$ = outdegree(i); Dout ($n \times n$) = diag(dout)
- x: $n \times 1$ authority vector
- y: $n \times 1$ hub vector
- Iop: operation applied to incoming links
- Oop: operation applied to outgoing links

Unified Framework for HITS, PageRank & Co
- HITS: x = Iop(y), y = Oop(x) with $Iop(y) = L^T y$, $Oop(x) = L x$
- PageRank: x = Iop(x) with $Iop(x) = P^T x$, where $P^T = L^T Dout^{-1}$ or $P^T = \alpha L^T Dout^{-1} + (1 - \alpha)(1/n)\, e e^T$
- SALSA (PageRank-style computation with mutual reinforcement): x = Iop(y) with $Iop(y) = P^T y$ and $P^T = L^T Dout^{-1}$; y = Oop(x) with $Oop(x) = Q x$ and $Q = L\, Din^{-1}$
Other models of link analysis can be cast into this framework, too.

Link Analysis: Stability
Are link analysis algorithms based on eigenvectors stable, in the sense that results do not change significantly when the connectivity of a portion of the graph is changed arbitrarily? How will such a change affect the results of the algorithms?

Stability of HITS (Ng et al. 2001)
- There is a bound on the number of hyperlinks that can be added to or deleted from one page without affecting the authority or hubness weights.
- It is possible to perturb a symmetric matrix by a quantity that grows as $\delta$ and that produces a constant perturbation of the dominant eigenvector.
- $\delta$: eigengap $\lambda_1 - \lambda_2$
- d: maximum outdegree of G

Stability of PageRank
- Ng et al. (2001): V* is the set of vertices touched by the perturbation.
- The parameter $\varepsilon$ of the mixture model has a stabilizing role.
- If the set of pages affected by the perturbation has a small rank, the overall change will also be small.
- Tighter bound by Bianchini et al., using a quantity $\delta(j)$ that depends on the edges incident on j.

Link Analysis: Advanced Issues
- Topic-specific authority ranking
- Personalized PageRank: personalization vectors, query logs, click streams
- Efficiency issues
- Online Page Importance (OPIC)
- Time-aware link analysis

Web Models: Understanding Large Graphs
- What are the statistics of real-life networks?
- Can we explain how the networks were generated?

Small World Experiment (Six Degrees of Separation)
- Stanley Milgram, 1967: Letters were handed out to people in Nebraska to be sent to a target (a stock broker) in Boston. People were instructed to pass on the letters to someone they knew on a first-name basis. The letters that reached the destination followed paths of average length around 6.
- Duncan Watts, 2001: Milgram's experiment was recreated on the Internet, using an e-mail message as the "package" that needed to be delivered, with 48,000 senders and 19 targets (in 157 countries). The average number of intermediaries was also around 6.
- See also: the Kevin Bacon game, the Erdős number, etc.

Kevin Bacon Experiment
- Craig Fass, Brian Turtle and Mike Ginelli, 1994: motivated by Bacon's most recent movie "The Air Up There" and a discussion of his career.
- Vertices: actors and actresses.
- Edge between u and v if they appeared in a movie together.
- For Kevin Bacon the slide reports the number of movies, the number of co-actors, and the average separation.

Is Kevin Bacon the most connected actor?
Ranking by average distance (columns: rank, name, average distance, number of movies, number of links): Rod Steiger, Donald Pleasence, Martin Sheen, Christopher Lee, Robert Mitchum, Charlton Heston, Eddie Albert, Robert Vaughn, Donald Sutherland, John Gielgud, Anthony Quinn, James Earl Jones, followed further down by Kevin Bacon.

Quantities of Interest
- Connected components: how many, and how large?
- Network diameter: maximum (worst-case) or average? Exclude infinite distances (disconnected components)? The small-world phenomenon.
- Clustering: to what extent do links tend to cluster locally? What is the balance between local and long-distance connections? What roles do the two types of links play?
- Degree distribution: what is the typical degree in the network? What is the overall distribution?

Graph Theory: Undirected Graph Notation
Graph G = (V, E), V = set of vertices, E = set of edges. Example figure: an undirected graph on five vertices, with E given as a set of unordered vertex pairs.

Graph Theory: Directed Graph Notation
Graph G = (V, E), V = set of vertices, E = set of edges. Example figure: a directed graph on five vertices, with E given as a set of ordered vertex pairs.

Undirected Graph: Degree Distribution
- degree d(i) of node i: number of edges incident on node i
- degree sequence: [d(1), d(2), d(3), d(4), d(5)]
- degree distribution: list of (degree, number of nodes with that degree) pairs

Directed Graph: In/Outdegrees
- in-degree $d_{in}(i)$ of node i: number of edges pointing to node i
- out-degree $d_{out}(i)$ of node i: number of edges leaving node i
- in-degree sequence and out-degree sequence: one entry per node, as in the undirected case

Degree Distributions
- frequency $f_k$ = fraction of nodes with degree k = probability that a randomly selected node has degree k
- Problem: find the probability distribution that best fits the observed data.

Power-Law Distributions
- The degree distributions of most real-life networks follow a power law: $p(k) = C k^{-\alpha}$
- Right-skewed / heavy-tail distribution: there is a non-negligible fraction of nodes that have very high degree (hubs).
- Scale-free: no characteristic scale, the average is not informative.
- By contrast, in a distribution that is highly concentrated around the mean, the probability of very high degree nodes is exponentially small.

Power-Law Signature
- A power-law distribution $p(k) = C k^{-\alpha}$ gives a line in the log-log plot: $\log p(k) = -\alpha \log k + \log C$
- Figure: frequency vs. degree, and log frequency vs. log degree with slope $-\alpha$.
- $\alpha$: power-law exponent (typically $2 \le \alpha \le 3$)

Power-Law: Examples
Figure taken from [Newman 2003].
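The log-log signature above suggests a simple way to read off the exponent: fit a straight line to (log k, log p(k)) and take the negated slope. The sketch below does this with ordinary least squares; the function name `fit_power_law` and the synthetic data are assumptions, and a plain least-squares fit on the log-log plot is used only for illustration, since it can be unreliable on noisy empirical tails.

```python
import math

def fit_power_law(freq):
    """Least-squares line through (log k, log p(k)); returns (alpha, C).

    freq: dict degree k -> frequency p(k); entries with k <= 0 or p(k) <= 0 are skipped
    """
    pts = [(math.log(k), math.log(f)) for k, f in freq.items() if k > 0 and f > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in pts)
             / sum((x - mean_x) ** 2 for x, _ in pts))
    intercept = mean_y - slope * mean_x        # equals log C
    return -slope, math.exp(intercept)

# Synthetic, noise-free data following p(k) = 2 * k^(-2.1)
freq = {k: 2.0 * k ** -2.1 for k in range(1, 51)}
alpha, C = fit_power_law(freq)
```

On exact power-law data the fit recovers the exponent and the constant up to floating-point error.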

Collective Statistics (M. Newman 2003)
Table of network statistics.

Clustering (Transitivity) Coefficient
Measures the density of triangles (local clusters) in the graph. There are two different ways to measure it.
The ratio of the means:
$C^{(1)} = \frac{\sum_i \text{triangles centered at node } i}{\sum_i \text{triples centered at node } i}$
Example: $C^{(1)} = \frac{3}{1 + 1 + 6} = \frac{3}{8}$

Clustering coefficient for node i:
$C_i = \frac{\text{triangles centered at node } i}{\text{triples centered at node } i}$
The mean of the ratios:
$C^{(2)} = \frac{1}{n} \sum_i C_i$
Example: $C^{(2)} = \frac{1}{5}\left(1 + 1 + \frac{1}{6}\right) = \frac{13}{30}$
The two clustering coefficients give different measures: $C^{(2)}$ increases with nodes of low degree.
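Both clustering coefficients can be computed directly from an adjacency structure by counting, for every node, the pairs of neighbours (triples) and the pairs that are themselves connected (triangles). The sketch below assumes a hypothetical five-node graph (a triangle 1-2-3 with two extra leaves attached to node 3) that is consistent with the example values above; nodes with fewer than two neighbours are assumed to contribute $C_i = 0$.

```python
from itertools import combinations

def clustering_coefficients(adj):
    """Return (C1, C2): ratio of the means and mean of the ratios.

    adj: dict node -> set of neighbours (undirected graph)
    """
    triangles, triples = {}, {}
    for v, nbrs in adj.items():
        pairs = list(combinations(sorted(nbrs), 2))   # triples centered at v
        triples[v] = len(pairs)
        triangles[v] = sum(1 for a, b in pairs if b in adj[a])
    total_triples = sum(triples.values())
    c1 = sum(triangles.values()) / total_triples if total_triples else 0.0
    local = [triangles[v] / triples[v] if triples[v] else 0.0 for v in adj]
    c2 = sum(local) / len(adj)
    return c1, c2

# Assumed example graph: triangle 1-2-3, plus leaves 4 and 5 attached to node 3
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3}, 5: {3}}
c1, c2 = clustering_coefficients(adj)
```

For this graph the two measures are 3/8 and 13/30: the low-degree nodes 1 and 2 each have $C_i = 1$ and pull the mean of the ratios upward.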

Collective Statistics (M. Newman 2003)
Table of network statistics.

Clustering Coefficient for Random Graphs
- The probability of two of your neighbors also being neighbors is p, independent of local structure.
- Clustering coefficient: C = p.
- When z is fixed: $C = z/n = O(1/n)$.

Graphs: Paths
- Path from node i to node j: a sequence of edges (directed or undirected) from node i to node j.
- Path length: number of edges on the path.
- Nodes i and j are connected if there is a path between them.
- Cycle: a path that starts and ends at the same node.

Graphs: Shortest Paths
Shortest path from node i to node j, also known as BFS path or geodesic path.

Measuring the Small World Phenomenon
- $d_{ij}$ = length of the shortest path between i and j
- Diameter: $d = \max_{i,j} d_{ij}$
- Characteristic path length: $\ell = \frac{1}{n(n-1)/2} \sum_{i > j} d_{ij}$
- Harmonic mean: $\ell^{-1} = \frac{1}{n(n-1)/2} \sum_{i > j} d_{ij}^{-1}$
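The three small-world measures above can be obtained from breadth-first searches started at every node. The following sketch assumes an undirected, unweighted graph given as a dict of neighbour sets; the function names and the five-node example graph are hypothetical. Unreachable pairs are given infinite distance, so they contribute 0 to the harmonic mean but make the diameter and characteristic path length infinite.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_statistics(adj):
    """Return (diameter, characteristic path length, inverse harmonic mean)."""
    nodes = sorted(adj)
    lengths = []
    for idx, i in enumerate(nodes):
        dist = bfs_distances(adj, i)
        for j in nodes[idx + 1:]:                 # each unordered pair once
            lengths.append(dist.get(j, float("inf")))
    pairs = len(lengths)                          # n(n-1)/2
    diameter = max(lengths)
    char_len = sum(lengths) / pairs
    inv_harmonic = sum(1.0 / d for d in lengths) / pairs
    return diameter, char_len, inv_harmonic

# Assumed example graph: triangle 1-2-3, plus leaves 4 and 5 attached to node 3
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4, 5}, 4: {3}, 5: {3}}
diameter, char_len, inv_harmonic = path_statistics(adj)
```

In this example five pairs are at distance 1 and five at distance 2, giving a diameter of 2, a characteristic path length of 1.5 and an inverse harmonic mean of 0.75.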

Degree Correlations
- Do high-degree nodes tend to link to high-degree nodes?
- Newman computes the correlation coefficient of the degrees $x_i$, $y_i$ of the two endpoints of an edge i.

Collective Statistics (M. Newman 2003)
Table of network statistics.

Graphs: Fully Connected Cliques
Clique $K_n$: a graph that has all possible $n(n-1)/2$ edges.

Directed Graph
- Strongly connected graph: there exists a path from every i to every j.
- Weakly connected graph: if the edges are made undirected, the graph is connected.

Summary: Real Network Properties
- Most nodes have only a small number of neighbors (degree), but there are some nodes with very high degree (power-law degree distribution): scale-free networks.
- If a node x is connected to y and z, then y and z are likely to be connected: high clustering coefficient.
- Most nodes are just a few edges away on average: small world networks.

A "Canonical" Natural Network has
- Few connected components: often only 1 or a small number, independent of network size.
- Small diameter: often a constant independent of network size (like 6), or perhaps growing only logarithmically with network size, or even shrinking? Typically excludes infinite distances.
- A high degree of clustering: considerably more so than for a random network; in tension with small diameter.
- A heavy-tailed degree distribution: a small but reliable number of high-degree vertices, often of power-law form.

Measuring and Modelling Network Properties
Around 1999:
- Watts and Strogatz, Dynamics and small-world phenomenon
- Faloutsos, On power-law relationships of the Internet topology
- Kleinberg et al., The Web as a graph
- Barabási and Albert, The emergence of scaling in real networks
Question: is it possible that there is a unifying underlying generative process?

Basics: the Erdős-Rényi Model G(N,p)
- All edges are equally probable and appear independently.
- Network size N > 1 and probability p define the distribution G(N,p): each edge (u,v) is chosen to appear with probability p, i.e. N(N-1)/2 trials of a biased coin flip.
- The usual regime of interest is when p ~ 1/N and N is large, e.g. p = c/N for a constant c, or p = log(N)/N.
- In expectation, each vertex will have a "small" number of neighbors.
- We will then examine what happens as N goes to infinity; we can thus study properties of large networks with bounded degree.
- Degree distribution of a typical G drawn from G(N,p): draw G according to G(N,p) and look at a random vertex u in G. What is Pr[deg(u) = k] for any fixed k? A Poisson distribution with mean $\lambda = p(N-1) \approx pN$: sharply concentrated, not heavy-tailed.
- It is especially easy to generate networks from G(N,p).

A Closely Related Model: G(N,m)
- For any fixed m <= N(N-1)/2, define the distribution G(N,m): choose uniformly at random from all graphs with exactly m edges.
- G(N,m) is "like" G(N,p) with $p = m / (N(N-1)/2) \approx 2m/N^2$; this intuition can be made precise, and is correct.
- If m = cN, then $p = 2c/(N-1) \approx 2c/N$.
- G(N,m) is mathematically trickier than G(N,p).

Evolution of a Random Network
- We have a large number n of vertices and start randomly adding edges one at a time.
- At what time t will the network have at least one "large" connected component? A single connected component? "Small" diameter? A "large" clique?
- How gradually or suddenly do these properties appear?

Random Graph Models G(N,p) and G(N,m): Recap
- Model G(N,p): select each of the possible edges independently with probability p. The expected total number of edges is pN(N-1)/2, the expected degree of a vertex is p(N-1), and the degree obeys a Poisson distribution (not heavy-tailed).
- Model G(N,m): select exactly m of the N(N-1)/2 edges to appear; all sets of m edges are equally likely.
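Sampling from G(N,p) amounts to one biased coin flip per possible edge, as described in the recap above. The sketch below uses only the Python standard library; the function name `sample_gnp`, the adjacency-set representation and the parameter values in the example are assumptions made for illustration.

```python
import random
from itertools import combinations

def sample_gnp(n, p, seed=None):
    """Draw one graph from G(N,p) as a dict node -> set of neighbours."""
    rng = random.Random(seed)                 # seed only for reproducibility
    adj = {v: set() for v in range(n)}
    for u, v in combinations(range(n), 2):    # N(N-1)/2 independent trials
        if rng.random() < p:                  # edge appears with probability p
            adj[u].add(v)
            adj[v].add(u)
    return adj

# Illustrative draw: the average degree should be close to p(N-1) for large N
graph = sample_gnp(200, 0.05, seed=1)
mean_degree = sum(len(nbrs) for nbrs in graph.values()) / len(graph)
```

The quadratic loop over all vertex pairs is fine for small N; for very large sparse graphs one would typically sample the edges more cleverly, which is outside the scope of this sketch.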

Random Graphs: Degree Distributions

The G(N,p) degree distribution follows a binomial:

$p(k) = B(n; k; p) = \binom{n}{k} p^k (1-p)^{n-k}$

Assuming $z = np$ is fixed, as $n \to \infty$ the binomial $B(n; k; p)$ is approximated by a Poisson distribution:

$p(k) = P(k; z) = \frac{z^k}{k!} e^{-z}$

Highly concentrated around the mean, with a tail that drops exponentially.

Monotone Network Properties

Threshold or tipping for (say) connectivity:
- fewer than m(N) edges: graph almost certainly not connected
- more than m(N) edges: graph almost certainly is connected
- made formal by examining the limit as $N \to \infty$

Often interested in monotone graph properties:
- let G have the property
- add edges to G to obtain G'
- then G' must have the property also

Examples:
- G is connected
- G has diameter <= d (not exactly d)
- G has a clique of size >= k (not exactly k)
- d, k may depend on network size N (How?)

Difficult to study the emergence of non-monotone properties as the number of edges is increased: what would it mean?

Random Graphs: Thresholds for Monotone Properties

Consider the Erdos-Renyi G(N,m) model: select m edges at random to include in G.

Let P be some monotone property of graphs:
- P(G) = 1: G has the property
- P(G) = 0: G does not have the property

Let m(N) be some function of network size N; it formalizes the idea that property P appears suddenly at m(N) edges.

Say that m(N) is a threshold function for P if:
- let m'(N) be any function of N
- look at the ratio r(N) = m'(N)/m(N) as $N \to \infty$
- if $r(N) \to 0$: the probability that P(G) = 1 in G(N, m'(N)) tends to 0
- if $r(N) \to \infty$: the probability that P(G) = 1 in G(N, m'(N)) tends to 1

A purely structural definition of tipping: tipping results from an incremental increase in connectivity.

Random Graphs: Which Properties Tip?

- Connected component of size > N/2: threshold function is m(N) = N/2 (or p ~ 1/N)
  - note: full connectivity impossible
- Fully connected: threshold function is m(N) = (N/2) log(N) (or p ~ log(N)/N)
  - network remains extremely sparse: only ~ log(N) edges per vertex
- Small diameter: threshold is m(N) ~ N^(3/2) for diameter 2 (or p ~ 2/sqrt(N))
  - fraction of possible edges still ~ 2/sqrt(N), which tends to 0
  - generates very small worlds

Erdos-Renyi Summary

- A model in which all connections are equally likely
  - each of the N(N-1)/2 edges chosen randomly and independently
- As we add edges, a precise sequence of events unfolds:
  - graph acquires a giant component
  - graph becomes connected
  - graph acquires small diameter
- Many properties appear very suddenly (tipping, thresholds)
- All statements are mathematically precise
- But is this how natural networks form?
- If not, which aspects are unrealistic?
  - maybe all edges are not equally likely!

Erdos-Renyi: Clustering Coefficient

- Generate a network G according to G(N,p)
- Examine a typical vertex u in G
  - choose u at random among all vertices in G
  - what do we expect c(u) to be?
- Answer: exactly p!
- In G(N,m), expect c(u) to be 2m/(N(N-1))
- Both cases: c(u) entirely determined by overall density
- Baseline for comparison with more clustered models
  - Erdos-Renyi has no bias towards clustered or local edges
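The claim that the expected clustering coefficient in G(N,p) equals the edge density p can be checked empirically. The following is a minimal sketch using only the Python standard library; the function names (`gnp`, `clustering`, `mean_clustering`, `mean_degree`) and the parameter values are illustrative assumptions, not part of the original slides.

```python
import random
from itertools import combinations


def gnp(n, p, rng):
    """Sample an undirected G(n, p) graph as a list of neighbor sets."""
    adj = [set() for _ in range(n)]
    for u, v in combinations(range(n), 2):
        # Each of the n(n-1)/2 possible edges is included independently.
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def clustering(adj, u):
    """Fraction of pairs of u's neighbors that are themselves connected."""
    nbrs = list(adj[u])
    if len(nbrs) < 2:
        return 0.0  # convention chosen here for vertices with fewer than 2 neighbors
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (len(nbrs) * (len(nbrs) - 1) / 2)


def mean_clustering(adj):
    """Average clustering coefficient over all vertices."""
    return sum(clustering(adj, u) for u in range(len(adj))) / len(adj)


def mean_degree(adj):
    """Average vertex degree."""
    return sum(len(nb) for nb in adj) / len(adj)


if __name__ == "__main__":
    rng = random.Random(0)
    g = gnp(400, 0.05, rng)
    print("mean degree:", mean_degree(g), "(expected about", 399 * 0.05, ")")
    print("mean clustering:", mean_clustering(g), "(expected about 0.05)")
```

For moderately large N the measured average should land close to p, which is what makes Erdos-Renyi a useful baseline for models with a bias towards local edges.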

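The tipping behaviour around m(N) = N/2 in the G(N,m) model can also be observed by simulation. The sketch below samples m distinct edges uniformly at random and tracks components with a union-find structure; the function name `largest_component_fraction` and the chosen values of N and m are assumptions made for illustration only.

```python
import random


def largest_component_fraction(n, m, rng):
    """Sample G(n, m) and return the largest component's share of the vertices."""
    if m > n * (n - 1) // 2:
        raise ValueError("m exceeds the number of possible edges")
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Find the component representative, shortening paths along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)  # two distinct endpoints
        edge = (min(u, v), max(u, v))
        if edge in edges:
            continue  # edges are drawn without repetition
        edges.add(edge)
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru  # union by size
            size[ru] += size[rv]
    return max(size[find(x)] for x in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(1)
    n = 2000
    for m in (n // 4, n // 2, n, 2 * n):
        print(m, largest_component_fraction(n, m, rng))
```

Well below N/2 edges the largest component should cover only a small fraction of the vertices, while well above N/2 it should cover most of them.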
Watts's Models: Caveman and Solaria

Erdos-Renyi:
- sharing a common neighbor makes two vertices no more likely to be directly connected than two very distant vertices
- every edge appears entirely independently of existing structure

But in many settings, the opposite is true:
- you tend to meet new friends through your old friends
- two web pages pointing to a third might share a topic
- two companies selling goods to a third are in related industries

Watts's Caveman world:
- overall density of edges is low
- but two vertices with a common neighbor are likely connected

Watts's Solaria world:
- overall density of edges is low; no special bias towards local edges
- like Erdos-Renyi

Making it (Somewhat) Precise: the α-model

The α-model has the following parameters or "knobs":
- N: size of the network to be generated
- k: the average degree of a vertex in the network to be generated
- p: the default probability that two vertices are connected
- α: adjustable parameter dictating the bias towards local connections

For any vertices u and v, define m(u,v) to be the number of common neighbors (so far).

Key quantity: the propensity R(u,v) of u to connect to v:
- if m(u,v) >= k, R(u,v) = 1 (share too many friends not to connect)
- if m(u,v) = 0, R(u,v) = p (no mutual friends, no bias to connect)
- else, R(u,v) = p + (m(u,v)/k)^α (1-p)

Generate the network incrementally using R(u,v) as the edge probability; details omitted.

Note: α = infinity is like Erdos-Renyi (but not exactly).

Small Worlds and Occam's Razor

- For small α, should generate large clustering coefficients
  - we programmed the model to do so
  - Watts claims that proving precise statements is hard
- But we do not want a new model for every little property
  - Erdos-Renyi: small diameter
  - α-model: high clustering coefficient
- In the interests of Occam's Razor, we would like to find a single, simple model of network generation that simultaneously captures many properties
- Watts's small world: small diameter and high clustering

Watts-Strogatz Model

Figure: C(p), the clustering coefficient, and L(p), the average path length (Watts and Strogatz, Nature 393, 440 (1998)).

Meanwhile, Back in the Real World

Watts examines three real networks as case studies:
- the Kevin Bacon graph
- the Western states power grid
- the nervous system

For each of these networks, he:
- computes its size, diameter, and clustering coefficient
- compares diameter and clustering to the best Erdos-Renyi approximation
- shows that the best α-model approximation is better
- important to be fair to each model by finding the best fit

Overall moral: if we care only about diameter and clustering, α is better than p.

World Wide Web

Expected result:
- <k> ~ 6
- P(k=500) ~ 10^-99
- N_WWW ~ 10^9, so N(k=500) ~ 10^-90

Real result:
- γ_out = 2.45, γ_in = 2.1
- P_out(k) ~ k^(-γ_out), P_in(k) ~ k^(-γ_in)
- P(k=500) ~ 10^-6
- N_WWW ~ 10^9, so N(k=500) ~ 10^3

(J. Kleinberg et al., Proceedings of the ICCC (1999))
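As a rough order-of-magnitude check of the power-law figures for the Web, assume the normalization constant is of order one and take the in-degree exponent to be 2.1 and the Web size to be 10^9 pages (both are assumptions of this sketch):

$P(k = 500) \sim 500^{-2.1} = 10^{-2.1 \log_{10} 500} \approx 10^{-5.67} \approx 2 \times 10^{-6}$

$N(k = 500) \approx N_{WWW} \cdot P(k = 500) \approx 10^{9} \times 2 \times 10^{-6} = 2 \times 10^{3}$

Under these assumptions a power law predicts on the order of a thousand pages with about 500 links, whereas a distribution with an exponentially decaying tail predicts essentially none.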

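The propensity rule of the α-model can be written down directly from its three cases. This is a minimal sketch, assuming the reading R(u,v) = p + (m(u,v)/k)^α (1-p) for the intermediate case; the function name `propensity` and the sample parameter values are hypothetical, and the incremental graph-generation procedure itself is not shown because the slides omit its details.

```python
def propensity(m_uv, k, p, alpha):
    """Propensity R(u, v) of u to connect to v in the alpha-model.

    m_uv  -- number of common neighbors of u and v so far
    k     -- target average degree
    p     -- default connection probability
    alpha -- bias parameter (small alpha favors local connections)
    """
    if m_uv >= k:
        return 1.0  # too many shared friends not to connect
    if m_uv == 0:
        return p  # no mutual friends, no bias
    return p + (m_uv / k) ** alpha * (1.0 - p)


if __name__ == "__main__":
    # With k = 10 and p = 0.01, compare a small and a large alpha.
    for alpha in (0.5, 1.0, 5.0, 50.0):
        print(alpha, [round(propensity(m, 10, 0.01, alpha), 4) for m in (0, 1, 5, 9, 10)])
```

For very large α the intermediate case collapses towards p, which is the sense in which the model approaches Erdos-Renyi; for small α even one or two common neighbors raise the propensity sharply.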
What does that mean?

Figure: Poisson degree distribution (exponential network) versus power-law degree distribution (scale-free network).

Scale-free Networks

The number of nodes (N) is not fixed:
- networks continuously expand by the addition of new nodes
- WWW: addition of new nodes
- Citation: publication of new papers

The attachment is not uniform:
- a node is linked with higher probability to a node that already has a large number of links
- WWW: new documents link to well-known sites (CNN, Yahoo, Google)
- Citation: well-cited papers are more likely to be cited again

Scale-Free Networks: the Barabasi-Albert Model

Growth: starting with a small number ($m_0$) of nodes, at every time step we add a new node with $m$ ($\le m_0$) edges that link the new node to $m$ different nodes already present in the system.

Preferential attachment: the probability $\Pi$ that a new node will be connected to node $i$ depends on the degree $k_i$ of node $i$, such that

$\Pi(k_i) = \frac{k_i}{\sum_j k_j}$

- After t time steps the network has $N = t + m_0$ nodes and $m t$ edges
- Vertices j with high degree are likely to get more links ("rich get richer")
- The network evolves into a stationary scale-free state with a power-law degree distribution, $P(k) \sim k^{-\gamma}$ (with $\gamma = 3$ for this model)

Scale-Free Networks: Properties

- Preferential attachment explains
  - heavy-tailed degree distributions
  - small diameter (~ log(N), via "hubs")
- Small average path length
- Node degree correlations: the dynamical process that creates a scale-free network builds up nontrivial correlations between the degrees of connected nodes
- Clustering coefficient is about 5 times larger than that of a random graph
- Will not generate a high clustering coefficient as with the Watts-Strogatz model
- Clustering coefficient decreases with network size, following approximately a power law, i.e. no bias towards local connectivity, but towards hubs

Network Robustness

- The accidental failure of a number of nodes in a random network can fracture the system into non-communicating islands.
- Scale-free networks are more robust in the face of such failures.
- Scale-free networks are highly vulnerable to a coordinated attack against their hubs.
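The growth and preferential-attachment rules of the Barabasi-Albert model can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' reference procedure: the seed of m0 nodes is taken to be a complete graph so that every seed node has nonzero degree (the slides do not say how the seed is wired), and the function names `barabasi_albert` and `degrees` are hypothetical.

```python
import random
from collections import Counter


def barabasi_albert(n, m, m0, rng):
    """Grow a graph to n nodes; return its undirected edges as (old, new) pairs."""
    if not (1 <= m <= m0 and 2 <= m0 <= n):
        raise ValueError("need 1 <= m <= m0 and 2 <= m0 <= n")
    # Assumption: the m0 seed nodes start out fully connected.
    edges = [(u, v) for u in range(m0) for v in range(u + 1, m0)]
    # Node i appears in the pool once per incident edge, so a uniform draw
    # from the pool picks node i with probability k_i / sum_j k_j.
    pool = [x for e in edges for x in e]
    for new in range(m0, n):
        targets = set()
        while len(targets) < m:  # m different existing nodes
            targets.add(rng.choice(pool))
        for t in sorted(targets):
            edges.append((t, new))
            pool.extend((t, new))
    return edges


def degrees(edges):
    """Map each node to its degree."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


if __name__ == "__main__":
    rng = random.Random(42)
    deg = degrees(barabasi_albert(2000, 2, 3, rng))
    print("mean degree:", sum(deg.values()) / len(deg))
    print("largest degrees:", sorted(deg.values(), reverse=True)[:5])
```

Drawing targets one at a time and discarding repeats only approximates sampling without replacement in proportion to degree, which is adequate for seeing the "rich get richer" effect: a few early nodes end up with degrees far above the mean.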

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

LINK ANALYSIS. Dr. Gjergji Kasneci Introduction to Information Retrieval WS LINK ANALYSIS Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Retrieval models Retrieval evaluation Link analysis Models

More information

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search

6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search 6.207/14.15: Networks Lecture 7: Search on Networks: Navigation and Web Search Daron Acemoglu and Asu Ozdaglar MIT September 30, 2009 1 Networks: Lecture 7 Outline Navigation (or decentralized search)

More information

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page

Hyperlinked-Induced Topic Search (HITS) identifies. authorities as good content sources (~high indegree) HITS [Kleinberg 99] considers a web page IV.3 HITS Hyperlinked-Induced Topic Search (HITS) identifies authorities as good content sources (~high indegree) hubs as good link sources (~high outdegree) HITS [Kleinberg 99] considers a web page a

More information

CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof.

CS54701 Information Retrieval. Link Analysis. Luo Si. Department of Computer Science Purdue University. Borrowed Slides from Prof. CS54701 Information Retrieval Link Analysis Luo Si Department of Computer Science Purdue University Borrowed Slides from Prof. Rong Jin (MSU) Citation Analysis Web Structure Web is a graph Each web site

More information

<is web> Web Retrieval. Chapter 3: Authority Ranking. ISWeb Information Systems & Semantic Web University of Koblenz Landau. <Nr.

<is web> Web Retrieval. Chapter 3: Authority Ranking. ISWeb Information Systems & Semantic Web University of Koblenz Landau. <Nr. ISWeb Information Systems & Semantic Web University of Koblenz Landau Web Retrieval Chapter 3: Authority Ranking of 20 Chapter 1: Outline Authority Ranking: Motivation Collaboration environments

More information

Online Social Networks and Media. Link Analysis and Web Search

Online Social Networks and Media. Link Analysis and Web Search Online Social Networks and Media Link Analysis and Web Search How to Organize the Web First try: Human curated Web directories Yahoo, DMOZ, LookSmart How to organize the web Second try: Web Search Information

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS473: Web Information Search and Management Using Graph Structure for Retrieval Prof. Chris Clifton 24 September 218 Material adapted from slides created by Dr. Rong Jin (formerly Michigan State, now

More information

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine

CS 277: Data Mining. Mining Web Link Structure. CS 277: Data Mining Lectures Analyzing Web Link Structure Padhraic Smyth, UC Irvine CS 277: Data Mining Mining Web Link Structure Class Presentations In-class, Tuesday and Thursday next week 2-person teams: 6 minutes, up to 6 slides, 3 minutes/slides each person 1-person teams 4 minutes,

More information

Online Social Networks and Media. Link Analysis and Web Search

Online Social Networks and Media. Link Analysis and Web Search Online Social Networks and Media Link Analysis and Web Search How to Organize the Web First try: Human curated Web directories Yahoo, DMOZ, LookSmart How to organize the web Second try: Web Search Information

More information

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa

Introduction to Search Engine Technology Introduction to Link Structure Analysis. Ronny Lempel Yahoo Labs, Haifa Introduction to Search Engine Technology Introduction to Link Structure Analysis Ronny Lempel Yahoo Labs, Haifa Outline Anchor-text indexing Mathematical Background Motivation for link structure analysis

More information

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS

DATA MINING LECTURE 13. Link Analysis Ranking PageRank -- Random walks HITS DATA MINING LECTURE 3 Link Analysis Ranking PageRank -- Random walks HITS How to organize the web First try: Manually curated Web Directories How to organize the web Second try: Web Search Information

More information

Link Analysis Ranking

Link Analysis Ranking Link Analysis Ranking How do search engines decide how to rank your query results? Guess why Google ranks the query results the way it does How would you do it? Naïve ranking of query results Given query

More information

Data Mining and Matrices

Data Mining and Matrices Data Mining and Matrices 10 Graphs II Rainer Gemulla, Pauli Miettinen Jul 4, 2013 Link analysis The web as a directed graph Set of web pages with associated textual content Hyperlinks between webpages

More information

6.207/14.15: Networks Lecture 12: Generalized Random Graphs

6.207/14.15: Networks Lecture 12: Generalized Random Graphs 6.207/14.15: Networks Lecture 12: Generalized Random Graphs 1 Outline Small-world model Growing random networks Power-law degree distributions: Rich-Get-Richer effects Models: Uniform attachment model

More information

25.1 Markov Chain Monte Carlo (MCMC)

25.1 Markov Chain Monte Carlo (MCMC) CS880: Approximations Algorithms Scribe: Dave Andrzejewski Lecturer: Shuchi Chawla Topic: Approx counting/sampling, MCMC methods Date: 4/4/07 The previous lecture showed that, for self-reducible problems,

More information

1 Complex Networks - A Brief Overview

1 Complex Networks - A Brief Overview Power-law Degree Distributions 1 Complex Networks - A Brief Overview Complex networks occur in many social, technological and scientific settings. Examples of complex networks include World Wide Web, Internet,

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize/navigate it? First try: Human curated Web directories Yahoo, DMOZ, LookSmart

More information

The Beginning of Graph Theory. Theory and Applications of Complex Networks. Eulerian paths. Graph Theory. Class Three. College of the Atlantic

The Beginning of Graph Theory. Theory and Applications of Complex Networks. Eulerian paths. Graph Theory. Class Three. College of the Atlantic Theory and Applications of Complex Networs 1 Theory and Applications of Complex Networs 2 Theory and Applications of Complex Networs Class Three The Beginning of Graph Theory Leonhard Euler wonders, can

More information

Lecture 12: Link Analysis for Web Retrieval

Lecture 12: Link Analysis for Web Retrieval Lecture 12: Link Analysis for Web Retrieval Trevor Cohn COMP90042, 2015, Semester 1 What we ll learn in this lecture The web as a graph Page-rank method for deriving the importance of pages Hubs and authorities

More information

Web Structure Mining Nodes, Links and Influence

Web Structure Mining Nodes, Links and Influence Web Structure Mining Nodes, Links and Influence 1 Outline 1. Importance of nodes 1. Centrality 2. Prestige 3. Page Rank 4. Hubs and Authority 5. Metrics comparison 2. Link analysis 3. Influence model 1.

More information

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson) Link Analysis Web Ranking Documents on the web are first ranked according to their relevance vrs the query Additional ranking methods are needed to cope with huge amount of information Additional ranking

More information

CS224W: Analysis of Networks Jure Leskovec, Stanford University

CS224W: Analysis of Networks Jure Leskovec, Stanford University CS224W: Analysis of Networks Jure Leskovec, Stanford University http://cs224w.stanford.edu 10/30/17 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu 2

More information

ECEN 689 Special Topics in Data Science for Communications Networks

ECEN 689 Special Topics in Data Science for Communications Networks ECEN 689 Special Topics in Data Science for Communications Networks Nick Duffield Department of Electrical & Computer Engineering Texas A&M University Lecture 8 Random Walks, Matrices and PageRank Graphs

More information

Degree Distribution: The case of Citation Networks

Degree Distribution: The case of Citation Networks Network Analysis Degree Distribution: The case of Citation Networks Papers (in almost all fields) refer to works done earlier on same/related topics Citations A network can be defined as Each node is

More information

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte

Wiki Definition. Reputation Systems I. Outline. Introduction to Reputations. Yury Lifshits. HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte Reputation Systems I HITS, PageRank, SALSA, ebay, EigenTrust, VKontakte Yury Lifshits Wiki Definition Reputation is the opinion (more technically, a social evaluation) of the public toward a person, a

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms Data Mining and Analysis: Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University.

CS246: Mining Massive Datasets Jure Leskovec, Stanford University. CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu What is the structure of the Web? How is it organized? 2/7/2011 Jure Leskovec, Stanford C246: Mining Massive

More information

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged.

PageRank algorithm Hubs and Authorities. Data mining. Web Data Mining PageRank, Hubs and Authorities. University of Szeged. Web Data Mining PageRank, University of Szeged Why ranking web pages is useful? We are starving for knowledge It earns Google a bunch of money. How? How does the Web looks like? Big strongly connected

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 Web pages are not equally important www.joe-schmoe.com

More information

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds

MAE 298, Lecture 8 Feb 4, Web search and decentralized search on small-worlds MAE 298, Lecture 8 Feb 4, 2008 Web search and decentralized search on small-worlds Search for information Assume some resource of interest is stored at the vertices of a network: Web pages Files in a file-sharing

More information

MS&E 233 Lecture 9 & 10: Network models

MS&E 233 Lecture 9 & 10: Network models MS&E 233 Lecture 9 & 10: Networ models Ashish Goel, scribed by Riley Matthews April 30, May 2 Review: Application of PR to Netflix movie suggestions Client x: relm 1 ǫ # of users who lied m m: i lies m

More information

1 Searching the World Wide Web

1 Searching the World Wide Web Hubs and Authorities in a Hyperlinked Environment 1 Searching the World Wide Web Because diverse users each modify the link structure of the WWW within a relatively small scope by creating web-pages on

More information

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices

Communities Via Laplacian Matrices. Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices Communities Via Laplacian Matrices Degree, Adjacency, and Laplacian Matrices Eigenvectors of Laplacian Matrices The Laplacian Approach As with betweenness approach, we want to divide a social graph into

More information

Self Similar (Scale Free, Power Law) Networks (I)

Self Similar (Scale Free, Power Law) Networks (I) Self Similar (Scale Free, Power Law) Networks (I) E6083: lecture 4 Prof. Predrag R. Jelenković Dept. of Electrical Engineering Columbia University, NY 10027, USA {predrag}@ee.columbia.edu February 7, 2007

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu Intro sessions to SNAP C++ and SNAP.PY: SNAP.PY: Friday 9/27, 4:5 5:30pm in Gates B03 SNAP

More information

Complex Social System, Elections. Introduction to Network Analysis 1

Complex Social System, Elections. Introduction to Network Analysis 1 Complex Social System, Elections Introduction to Network Analysis 1 Complex Social System, Network I person A voted for B A is more central than B if more people voted for A In-degree centrality index

More information

Link Analysis. Leonid E. Zhukov

Link Analysis. Leonid E. Zhukov Link Analysis Leonid E. Zhukov School of Data Analysis and Artificial Intelligence Department of Computer Science National Research University Higher School of Economics Structural Analysis and Visualization

More information

1 Mechanistic and generative models of network structure

1 Mechanistic and generative models of network structure 1 Mechanistic and generative models of network structure There are many models of network structure, and these largely can be divided into two classes: mechanistic models and generative or probabilistic

More information

Network models: dynamical growth and small world

Network models: dynamical growth and small world Network models: dynamical growth and small world Leonid E. Zhukov School of Data Analysis and Artificial Intelligence Department of Computer Science National Research University Higher School of Economics

More information

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University TheFind.com Large set of products (~6GB compressed) For each product A=ributes Related products Craigslist About 3 weeks of data

More information

Data science with multilayer networks: Mathematical foundations and applications

Data science with multilayer networks: Mathematical foundations and applications Data science with multilayer networks: Mathematical foundations and applications CDSE Days University at Buffalo, State University of New York Monday April 9, 2018 Dane Taylor Assistant Professor of Mathematics

More information

6.207/14.15: Networks Lecture 3: Erdös-Renyi graphs and Branching processes

6.207/14.15: Networks Lecture 3: Erdös-Renyi graphs and Branching processes 6.207/14.15: Networks Lecture 3: Erdös-Renyi graphs and Branching processes Daron Acemoglu and Asu Ozdaglar MIT September 16, 2009 1 Outline Erdös-Renyi random graph model Branching processes Phase transitions

More information

Network Science (overview, part 1)

Network Science (overview, part 1) Network Science (overview, part 1) Ralucca Gera, Applied Mathematics Dept. Naval Postgraduate School Monterey, California rgera@nps.edu Excellence Through Knowledge Overview Current research Section 1:

More information

Page rank computation HPC course project a.y

Page rank computation HPC course project a.y Page rank computation HPC course project a.y. 2015-16 Compute efficient and scalable Pagerank MPI, Multithreading, SSE 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and

More information

Probability & Computing

Probability & Computing Probability & Computing Stochastic Process time t {X t t 2 T } state space Ω X t 2 state x 2 discrete time: T is countable T = {0,, 2,...} discrete space: Ω is finite or countably infinite X 0,X,X 2,...

More information

Models and Algorithms for Complex Networks. Link Analysis Ranking

Models and Algorithms for Complex Networks. Link Analysis Ranking Models and Algorithms for Complex Networks Link Analysis Ranking Why Link Analysis? First generation search engines view documents as flat text files could not cope with size, spamming, user needs Second

More information

arxiv:cond-mat/ v2 6 Aug 2002

arxiv:cond-mat/ v2 6 Aug 2002 Percolation in Directed Scale-Free Networs N. Schwartz, R. Cohen, D. ben-avraham, A.-L. Barabási and S. Havlin Minerva Center and Department of Physics, Bar-Ilan University, Ramat-Gan, Israel Department

More information

Powerful tool for sampling from complicated distributions. Many use Markov chains to model events that arise in nature.

Powerful tool for sampling from complicated distributions. Many use Markov chains to model events that arise in nature. Markov Chains Markov chains: 2SAT: Powerful tool for sampling from complicated distributions rely only on local moves to explore state space. Many use Markov chains to model events that arise in nature.

More information

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10

PageRank. Ryan Tibshirani /36-662: Data Mining. January Optional reading: ESL 14.10 PageRank Ryan Tibshirani 36-462/36-662: Data Mining January 24 2012 Optional reading: ESL 14.10 1 Information retrieval with the web Last time we learned about information retrieval. We learned how to

More information

Finding Authorities and Hubs From Link Structures on the World Wide Web

Finding Authorities and Hubs From Link Structures on the World Wide Web Finding Authorities and Hubs From Link Structures on the World Wide Web Allan Borodin Gareth O. Roberts Jeffrey S. Rosenthal Panayiotis Tsaparas October 4, 2006 Abstract Recently, there have been a number

More information

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Web Search: How to Organize the Web? Ranking Nodes on Graphs Hubs and Authorities PageRank How to Solve PageRank

More information

Data and Algorithms of the Web

Data and Algorithms of the Web Data and Algorithms of the Web Link Analysis Algorithms Page Rank some slides from: Anand Rajaraman, Jeffrey D. Ullman InfoLab (Stanford University) Link Analysis Algorithms Page Rank Hubs and Authorities

More information

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018 Lab 8: Measuring Graph Centrality - PageRank Monday, November 5 CompSci 531, Fall 2018 Outline Measuring Graph Centrality: Motivation Random Walks, Markov Chains, and Stationarity Distributions Google

More information

arxiv:cond-mat/ v1 [cond-mat.dis-nn] 4 May 2000

arxiv:cond-mat/ v1 [cond-mat.dis-nn] 4 May 2000 Topology of evolving networks: local events and universality arxiv:cond-mat/0005085v1 [cond-mat.dis-nn] 4 May 2000 Réka Albert and Albert-László Barabási Department of Physics, University of Notre-Dame,

More information

networks in molecular biology Wolfgang Huber

networks in molecular biology Wolfgang Huber networks in molecular biology Wolfgang Huber networks in molecular biology Regulatory networks: components = gene products interactions = regulation of transcription, translation, phosphorylation... Metabolic

More information

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides

Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Thanks to Jure Leskovec, Stanford and Panayiotis Tsaparas, Univ. of Ioannina for slides Web Search: How to Organize the Web? Ranking Nodes on Graphs Hubs and Authorities PageRank How to Solve PageRank

More information

Decision Making and Social Networks

Decision Making and Social Networks Decision Making and Social Networks Lecture 4: Models of Network Growth Umberto Grandi Summer 2013 Overview In the previous lecture: We got acquainted with graphs and networks We saw lots of definitions:

More information

Deterministic Decentralized Search in Random Graphs

Deterministic Decentralized Search in Random Graphs Deterministic Decentralized Search in Random Graphs Esteban Arcaute 1,, Ning Chen 2,, Ravi Kumar 3, David Liben-Nowell 4,, Mohammad Mahdian 3, Hamid Nazerzadeh 1,, and Ying Xu 1, 1 Stanford University.

More information

Finding central nodes in large networks

Finding central nodes in large networks Finding central nodes in large networks Nelly Litvak University of Twente Eindhoven University of Technology, The Netherlands Woudschoten Conference 2017 Complex networks Networks: Internet, WWW, social

More information

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson)

Web Ranking. Classification (manual, automatic) Link Analysis (today s lesson) Link Analysis Web Ranking Documents on the web are first ranked according to their relevance vrs the query Additional ranking methods are needed to cope with huge amount of information Additional ranking

More information

Computing PageRank using Power Extrapolation

Computing PageRank using Power Extrapolation Computing PageRank using Power Extrapolation Taher Haveliwala, Sepandar Kamvar, Dan Klein, Chris Manning, and Gene Golub Stanford University Abstract. We present a novel technique for speeding up the computation

More information

Modeling of Growing Networks with Directional Attachment and Communities

Modeling of Growing Networks with Directional Attachment and Communities Modeling of Growing Networks with Directional Attachment and Communities Masahiro KIMURA, Kazumi SAITO, Naonori UEDA NTT Communication Science Laboratories 2-4 Hikaridai, Seika-cho, Kyoto 619-0237, Japan

More information

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci

Link Analysis Information Retrieval and Data Mining. Prof. Matteo Matteucci Link Analysis Information Retrieval and Data Mining Prof. Matteo Matteucci Hyperlinks for Indexing and Ranking 2 Page A Hyperlink Page B Intuitions The anchor text might describe the target page B Anchor

More information

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano Google PageRank Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano fricci@unibz.it 1 Content p Linear Algebra p Matrices p Eigenvalues and eigenvectors p Markov chains p Google

More information

Spectral Analysis of Directed Complex Networks. Tetsuro Murai

Spectral Analysis of Directed Complex Networks. Tetsuro Murai MASTER THESIS Spectral Analysis of Directed Complex Networks Tetsuro Murai Department of Physics, Graduate School of Science and Engineering, Aoyama Gakuin University Supervisors: Naomichi Hatano and Kenn

More information

The Static Absorbing Model for the Web a

The Static Absorbing Model for the Web a Journal of Web Engineering, Vol. 0, No. 0 (2003) 000 000 c Rinton Press The Static Absorbing Model for the Web a Vassilis Plachouras University of Glasgow Glasgow G12 8QQ UK vassilis@dcs.gla.ac.uk Iadh

More information
