CS322: Network Analysis. Problem Set 2

Due October 9, 2009 in class

If you have any questions regarding the problem set, send an email to the course assistants: simlac@stanford.edu and peleato@stanford.edu. Please write the names of your collaborators on your problem set. You can use existing software or code to compute the answers; you don't have to submit the source code.

The Problems

Problem 2.1 (From Easley and Kleinberg, Networks) In the basic six degrees of separation question, one asks whether most pairs of people in the world are connected by a path of at most six edges in the social network, where an edge joins any two people who know each other on a first-name basis. Now let's consider a variation on this question. Suppose that we consider the full population of the world, and suppose that from each person in the world we create a directed edge only to their ten closest friends (but not to anyone else they know on a first-name basis). In the resulting closest-friend version of the social network, is it possible that for each pair of people in the world, there is a path of at most six edges connecting this pair of people? Explain.

Solution: In the described network, there will be a pair of people such that there is no path of at most six edges connecting them. Let us fix a person, p, in the network and consider the set of people who are within 6 steps from that person. The largest size of this set will occur in the case of a tree rooted at that person. So, the largest size (assuming directed edges) is the following:

1 (person p) + 10 (num. of people at distance 1) + 100 (num. of people at distance 2) + 1000 + 10000 + 100000 + 1000000 = 1111111,

which is clearly a lot less than the world population (6 billion). Hence, such a graph cannot connect every two people by a path of at most 6 edges.

Problem 2.2 You are developing a protocol to establish a peer-to-peer overlay network among n nodes. This protocol operates as follows.

Step 1: Each node flips a coin (n-1) times to decide whether it generates an edge to each of the other (n-1) nodes. The probability of doing so is p. Links are assumed undirected, regardless of which side establishes them. If two nodes flip their corresponding coins and both decide to connect to each other, only one edge is created.

Step 2: After this is done, every node not yet connected selects another node at random and establishes a link to this node.

If you let $p = \log n/(2n)$, does this protocol establish a connected network for large n? (Hint: determine what small components exist after Step 1, and in particular, the number of isolated vertices.) What would your answer be if p was only $1/n$?

Solution: [We had originally thought of a different solution, but Stephen Dean Guo came up with the idea for the better one below]

If each side can establish an edge with probability p, the probability of any given edge existing in the network is $2p - p^2$. We realize that

\[ \frac{\log(n)}{n} - \left(\frac{\log(n)}{2n}\right)^2 \approx \frac{\log(n)}{n} \]

when n tends to infinity, so we can assume that our graph is a $G(n, \log(n)/n)$, i.e., the probability of any edge being present is $\log(n)/n$ (henceforth we will call this p). You might remember that this is exactly the threshold for connectivity of a random graph, so the proof will be somewhat trickier than in any other case.

Some of you expressed concern over the theorem stating that for all $\epsilon > 0$ the Erdos-Renyi graph with $p = (1 - \epsilon)\log(n)/n$ is disconnected. However, the $p^2$ term we neglected above cannot be viewed as that $\epsilon$, since the $\epsilon$ is supposed to be a small CONSTANT greater than zero, and p decreases with n.

Let $k_m$ be the expected number of disconnected components of size m. Given a subset of m nodes, they will be disconnected from the rest iff all m(n-m) edges between them and the rest of the graph are missing. The probability of this happening is $\left(1 - \frac{\log(n)}{n}\right)^{m(n-m)}$. On the other hand, the probability that all m nodes form a single component can be bounded using Cayley's theorem (the number of different spanning trees on a set of m nodes is $m^{m-2}$). Any connected component with m nodes will contain at least one spanning tree. Therefore we have the following chain of upper bounds:

\[ P(\text{m nodes are connected}) \le P(\text{there is a spanning tree}) \le \sum_{i=1}^{m^{m-2}} P(\text{spanning tree number i is present}) = m^{m-2} p^{m-1} \]

where the second inequality comes from the union bound, and the last equality from the fact that all spanning trees have the same number of edges (m-1). Taking into account that there are $\binom{n}{m}$ possible subsets of m nodes we finally get

\[ k_m \le u_m = \binom{n}{m} \left(1 - \frac{\log(n)}{n}\right)^{m(n-m)} m^{m-2} p^{m-1}. \]

We found an upper bound for $k_m$, which we will call $u_m$ for reasons that will become clear later. Massaging the above expression a bit and taking limits for large n, we get

\[ k_m \le \frac{n^m}{m!}\, m^{m-2}\, e^{-\log(n) m} \left(\frac{\log(n)}{n}\right)^{m-1} = \frac{m^{m-2}}{m!} \left(\frac{\log(n)}{n}\right)^{m-1}. \]

Hence, for large n, $k_1 = 1$ and $k_m = 0$ for all m > 1. Step 2 will take care of the isolated node, and the expected number of larger components being isolated goes to zero.

Unfortunately, this is not yet enough to assure that there will be no isolated components. Since the size of the possible components increases with n, we need to prove that their probability decreases fast enough so that $\sum_{i=2}^{n/2} k_i$ goes to zero. [For example, if we had $k_m = m/n$, then the expected number of isolated components of size m would be 0 for all m, but the expected number of isolated components of any size would be unbounded!!!]

We know that $\sum_{i=2}^{n/2} k_i \le \sum_{i=2}^{n/2} u_i$. Let's find the ratio between $u_{m+1}$ and $u_m$ when n tends to infinity:

\[
\frac{u_{m+1}}{u_m}
= \frac{\binom{n}{m+1} \left(1 - \frac{\log(n)}{n}\right)^{(m+1)(n-m-1)} (m+1)^{m-1} p^{m}}{\binom{n}{m} \left(1 - \frac{\log(n)}{n}\right)^{m(n-m)} m^{m-2} p^{m-1}}
= \frac{n-m}{m+1} \cdot \frac{(m+1)^{m-1}}{m^{m-2}} \left(1 - \frac{\log(n)}{n}\right)^{n-2m-1} \frac{\log(n)}{n}
\approx \left(\frac{m+1}{m}\right)^{m-2} \frac{\log(n)}{n}
\]

where the last step uses $\left(1 - \frac{\log(n)}{n}\right)^{n-2m-1} \approx e^{-\log(n)} = 1/n$. Thus, the expected number of isolated components of size m decreases as $\log(n)/n$ with each increment of m. Neglecting the constants, we can then bound the sum as:

\[ \sum_{i=2}^{n/2} k_i \le \sum_{i=2}^{n/2} u_i \le k_1 \sum_{i=1}^{n/2} \left(\frac{\log(n)}{n}\right)^{i} < k_1 \frac{\log(n)}{n} \sum_{i=0}^{\infty} \left(\frac{\log(n)}{n}\right)^{i} = k_1 \frac{\log(n)/n}{1 - \log(n)/n} \]

which tends to zero as n tends to infinity.
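The argument above can be checked empirically on small instances. The following is a minimal simulation sketch, not part of the original solution. It assumes that "not yet connected" in Step 2 means "has no edge after Step 1", and that the second case uses p = 1/n. The function names `simulate_protocol` and `fraction_connected` are hypothetical.

```python
import math
import random


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the components containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[ra] = rb


def simulate_protocol(n, p, rng):
    """Run Step 1 and Step 2 once; return the number of connected components."""
    parent = list(range(n))
    degree = [0] * n
    # Either endpoint may create the (undirected) edge: probability 2p - p^2.
    edge_prob = 2 * p - p * p
    # Step 1: decide each unordered pair independently.
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < edge_prob:
                union(parent, u, v)
                degree[u] += 1
                degree[v] += 1
    # Step 2 (assumption: "not yet connected" means degree zero after Step 1).
    isolated = [u for u in range(n) if degree[u] == 0]
    for u in isolated:
        v = rng.randrange(n - 1)
        if v >= u:
            v += 1  # skip u itself so the target is a different node
        union(parent, u, v)
    return len({find(parent, u) for u in range(n)})


def fraction_connected(n, p, trials=10, seed=0):
    """Fraction of independent runs that end with a single component."""
    rng = random.Random(seed)
    hits = sum(simulate_protocol(n, p, rng) == 1 for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    n = 300
    print("p = log(n)/(2n):", fraction_connected(n, math.log(n) / (2 * n)))
    print("p = 1/n:        ", fraction_connected(n, 1 / n))
```

For small n the asymptotic statements only hold approximately, so the printed fractions should be read as an illustration rather than a verification of the proof.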

Finally, let's study the case of $p = 1/n$. Given any two nodes, the probability that they are disconnected from the rest and connected to each other is $q(1-q)^{2(n-2)}$, where $q = 2p - p^2$ is the probability that a given edge is present; this is always larger than $q e^{-4}$. This probability tends to zero, but since the number of possible pairs increases with the number of nodes as $O(n^2)$, a constant fraction of the nodes will form isolated pairs (which Step 2 will not reconnect).

Problem 2.3 Generate a dataset of 1 million values following a power-law distribution with exponent 2.5. Then compute experimentally the exponent of the distribution, using the following 4 methods. Refer to Power-law distributions in empirical data by Clauset, Shalizi and Newman for how to generate random numbers from a power-law distribution.
a) Fitting a line to the frequency distribution.
b) Fitting a line to the frequency distribution with logarithmic binning.
c) Using the complementary CDF.
d) Using the maximum likelihood estimate.

Solution:

[Figure 1: Plots for exponent estimation. Panels: loglog plot of frequency; loglog plot with logarithm binning; loglog plot of cdf; logarithm binning + cdf.]

The data is generated by drawing a vector r of $10^6$ numbers uniformly from [0, 1] and applying the transformation $x = (1 - r)^{-2/3}$. We work with the continuous model in this problem. The calculation for the discrete model is very similar. See Figure 1 for the plots.

(a) By setting bins of constant width and doing linear regression of the frequencies in the loglog scale we get α = 0.94. The problem is that in the tail there are a lot of empty bins, so the linear regression fits a flat line.

(b) Let bin i be $[b^{i}, b^{i+1}]$ for a fixed base $b > 1$. We count the frequency in each bin and normalize it by the width of the bin. Now by linear regression in the loglog scale we get α = .7895. Over the full set of bins the noise in the tail is not negligible. If we use only the first 60 bins for regression then the answer is very accurate (α = .507). Also it should be noted that if the counts for each bin are not normalized, we get a better estimate α = .364. This is one of the weird effects of those empty bins.

(c) Here we compute the CDF and do regression in loglog scale, and increment the resulting alpha by 1. If constant width bins are used as in (a) we get α = .3533. If logarithmic binning is used then α = .4567.

(d) Using the MLE estimate we get $\alpha = 1 + n \left[ \sum_{i=1}^{n} \ln \frac{x_i}{x_{\min}} \right]^{-1}$ = .4983.

Problem 2.4 Consider the following evolving model for generating an undirected graph. Initially there are only three nodes connected into a triangle. At every time step, an edge of the current network is selected uniformly at random, and a new node is added to the network that links to both the endpoints of the edge. Prove that $p_k$, the fraction of nodes with degree k, follows a power law with exponent 3. Provide an intuitive explanation as to why this model is the same as the preferential attachment model.

Solution: Let $d_i(t)$ denote the degree of node i at time t. Node i only gets a new edge at time t+1 if one of its edges is picked. At time t the network has 3+2t edges, so this happens with probability $d_i(t)/(3+2t)$. Hence, the expected value of $d_i(t+1)$ will be:

\[ E[d_i(t+1)] = d_i(t) \left(1 + \frac{1}{3+2t}\right). \]

We can then approximate

\[ \frac{\partial d_i(t)}{\partial t} \approx \frac{d_i(t)}{3+2t}. \]

Solving the differential equation with the initial condition that $d_i(i) = 2$ we obtain

\[ d_i(t) = 2 \left(\frac{3+2t}{3+2i}\right)^{1/2}. \]

Just as we did in class, we can now find which nodes have degree higher than k at time t:

\[ i < \frac{2}{k^2}(3+2t) - \frac{3}{2}. \]

At time t there are 3+t nodes in the network, so the fraction of nodes with degree higher than k is

\[ \frac{2(3+2t)}{(3+t)k^2} - \frac{3}{2(3+t)}. \]

This expression can be considered the (complementary) cdf (cumulative distribution function) of the degrees at time t. Differentiating with respect to k, taking the absolute value, and making t tend to infinity, we get the asymptotic probability distribution:

\[ p_k \approx \frac{8}{k^3}. \]
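The growth model is easy to simulate, which gives a quick sanity check on the continuum approximation. The sketch below is an illustration only and was not part of the original solution. It compares the empirical fraction of nodes with degree above k against the limit 4/k^2 of the fraction derived above (the quantity whose derivative gives 8/k^3). The names `grow_graph` and `fraction_above` are hypothetical, and the two columns should only agree roughly, since the derivation treats expected degrees as continuous.

```python
import random


def grow_graph(steps, seed=0):
    """Start from a triangle; each step picks a uniformly random edge and
    attaches a new node to both of its endpoints. Returns the degree list."""
    rng = random.Random(seed)
    edges = [(0, 1), (1, 2), (0, 2)]
    degree = [2, 2, 2]
    for _ in range(steps):
        u, v = edges[rng.randrange(len(edges))]
        new = len(degree)  # index of the node being added
        degree.append(2)   # the new node links to both endpoints
        degree[u] += 1
        degree[v] += 1
        edges.append((new, u))
        edges.append((new, v))
    return degree


def fraction_above(degree, k):
    """Fraction of nodes whose degree is strictly greater than k."""
    return sum(d > k for d in degree) / len(degree)


if __name__ == "__main__":
    degree = grow_graph(200_000)
    print("k  empirical  4/k^2")
    for k in (4, 8, 16, 32):
        print(k, fraction_above(degree, k), 4 / k**2)
```

Picking a random edge and then its endpoints is what makes the selection degree-proportional: a node with degree d appears in d entries of the edge list.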

This model is the same as the preferential attachment model because in both cases the probability that a node gets a new edge is proportional to its current degree.

Problem 2.5 In this exercise we will study the distribution of words in the English language. The data consists of a list of all the words in a dictionary and a text version of A Tale of Two Cities by Charles Dickens (found at Project Gutenberg). In the latter, we have removed punctuation, apostrophes, etc., keeping only the 26 characters in the alphabet and the space.

(a) Write a program that reads the list of words provided and plots a graph showing the number of words that exist with lengths between 3 and 8 (you can discard all other words). How fast does this number increase?

(b) Using the novel A Tale of Two Cities as a representative sample, we now plot how frequently each word is used in the English language. Sort the words in the novel along the x axis from the most frequent to the least, and plot their number of appearances (many words in the dictionary will not be in the novel; you should not take those into account). Does it follow a power law? If so, find an approximation for the exponent.

If you looked further into the previous plot, you would see that the most frequent words are usually shorter. We now develop models to explain why, if long words are more numerous in the dictionary, authors use short ones more often.

(c) Assume that a monkey typed one billion ($10^9$) random characters on a keyboard (26 letters + space bar), and call a word any sequence of letters between two spaces. Find f(n), the expected number of times that a GIVEN sequence of length n would appear in the monkey's text (with spaces at both sides). Does f(n) follow a power law? If so, find an approximation for the exponent.

(d) On average, how many times would the 100-th most frequent word appear in the monkey's text? What about the 1000-th? (Hint: how long would those words be? Either simulate it or find an analytic expression.) Is this a good model for the results in (b)?

(e) We will try to further improve the model by assigning different probabilities to different characters. Find the probability of each character (including space) in A Tale of Two Cities and generate ten thousand words according to that distribution. Repeat the plot in part (b) for this new text. Is the model better?

Solution:

(a) The number of words of a given length increases linearly between 3 and 8.

[Figure: number of words of each length, for lengths 3 to 8.]

(b) Yes, it follows a power law, approximately with exponent -1.

[Figure: plot for part (b).]

(c) Using the union bound, we get $f(n) = \frac{10^9}{27^{n+2}}$. Rigorously speaking, it would be slightly smaller, since this is just an upper bound. It does not decrease according to a power law, but exponentially, as becomes clear from the previous expression.

(d) On average, any two letter word will be more frequent than any three letter one, while two words with the same number of characters have the same chances of appearing. Therefore, the first 26 most frequent words will be 1-character ones. Then we will have the $26^2$ two letter ones, which will roughly appear f(2) times; the 100th most frequent word is among these. Finally, the 1000th most frequent word will have three characters, and appear with a frequency of f(3). It is not a good model for our data. It is too step-like. Although it is true that the two exponentials cancel each other (increasing number of words and decreasing frequency) giving a power law, it does not capture the progressive descent that we observed in (b).

[Figure: plot for part (d).]

(e) The model does improve. But there is still a large number of words that appear just once. By increasing the length of the randomly generated text we could improve the precision at the tail.

[Figure: plot for part (e).]
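The step-like behaviour described in (d) can be made concrete with a short calculation. The sketch below is an illustration under stated assumptions, not the original solution code. It assumes 27 equally likely keys, so a given n-letter word flanked by spaces occupies a given position with probability 27^-(n+2), and it assumes that shorter words always rank ahead of longer ones. The function names `expected_count` and `length_of_rank` are hypothetical.

```python
def expected_count(length, total_chars=10**9, alphabet=26):
    """Approximate expected occurrences of one given word of `length` letters,
    delimited by a space on each side, in uniformly random text."""
    keys = alphabet + 1  # letters plus the space bar
    return total_chars / keys ** (length + 2)


def length_of_rank(rank, alphabet=26):
    """Word length at a 1-based frequency rank, assuming every shorter word
    ranks ahead of every longer one (ties within a length are arbitrary)."""
    length, covered = 1, alphabet
    while rank > covered:
        length += 1
        covered += alphabet ** length
    return length


if __name__ == "__main__":
    for rank in (1, 26, 27, 100, 1000):
        n = length_of_rank(rank)
        print(f"rank {rank}: length {n}, expected count {expected_count(n):.1f}")
```

Because every word of the same length shares one expected count, the rank-frequency curve is flat within each length and drops by a factor of 27 between lengths, which is the staircase shape the solution objects to.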

CS322: Network Analysis. Problem Set 2 - Fall 2009