CS322: Network Analysis. Problem Set 2 - Fall 2009


Due October 9, 2009, in class.

If you have any questions regarding the problem set, send an email to the course assistants: simlac@stanford.edu and peleato@stanford.edu. Please write the names of your collaborators on your problem set. You can use existing software or code to compute the answers; you do not have to submit the source code.

The Problems

Problem 2.1 (From Easley and Kleinberg, Networks) In the basic "six degrees of separation" question, one asks whether most pairs of people in the world are connected by a path of at most six edges in the social network, where an edge joins any two people who know each other on a first-name basis. Now let's consider a variation on this question. Suppose that we consider the full population of the world, and suppose that from each person in the world we create a directed edge only to their ten closest friends (but not to anyone else they know on a first-name basis). In the resulting "closest-friend" version of the social network, is it possible that for each pair of people in the world, there is a path of at most six edges connecting this pair of people? Explain.

Solution: In the described network, there will be a pair of people such that there is no path of at most six edges connecting them. Fix a person p in the network and consider the set of people who are within 6 steps of that person. This set is largest when the edges leaving p form a tree rooted at p, so that nobody is reached twice. Assuming directed edges, its size is then at most

$$1 + 10 + 10^2 + 10^3 + 10^4 + 10^5 + 10^6 = 1{,}111{,}111,$$

counting person p, the 10 people at distance 1, the 100 people at distance 2, and so on. This is far less than the world population (6 billion). Hence, such a graph cannot connect every two people by a path of at most 6 edges.

Problem 2.2 You are developing a protocol to establish a peer-to-peer overlay network among n nodes. This protocol operates as follows.

Step 1: Each node flips a coin (n-1) times to decide whether it generates an edge to each of the other (n-1) nodes. The probability of doing so is p. Links are assumed undirected, regardless of which side establishes them. If two nodes flip their corresponding coins and both decide to connect to each other, only one edge is created.

Step 2: After this is done, every node not yet connected selects another node at random and establishes a link to this node.

If you let $p = \log(n)/(2n)$, does this protocol establish a connected network for large n? (Hint: determine what small components exist after Step 1, and in particular, the number of isolated vertices.) What would your answer be if p was only $1/n$?

Solution: [We had originally thought of a different solution, but Stephen Dean Guo came up with the idea for the better one below.]

If each side can establish an edge with probability p, the probability of any given edge existing in the network is $2p - p^2$. We note that

$$2\,\frac{\log(n)}{2n} - \left(\frac{\log(n)}{2n}\right)^2 \approx \frac{\log(n)}{n}$$

when n tends to infinity, so we can assume that our graph is a $G(n, \log(n)/n)$, i.e., the probability of any edge being present is $\log(n)/n$ (henceforth we will call this p). You might remember that this is exactly the threshold for connectivity of a random graph, so the proof will be somewhat trickier than in any other case.

Some of you expressed concern over the theorem stating that for every $\epsilon > 0$ the Erdős-Rényi graph with $p = (1-\epsilon)\log(n)/n$ is disconnected. However, the $p^2$ term we neglected above cannot be viewed as that $\epsilon$, since $\epsilon$ is supposed to be a small CONSTANT greater than zero, and $p^2$ decreases with n.

Let $k_m$ be the expected number of disconnected components of size m. Given a subset of m nodes, they will be disconnected from the rest iff all $m(n-m)$ edges between them and the rest of the graph are missing. The probability of this happening is $\left(1 - \frac{\log(n)}{n}\right)^{m(n-m)}$. On the other hand, the probability that all m nodes form a single component can be bounded using Cayley's theorem (the number of different spanning trees on a set of m nodes is $m^{m-2}$).
Any connected component with m nodes will contain at least one spanning tree. Therefore we have the following chain of upper bounds:

$$P(m \text{ nodes are connected}) \le P(\text{there is a spanning tree}) \le \sum_{i=1}^{m^{m-2}} P(\text{spanning tree number } i \text{ is present}) = m^{m-2} p^{m-1}$$

where the second inequality comes from the union bound, and the last equality from the fact that all spanning trees have the same number of edges (m-1). Taking into account that there are $\binom{n}{m}$ possible subsets of m nodes, we finally get

$$k_m \le u_m = \binom{n}{m} \left(1 - \frac{\log(n)}{n}\right)^{m(n-m)} m^{m-2} p^{m-1}.$$

We found an upper bound for $k_m$, which we call $u_m$ for reasons that will become clear later. Massaging the above expression a bit and taking limits for large n, with $p = \log(n)/n$, we get

$$k_m \le u_m \approx \frac{n^m}{m!}\, m^{m-2}\, e^{-\log(n)\,m} \left(\frac{\log(n)}{n}\right)^{m-1} = \frac{m^{m-2}}{m!} \left(\frac{\log(n)}{n}\right)^{m-1}.$$

Hence, for large n, $k_1 = 1$ and $k_m = 0$ for all $m > 1$. Step 2 will take care of the isolated node, and the expected number of larger isolated components goes to zero.

Unfortunately, this is not yet enough to ensure that there will be no isolated components. Since the size of the possible components increases with n, we need to prove that their probability decreases fast enough so that $\sum_{i \ge 2} k_i$ goes to zero. [For example, if we had $k_m = 1/n$ for every m, then the expected number of isolated components of size m would tend to 0 for each m, but the expected number of isolated components of any size, summed over the n possible sizes, would be 1!]

We know that $\sum_{i \ge 2} k_i \le \sum_{i \ge 2} u_i$. Let's find the ratio between $u_{m+1}$ and $u_m$ when n tends to infinity:

$$\frac{u_{m+1}}{u_m} = \frac{\binom{n}{m+1}\left(1-\frac{\log(n)}{n}\right)^{(m+1)(n-m-1)} (m+1)^{m-1}\, p^{m}}{\binom{n}{m}\left(1-\frac{\log(n)}{n}\right)^{m(n-m)} m^{m-2}\, p^{m-1}} = \frac{n-m}{m+1} \cdot \frac{(m+1)^{m-1}}{m^{m-2}} \left(1-\frac{\log(n)}{n}\right)^{n-2m-1} \frac{\log(n)}{n} \approx \left(\frac{m+1}{m}\right)^{m-2} \frac{\log(n)}{n}$$

Thus, the expected number of isolated components of size m decreases as $\log(n)/n$ with each increment of m. Neglecting the constants, we can then bound the sum as:

$$\sum_{i \ge 2} k_i \le \sum_{i \ge 2} u_i \le k \sum_{i \ge 1} \left(\frac{\log(n)}{n}\right)^{i} = k\,\frac{\log(n)}{n} \sum_{i \ge 0} \left(\frac{\log(n)}{n}\right)^{i} = k\,\frac{\log(n)}{n - \log(n)},$$

where k is a constant. This bound tends to zero as n tends to infinity.
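The bound $u_m$ can also be checked numerically. The sketch below is only an illustration, not part of the original solution: it evaluates $u_m = \binom{n}{m}(1-p)^{m(n-m)} m^{m-2} p^{m-1}$ with $p = \log(n)/n$ in log space to avoid overflow, and sums it over component sizes from 2 up to n/2. The function names and the upper limit n/2 are assumptions made here for the example.

```python
import math


def log_u(n: int, m: int) -> float:
    """Natural log of u_m = C(n, m) (1 - p)^(m (n - m)) m^(m - 2) p^(m - 1), p = log(n) / n."""
    p = math.log(n) / n
    log_binom = math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
    return (
        log_binom
        + m * (n - m) * math.log1p(-p)  # all m (n - m) outgoing edges missing
        + (m - 2) * math.log(m)         # Cayley: m^(m - 2) spanning trees
        + (m - 1) * math.log(p)         # each tree needs its m - 1 edges present
    )


def u(n: int, m: int) -> float:
    """The bound u_m itself (underflows quietly to 0.0 when it is tiny)."""
    return math.exp(log_u(n, m))


def bound_on_larger_components(n: int) -> float:
    """Sum of u_m for 2 <= m <= n // 2 (the upper limit is an assumption of this sketch)."""
    return sum(u(n, m) for m in range(2, n // 2 + 1))


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5):
        print(n, round(u(n, 1), 4), bound_on_larger_components(n))
```

For growing n, the value of `u(n, 1)` should stay close to 1 (the isolated nodes that Step 2 handles), while the sum over larger sizes shrinks roughly like $\log(n)/n$.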

Finally, let's study the case of $p = 1/n$. Each edge is then present with probability $2p - p^2 \approx 2/n$. Given any two nodes, the probability that they are disconnected from the rest and connected to each other is approximately

$$\frac{2}{n}\left(1 - \frac{2}{n}\right)^{2(n-2)},$$

which is always larger than $\frac{2}{n} e^{-4}$. This probability tends to zero, but since the number of possible pairs increases with the number of nodes as $O(n^2)$, a constant fraction of the nodes will form isolated pairs (which Step 2 will not reconnect).

Problem 2.3 Generate a dataset of 1 million values following a power-law distribution with exponent 2.5. Then compute experimentally the exponent of the distribution, using the following 4 methods. Refer to "Power-law distributions in empirical data" by Clauset, Shalizi and Newman for how to generate random numbers from a power-law distribution.

a) Fitting a line to the frequency distribution.
b) Fitting a line to the frequency distribution with logarithmic binning.
c) Using the complementary CDF.
d) Using the maximum likelihood estimate.

Solution:

[Figure 1: Plots for exponent estimation. Panels: log-log plot of frequency; log-log plot with logarithmic binning; log-log plot of CDF; logarithmic binning + CDF.]

The data is generated by drawing a vector r of $10^6$ numbers uniformly from [0, 1] and applying the transformation $x = (1 - r)^{-2/3}$. We work with the continuous model in this problem; the calculation for the discrete model is very similar. See Figure 1 for the plots.

(a) By setting bins of constant width and doing linear regression of the frequencies in the log-log scale we get α = 0.94. The problem is that in the tail there are a lot of empty bins, so the linear regression fits a flat line.

(b) Let the bins be logarithmically spaced, so that the endpoints of bin i grow geometrically with i. We count the frequency in each bin and normalize it by the width of the bin. Now by linear regression in the log-log scale we get α = .7895. The noise in the tail bins is not negligible. If we use only the first 60 bins for the regression, then the answer is very accurate (α = 2.507). It should also be noted that if the counts for each bin are not normalized, we get a better estimate, α = 2.364. This is one of the weird effects of those empty bins.

(c) Here we compute the complementary CDF, do the regression in the log-log scale, and increment the resulting α by 1. If constant-width bins are used as in (a), we get α = 2.3533. If logarithmic binning is used, then α = 2.4567.

(d) Using the maximum likelihood estimate we get

$$\alpha = 1 + n\left[\sum_{i=1}^{n} \ln\frac{x_i}{x_{\min}}\right]^{-1} = 2.4983.$$

Problem 2.4 Consider the following evolving model for generating an undirected graph. Initially there are only three nodes connected into a triangle. At every time step, an edge of the current network is selected uniformly at random, and a new node is added to the network that links to both endpoints of the edge. Prove that $p_k$, the fraction of nodes with degree k, follows a power law with exponent 3. Provide an intuitive explanation as to why this model is the same as the preferential attachment model.

Solution: Let $d_i(t)$ denote the degree of node i at time t. Node i only gets a new edge at time t+1 if one of its edges is picked. Since the network has $3 + 2t$ edges at time t, the expected value of $d_i(t+1)$ will be:

$$E[d_i(t+1)] = d_i(t)\left(1 + \frac{1}{3 + 2t}\right).$$

We can then approximate

$$\frac{\partial d_i(t)}{\partial t} \approx \frac{d_i(t)}{3 + 2t}.$$

Solving the differential equation with the initial condition that $d_i(i) = 2$ we obtain

$$d_i(t) = 2\left(\frac{3 + 2t}{3 + 2i}\right)^{1/2}.$$

Just as we did in class, we can now find which nodes have degree higher than k at time t:

$$i \le \frac{2}{k^2}(3 + 2t) - \frac{3}{2}.$$

At time t there are 3+t nodes in the network, so the desired fraction is

$$\frac{2(3 + 2t)}{(3 + t)\,k^2} - \frac{3}{2(3 + t)}.$$

This expression can be considered the complementary cdf (cumulative distribution function) of the degrees at time t, since it gives the fraction of nodes with degree above k.
Differentiating with respect to k, taking the absolute value, and letting t tend to infinity, we get the asymptotic probability distribution:

$$p_k \approx 8\,k^{-3}$$
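The growth model is easy to simulate, which gives a rough check on the $k^{-3}$ law. The following is a minimal sketch under stated assumptions, not the course's code: the function names, the seed, and the number of steps are all chosen here for illustration. It grows the network from the initial triangle and reports the fraction of nodes with degree at least k next to the mean-field value $4/k^2$ implied by the derivation above.

```python
import random


def grow_network(steps, seed=0):
    """Grow the graph from a triangle; return (degree list, edge list)."""
    rng = random.Random(seed)
    edges = [(0, 1), (1, 2), (0, 2)]
    degree = [2, 2, 2]
    for _ in range(steps):
        u, v = edges[rng.randrange(len(edges))]  # edge chosen uniformly at random
        w = len(degree)                          # label of the new node
        degree.append(2)
        degree[u] += 1
        degree[v] += 1
        edges.append((w, u))
        edges.append((w, v))
    return degree, edges


def fraction_with_degree_at_least(degree, k):
    """Empirical fraction of nodes whose degree is at least k."""
    return sum(1 for d in degree if d >= k) / len(degree)


if __name__ == "__main__":
    degree, _ = grow_network(200_000)
    for k in (2, 4, 8, 16, 32):
        print(k, fraction_with_degree_at_least(degree, k), 4 / k**2)
```

The empirical tail should fall off as $k^{-2}$, matching a density that decays as $k^{-3}$. The constant in front need not match the mean-field value exactly, because the derivation replaces random degrees by their expectations.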

This model is the same as preferential attachment because in both cases the probability that a node gets a new edge is proportional to its current degree: a node of degree d is an endpoint of d edges, so an edge picked uniformly at random touches it with probability proportional to d.

Problem 2.5 In this exercise we will study the distribution of words in the English language. The data consists of a list of all the words in a dictionary and a text version of A Tale of Two Cities by Charles Dickens (found at Project Gutenberg). In the latter, we have removed punctuation, apostrophes, etc., keeping only the 26 characters of the alphabet and the space.

(a) Write a program that reads the list of words provided and plots a graph showing the number of words that exist with lengths between 3 and 8 (you can discard all other words). How fast does this number increase?

(b) Using the novel A Tale of Two Cities as a representative sample, we now plot how frequently each word is used in the English language. Sort the words in the novel along the x axis from the most frequent to the least, and plot their number of appearances (many words in the dictionary will not be in the novel; you should not take those into account). Does it follow a power law? If so, find an approximation for the exponent.

If you looked further into the previous plot, you would see that the most frequent words are usually shorter. We now develop models to explain why, if long words are more numerous in the dictionary, authors use short ones more often.

(c) Assume that a monkey typed one billion ($10^9$) random characters on a keyboard (26 letters + space bar), and call a word any sequence of letters between two spaces. Find f(n), the expected number of times that a GIVEN sequence of length n would appear in the monkey's text (with spaces at both sides). Does f(n) follow a power law? If so, find an approximation for the exponent.

(d) On average, how many times would the 100-th most frequent word appear in the monkey's text? What about the 1000-th? (Hint: how long would those words be? Either simulate it or find an analytic expression.) Is this a good model for the results in (b)?

(e) We will try to further improve the model by assigning different probabilities to different characters. Find the probability of each character (including space) in A Tale of Two Cities and generate ten thousand words according to that distribution. Repeat the plot in part (b) for this new text. Is the model better?
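Parts (a) and (b) come down to two counting passes plus a line fit. The sketch below shows one possible way to organize them. The function names are hypothetical, the inputs are assumed to be already cleaned (lowercase letters and spaces only), and the slope is fitted by ordinary least squares in log-log scale, which is just one of several ways to estimate an exponent.

```python
import math
from collections import Counter


def length_histogram(words, shortest=3, longest=8):
    """Number of distinct words of each length in [shortest, longest]."""
    counts = Counter(len(w) for w in set(words))
    return {n: counts.get(n, 0) for n in range(shortest, longest + 1)}


def rank_frequency(text):
    """Word counts sorted from the most frequent word to the least frequent."""
    return sorted(Counter(text.split()).values(), reverse=True)


def loglog_slope(values):
    """Least-squares slope of log(value) against log(rank); needs at least 2 values."""
    xs = [math.log(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the dictionary and the novel.
    print(length_histogram(["cat", "dogs", "a", "elephant", "cat"]))
    print(rank_frequency("it was the best of times it was the worst of times"))
```

With the real data, `loglog_slope(rank_frequency(novel_text))` would give the exponent estimate asked for in part (b), where `novel_text` stands for the cleaned text of the novel.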

Solution:

(a) The number of words of a given length increases linearly between 3 and 8.

[Figure: number of words versus word length, for lengths 3 to 8.]

(b) Yes, it follows a power law, approximately with exponent -1.

[Figure: frequency versus rank plot for part (b).]

(c) Using the union bound, we get $f(n) = \frac{10^9}{27^{n+2}}$. Rigorously speaking, it would be slightly smaller, since this is just an upper bound. It does not decrease according to a power law, but exponentially, as is clear from the previous expression.

(d) On average, any two-letter word will be more frequent than any three-letter one, while two words with the same number of characters have the same chances of appearing. Therefore, the first 26 most frequent words will be 1-character ones. Then we will have the $26^2$ two-letter ones, each of which will appear roughly f(2) times; the 100th most frequent word is one of them. Finally, the 1000th most frequent word will have three characters, and appear with a frequency of f(3).

It is not a good model for our data. It is too step-like. Although it is true that the two exponentials cancel each other (increasing number of words and decreasing frequency), giving a power law, it does not capture the progressive descent that we observed in (b).

(e) The model does improve. But there is still a large number of words that appear just once. By increasing the length of the randomly generated text we could improve the precision at the tail.
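The step-like behaviour in parts (c) and (d) can be made concrete with a few lines of arithmetic. The sketch below assumes 27 equally likely keys, so that a given word of length n flanked by spaces has probability $27^{-(n+2)}$ at any position, and it assumes that words are ranked purely by length. The constant and function names are illustrative choices, not taken from the original solution.

```python
ALPHABET = 26        # letters on the monkey's keyboard
KEYS = ALPHABET + 1  # letters plus the space bar
TEXT_LENGTH = 10**9  # characters typed


def expected_count(n):
    """Approximate expected occurrences of one given word of length n."""
    return TEXT_LENGTH / KEYS ** (n + 2)


def word_length_at_rank(rank):
    """Length of the rank-th most frequent word when shorter words always rank first."""
    length, covered = 1, ALPHABET
    while rank > covered:
        length += 1
        covered += ALPHABET ** length
    return length


if __name__ == "__main__":
    for rank in (100, 1000):
        n = word_length_at_rank(rank)
        print(rank, n, round(expected_count(n), 1))
```

Under these assumptions the 100th word has two letters and appears about 1,882 times, while the 1000th has three letters and appears about 70 times, which is the staircase described in (d).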