Probability Theory for Networks (Part 1)
CS 249B: Science of Networks
Week 02: Monday, 02/04/08
Daniel Bilar, Wellesley College, Spring 2008
Review

We saw some basic metrics that help characterize a network graph G(N,E), with N the number of nodes and E the number of edges.

More metrics:
- In/out degree k_i of vertex i: how many edges go into/out of a particular vertex i?
- Clustering coefficient C_i of vertex i: how close is the neighborhood of vertex i to being a clique?
- Geodesic distance geo_{i,j} between vertices i and j: the shortest path connecting i and j. We may also look at <l>, the average geodesic distance between vertex pairs in G.
- Diameter diam(G) of the graph: the longest geodesic path through the network.
More questions

Are there quantifiable relationships between N, E, k_i, C_i, <l>, diam(G), and geo_{i,j}? We figured out two already: recall that for undirected graphs, E <= N(N-1)/2 and Sum_i(k_i) = 2E.

Some further questions:
1. Can one estimate the fraction of nodes having degree k and above in a network G?
2. What affects the average clustering coefficient C? Is it N? Is it k? Both? Neither?

We will need some knowledge of probability theory and statistics to formulate these questions and understand the answers.
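The two identities above can be checked directly. A minimal sketch in plain Python, using a small hypothetical edge list (the graph, names, and values are illustrative, not from the slides):

```python
# Verify E <= N(N-1)/2 and Sum(k_i) = 2E on a small undirected simple graph.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]  # hypothetical graph: 4 nodes, 4 edges
N = 4
E = len(edges)

# Degree of each vertex = number of incident edges.
degree = {v: 0 for v in range(N)}
for u, v in edges:
    degree[u] += 1
    degree[v] += 1

assert E <= N * (N - 1) // 2           # at most "N choose 2" edges
assert sum(degree.values()) == 2 * E   # each edge contributes to two degrees
print(degree)  # {0: 2, 1: 2, 2: 3, 3: 1}
```

Each edge is counted once at each endpoint, which is exactly why the degree sum is 2E.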
Learning goals for the coming lectures
- Understanding what a probability distribution and a cumulative distribution function mean
- Probability terms: Experiment, Outcome, Sample Space, Random Variable, PDF (PMF), CDF
- Understanding a log-log plot
- Understanding the pertinence of probabilities for our study of networks
Probability

A measure of how likely it is that some event will occur.

Terms used:
- Experiment: process of obtaining an outcome
- Outcome: some result, something happening
- Sample space: set of all possible outcomes
- Event: subset of the sample space; what we are interested in

Experiment                                  | Sample space                               | An event
Toss 2 fair coins                           | HH, HT, TH, TT (4 outcomes)                | Different faces
Select 1 card from a standard deck          | 2, 2, ..., A (52 outcomes)                 | A queen
Select nodes of degree k in a simple graph  | 0, 1, 2, ..., k_max (k_max + 1 outcomes)   | Fraction of nodes with k_i > 4

Note: there are two major schools of interpretation of probability:
1. Frequentist (objective, physical): the probability of an event is the limit of its relative frequency in a large number of experiments.
2. Bayesian (subjective, evidentiary): probability measures belief in a statement, which can be updated as new data comes along.

We'll interpret probabilities the frequentist way most of the time.
Constructing a sample space

Experiment: flip 3 coins. Branching on each flip (H_1/T_1, then H_2/T_2, then H_3/T_3) gives the sample space, the set of all possible outcomes:

{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

What is the fraction of outcomes that have just one tail? 3 out of the 8 equally likely, mutually exclusive outcomes (HHT, HTH, THH), so the probability is P(E = {one tail}) = 3/8.
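The enumeration above can be reproduced mechanically; a short sketch in Python (my choice of language, not the slides'), building the sample space and counting the event "exactly one tail":

```python
from itertools import product

# Sample space for flipping 3 coins: all 2^3 = 8 ordered outcomes.
sample_space = list(product("HT", repeat=3))  # ('H','H','H'), ..., ('T','T','T')

# Event: outcomes with exactly one tail.
event = [o for o in sample_space if o.count("T") == 1]

p = len(event) / len(sample_space)
print(len(sample_space), len(event), p)  # 8 3 0.375
```

This frequentist-style count works because the outcomes are equally likely and mutually exclusive.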
Random Variables (RV)

A random variable assigns a numerical value to each outcome of an experiment; the RV takes on different possible values corresponding to the experiment's outcomes.

Example: tossing two coins. Sample space = {TT, HT, TH, HH}; assign the random variable X the values 1, 2, 3, 4 for these four outcomes respectively.
Examples of Random Variables

Let X be a random variable. What will X represent, and what are the possible values of X in the following?
1. Fill level of a standard bottle of Absolut: 750 ml +/- 5 ml, so X can be any real number from 745 to 755 ml. Continuous RV.
2. Lifetime (in 1000s of hours) of a light bulb: X can be any real number from 0 to (almost) infinity. Continuous RV.
3. Degree of a vertex in a simple graph with n vertices: X can be any integer in [0, n-1]. Discrete RV.
Random Variables

Discrete random variable:
- A limited set of possible values can be assumed
- Very often counts (i.e., numbers of things: degrees, potatoes)
- Finite or (countably) infinite number of possible values

Continuous random variable:
- Can assume all values in some interval of the real number line
- Very often measurements (i.e., extents of things: height, weight)
- (Uncountably) infinite number of possible values

A specific value of a random variable can (almost) never be predicted with certainty, but the distribution of possible values can be known, i.e., the likelihood that the RV X takes on certain values.
Distribution of a random variable

Let's simulate measuring 500 bottles of Absolut and plot the frequency count vs. the measured fill level. RV X = measured fill level.

Measurement results (a measurement exactly on a bin boundary is counted in the lower bin):

ml   Range        Frequency
700  [0, 700]     3
710  (700, 710]   6
720  (710, 720]   25
730  (720, 730]   40
740  (730, 740]   67
750  (740, 750]   102
760  (750, 760]   99
770  (760, 770]   83
780  (770, 780]   45
790  (780, 790]   25
800  (790, inf)   5

[Figure: frequency plot of Absolut vodka fill levels, frequency count (0-120) vs. ml]

If we normalize the frequency counts (divide by 500) so that they sum to 1, we get a discrete probability distribution p(x) for X.
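The normalization step can be sketched in a few lines of Python, using the frequency counts from the table above (the simulated data is the slide's, the code is illustrative):

```python
# Turn raw frequency counts into a discrete probability distribution p(x).
freq = {700: 3, 710: 6, 720: 25, 730: 40, 740: 67, 750: 102,
        760: 99, 770: 83, 780: 45, 790: 25, 800: 5}

total = sum(freq.values())                 # 500 measurements
p = {ml: n / total for ml, n in freq.items()}

assert total == 500
assert abs(sum(p.values()) - 1.0) < 1e-12  # probabilities sum to 1
print(p[750])  # 0.204
```

Dividing every count by the total is all that "normalizing" means here: the shape of the plot is unchanged, only the vertical scale becomes probability.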
Probability Distributions

A probability distribution describes the probabilities of the values a random variable could take.

Examples:
- Coin flipping: P(X = x) = 1/2 for x in {0, 1}, 0 otherwise
- Tossing a die: P(W = w) = 1/6 for w in {1, 2, 3, 4, 5, 6}, 0 otherwise

[Figure: bar plots of the two distributions; bars of height 0.5 at x = 0, 1, and bars of height 1/6 at w = 1, ..., 6]
PMF (Probability Mass Function)

RVs can be of two types: discrete or continuous. A discrete random variable has a probability mass function (pmf). The pmf p(r) of a discrete RV X is defined as

p(r) = P(X = r), for r = 0, 1, 2, ...

where
1. each value occurs with probability between 0 and 1: 0 <= p(r) <= 1 for every r;
2. the probabilities over all values sum to 1: Sum_r p(r) = 1.
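The two defining conditions translate directly into a validity check. A minimal sketch (the helper `is_valid_pmf` is mine, not from the slides), applied to the die and coin examples:

```python
def is_valid_pmf(p):
    """Check the two pmf conditions: 0 <= p(r) <= 1, and values sum to 1."""
    return (all(0 <= v <= 1 for v in p.values())
            and abs(sum(p.values()) - 1.0) < 1e-12)

die = {w: 1 / 6 for w in range(1, 7)}   # fair die
coin = {0: 0.5, 1: 0.5}                 # fair coin
print(is_valid_pmf(die), is_valid_pmf(coin))  # True True
```

A small floating-point tolerance is used for the sum-to-1 check, since values like 1/6 are not exactly representable.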
PDF (Probability Density Function)

A continuous random variable has a probability density function (pdf); such an X takes on an infinite number of values. Mathematically, X is a continuous RV if there is a function f, called the probability density function (pdf) of X, that satisfies:
1. f(x) >= 0 for all x
2. the integral of f(x) dx over all x equals 1

The pdf f(x) is NOT a probability measure (unlike the pmf)! Since f(x) is defined over an infinite number of points in a continuous interval, the probability at any single point is always zero; probabilities are measured over intervals, not single points. Let's see this in the next slide.
Ex: Measuring a metal cylinder

Suppose that the diameter (in mm) of a metal cylinder has the pdf

f(x) = 1.5 - 6(x - 50.0)^2  for 49.5 <= x <= 50.5
f(x) = 0                    elsewhere

Check whether this is a valid pdf (i.e., it integrates to 1):

Integral from 49.5 to 50.5 of (1.5 - 6(x - 50.0)^2) dx
  = [1.5x - 2(x - 50.0)^3] evaluated from 49.5 to 50.5
  = (1.5 * 50.5 - 2(50.5 - 50.0)^3) - (1.5 * 49.5 - 2(49.5 - 50.0)^3)
  = 75.5 - 74.5
  = 1.0
Measuring a metal cylinder (cont.)

What is the probability that a metal cylinder has a diameter between 49.8 and 50.1 mm? Integrate the pdf over the interval [49.8, 50.1]:

P(49.8 <= X <= 50.1) = Integral from 49.8 to 50.1 of (1.5 - 6(x - 50.0)^2) dx
  = [1.5x - 2(x - 50.0)^3] evaluated from 49.8 to 50.1
  = (1.5 * 50.1 - 2(50.1 - 50.0)^3) - (1.5 * 49.8 - 2(49.8 - 50.0)^3)
  = 75.148 - 74.716
  = 0.432

The probability is ~43%.
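The same probability can be checked numerically. A sketch using a simple midpoint-rule integrator (the integration helper is mine; the pdf is the cylinder example from the slides):

```python
def f(x):
    """pdf of the cylinder diameter: 1.5 - 6(x - 50.0)^2 on [49.5, 50.5]."""
    return 1.5 - 6 * (x - 50.0) ** 2 if 49.5 <= x <= 50.5 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

p = integrate(f, 49.8, 50.1)
print(round(p, 3))  # 0.432
```

The numerical answer matches the exact antiderivative computation above, which is a useful sanity check when an antiderivative is hard to find by hand.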
Cumulative Distribution Function

If we accumulate the area under a pdf, we get the so-called Cumulative Distribution Function (CDF):

F(x) = P(X <= x) = Integral from -infinity to x of f(y) dy

and conversely f(x) = dF(x)/dx. For an interval,

P(a <= X <= b) = Integral from a to b of f(y) dy

where f(y) is the pdf of a continuous RV (hence the integral) and F(x) is the CDF.
CDF of a continuous RV

Example: the pdf f(x) of the metal cylinder diameter,

f(x) = 1.5 - 6(x - 50.0)^2 for 49.5 <= x <= 50.5, and 0 elsewhere,

has CDF F(x) = 1.5x - 2(x - 50.0)^3 - 74.5 on [49.5, 50.5] (F(x) = 0 below, 1 above).

Probability the diameter is between 49.7 and 50.0 mm:

P(49.7 <= X <= 50.0) = F(50.0) - F(49.7)
  = (1.5 * 50.0 - 2(50.0 - 50.0)^3 - 74.5) - (1.5 * 49.7 - 2(49.7 - 50.0)^3 - 74.5)
  = 0.5 - 0.104
  = 0.396

so ~39.6%. Note that P(X <= 50.0) = 0.5 and P(X <= 49.7) = 0.104.
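This F(b) - F(a) pattern is how interval probabilities are computed in practice once a closed-form CDF is known. A sketch for the cylinder example:

```python
def F(x):
    """Closed-form CDF of the cylinder diameter: F(x) = 1.5x - 2(x-50)^3 - 74.5."""
    if x < 49.5:
        return 0.0
    if x > 50.5:
        return 1.0
    return 1.5 * x - 2 * (x - 50.0) ** 3 - 74.5

# P(a <= X <= b) = F(b) - F(a); no integration needed at query time.
p = F(50.0) - F(49.7)
print(round(p, 3))  # 0.396
```

Evaluating the CDF twice replaces an integral per query, which is why CDFs are tabulated for common distributions.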
CDF of a discrete RV

Example: PMF of the cost X (in $) of a machine breakdown:

P(X = 50) = 0.3,  P(X = 200) = 0.2,  P(X = 350) = 0.5,  P(X = x) = 0 otherwise

The CDF is the running sum F(x) = P(X <= x) = Sum over y <= x of P(X = y), a step function:

F(x) = 0    for x < 50
F(x) = 0.3  for 50 <= x < 200
F(x) = 0.5  for 200 <= x < 350
F(x) = 1.0  for x >= 350
Some RV calculations

The mean E(X) is the expected (average) value:
  Continuous: E(X) = Integral of x f(x) dx
  Discrete:   E(X) = Sum over x of x p(x)

The variance Var(X) is the expected squared distance from the mean:
  Var(X) = E[(X - E(X))^2]

The standard deviation is sigma = sqrt(Var(X)).
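For a discrete RV, these formulas are short sums over the pmf. A sketch using the fair-die pmf from an earlier slide (the helper functions are mine):

```python
import math

def mean(p):
    """E(X) = sum over x of x * p(x) for a discrete pmf given as a dict."""
    return sum(x * px for x, px in p.items())

def var(p):
    """Var(X) = E[(X - E(X))^2]."""
    m = mean(p)
    return sum((x - m) ** 2 * px for x, px in p.items())

die = {w: 1 / 6 for w in range(1, 7)}
m, v = mean(die), var(die)
print(round(m, 4), round(v, 4), round(math.sqrt(v), 4))  # 3.5 2.9167 1.7078
```

For the fair die this recovers the familiar values E(X) = 3.5 and Var(X) = 35/12.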
For next time: review the notes and start reading the book.