EE 178 Probabilistic Systems Analysis Spring 2018 Lecture 6 Random Variables: Probability Mass Function and Expectation

Probability Mass Function

When we introduced the basic probability model in Note 1, we defined three things: 1) the basic random variables; 2) the sample space Ω consisting of all the possible outcomes of the experiment; 3) the probability of each of the outcomes. Usually, a probability model consists of multiple random variables. If we want to focus on just one of the random variables, two things about it are important: 1) the set of values that it can take; 2) the probabilities with which it takes on those values.

Let a be any number in the range of a random variable X. Since X = a is an event, we can talk about its probability, P(X = a). The collection of these probabilities, for all possible values of a, is known as the probability mass function or distribution of the r.v. X.

Definition 6.1 (probability mass function or distribution): The probability mass function (or distribution) of a random variable X is the collection of values {(a, p_X(a) = P(X = a)) : a ∈ A}, where A is the set of all possible values taken by X.

The probability mass function of a random variable can be computed from the probabilities of the outcomes. For example, consider the experiment with two independent rolls of a die. Let X be the result of the first roll, and Y be the result of the second roll. The sample space, shown in Figure 1, has 36 outcomes, each with probability 1/36. Then P(X = 3) is simply the sum of the probabilities of the outcomes in the 3rd column, and P(Y = 2) is the sum of the probabilities of the outcomes in the 2nd row. Hence:

p_X(a) = P(X = a) = 1/6, a = 1, ..., 6,
p_Y(b) = P(Y = b) = 1/6, b = 1, ..., 6.

Note that the collection of events X = a, a ∈ A, satisfies two important properties:

- any two events X = a_1 and X = a_2 with a_1 ≠ a_2 are disjoint;
- the union of all these events is the entire sample space Ω.
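These computations are easy to check numerically. The following is a small sketch of ours (not part of the notes); `fractions.Fraction` is used only to keep the arithmetic exact. It computes p_X by summing outcome probabilities along the columns of the sample space:

```python
from collections import defaultdict
from fractions import Fraction

# The 36 equally likely outcomes (x, y) of two independent die rolls.
outcomes = [(x, y) for x in range(1, 7) for y in range(1, 7)]
p_outcome = Fraction(1, 36)

# p_X(a) = P(X = a): sum the outcome probabilities in the column X = a.
pmf_X = defaultdict(Fraction)
for x, y in outcomes:
    pmf_X[x] += p_outcome

print(dict(pmf_X))  # every value a = 1, ..., 6 gets probability 1/6
```

Summing `pmf_X.values()` gives exactly 1, since the events X = a are disjoint and their union is all of Ω.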
The collection of events thus forms a partition of the sample space. As a consequence, the sum of the probabilities P(X = a) over all possible values of a is exactly 1: when we sum up the probabilities of the events X = a, we are really summing up the probabilities of all the outcomes. In the dice rolling example, the events X = 1, X = 2, X = 3, X = 4, X = 5, X = 6 are the six columns of the 2-dimensional sample space and form a partition of it.

In the dice rolling example, the probability mass functions of X and of Y are computed from the probability assignment to the outcomes. But more often than not, things happen in reverse: a probability model is put
Figure 1: The sample space for the example of rolling two dice. The column, the row, and the diagonal correspond to the three events X = 3, Y = 2, and S = 4 respectively, where S = X + Y.

together by first specifying the probability mass functions of the individual random variables, and then specifying the probabilistic relationship between them. For example, the probabilistic model for the dice rolling example is built by assuming that X and Y each have the uniform probability mass function and that X and Y are independent.

Defining New Random Variables

A probability model is built by defining certain basic random variables. But more often than not, we want to ask questions about other random variables which are not among the basic ones but can be defined in terms of them. For example, the basic random variables for the dice rolling problem are X, the result of the roll of the first die, and Y, the result of the roll of the second die. But maybe we are not interested in these random variables individually but in their sum S = X + Y. Then S is a random variable in its own right. Just like the basic random variables X and Y, the newly defined random variable S has a probability mass function p_S(c) = P(S = c), and this can be computed. In our example,

P(S = 4) = ∑_{a=1}^{3} P(X = a, Y = 4 − a) = ∑_{a=1}^{3} p_X(a) p_Y(4 − a) = 3 · (1/36) = 1/12.

The event S = 4 corresponds to the diagonal subset of outcomes in Figure 1. The entire probability mass function is shown in the table below:

a        2     3     4     5    6     7    8     9    10    11    12
p_S(a)   1/36  1/18  1/12  1/9  5/36  1/6  5/36  1/9  1/12  1/18  1/36

The distribution of a general random variable X, whether it is a basic random variable or a random variable defined in terms of the basic random variables, can be visualized as a bar diagram, as shown in Figure 2. The x-axis represents the values that the random variable can take on; the height of the bar at a value a is the probability P(X = a).
Each of these probabilities can be computed by looking at the probability of the corresponding event in the sample space.
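The entire pmf of S can be recomputed the same way. As an illustration (a sketch of ours, not from the notes), each outcome (x, y) contributes its probability 1/36 to the event S = x + y, i.e. to one anti-diagonal of the sample space:

```python
from collections import defaultdict
from fractions import Fraction

# p_S(c) = P(S = c) for S = X + Y: each outcome (x, y) with x + y = c
# contributes its probability 1/36 to the event S = c.
pmf_S = defaultdict(Fraction)
for x in range(1, 7):
    for y in range(1, 7):
        pmf_S[x + y] += Fraction(1, 36)

print(pmf_S[4])  # 1/12, the diagonal event of Figure 1
print(pmf_S[7])  # 1/6, the most likely value of the sum
```

The printed values match the table for p_S, and the entries again sum to 1.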
Figure 2: Visualization of how the distribution of a random variable is defined. The bottom part of the figure refers to an example to be discussed in a later lecture.
Binomial Distribution

Suppose you have n balls and select k out of the n balls. How many such subsets of k balls exist? We denote the number of such subsets by (n choose k). To compute (n choose k) in terms of n and k, we consider the number of ways to rearrange the n items. This is given by n!, which equals n · (n−1) · · · 2 · 1. Another way to count the rearrangements is to first choose which k elements come first, in (n choose k) ways, and then rearrange within the first k elements and within the last n − k elements. Therefore, we should have

n! = (n choose k) · k! · (n − k)!,

and therefore

(n choose k) = n! / (k!(n − k)!).

The binomial distribution is one of the most important distributions in probability. It can be defined in terms of a coin-tossing experiment. Consider n independent tosses of a biased coin with Heads probability p. Define the random variable X_i = 1 if the ith toss is a Heads, and X_i = 0 otherwise. Let X be the number of Heads. Note that X can be defined in terms of the basic random variables X_i: X = X_1 + ... + X_n. To compute the distribution of X, we first enumerate the possible values X can take on. They are simply 0, 1, ..., n. Then we compute the probability of each event X = i for i = 0, ..., n. The probability of the event X = i is the sum of the probabilities of all the outcomes with i Heads. Any such outcome has probability p^i (1 − p)^(n−i), and there are exactly (n choose i) of these outcomes. So

P(X = i) = (n choose i) p^i (1 − p)^(n−i),  i = 0, 1, ..., n.  (1)

This is the binomial distribution with parameters n and p. A random variable with this distribution is called a binomial random variable (for brevity, we write X ~ Bin(n, p)). An example of a binomial distribution is shown in Figure 3.

Although we defined the binomial distribution in terms of an experiment involving tossing coins, this distribution is useful for modeling many real-world problems. Consider for example the problem of reliable data storage in the face of hard disk failure. The technology is called RAID. (See http://en.wikipedia.org/wiki/raid.)
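Formula (1) is easy to evaluate directly. Here is a minimal sketch of ours (the choice n = 10, p = 0.3 is just for illustration); Python's `math.comb` computes the binomial coefficient (n choose i):

```python
from math import comb

def binomial_pmf(n, p):
    # Formula (1): P(X = i) = (n choose i) * p^i * (1 - p)^(n - i).
    return [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]

pmf = binomial_pmf(10, 0.3)   # X ~ Bin(10, 0.3)
print(sum(pmf))               # events X = i partition the space: sums to 1 (up to rounding)
print(pmf.index(max(pmf)))    # 3, the most likely number of Heads
```

Note that the most likely value sits near np, consistent with the shape of the bar diagrams in Figure 3.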
Reliability is provided by adding redundancy and using error-correction coding: the data is distributed across n disks and can be recovered as long as no more than k disks fail. (The parameters n and k depend on the level of RAID used.) Assuming each disk fails independently with probability p, the number of disk failures X is binomially distributed with parameters n and p. So the probability of unrecoverability of the data is given by:

P(X > k) = ∑_{i=k+1}^{n} (n choose i) p^i (1 − p)^(n−i).

For a given value of p, we can choose k large enough that this probability is small, i.e. that the data is recoverable with probability at least, say, 0.99.
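The unrecoverability probability is just a binomial tail sum, so it can be tabulated against k. A sketch with made-up numbers (n = 10 disks and p = 0.05 are illustrative choices of ours, not RAID parameters from the notes):

```python
from math import comb

def p_unrecoverable(n, k, p):
    # P(X > k) for X ~ Bin(n, p): data is lost once more than k of the n disks fail.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1, n + 1))

# Tolerating even a couple of extra failures drives the loss probability down fast.
for k in range(4):
    print(k, p_unrecoverable(10, k, 0.05))
```

Running this shows the tail probability dropping by roughly an order of magnitude with each unit increase in k.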
Figure 3: The binomial distributions for two choices of (n, p).

Joint Probability Mass Functions

The pmf of a random variable X summarizes all the probabilistic information about it. When we have two random variables X and Y, the events of interest are X = a and Y = b for all possible values (a, b) that (X, Y) can take on. Thus, a natural generalization of the notion of pmf to multiple random variables is the following.

Definition 6.2 (joint pmf (distribution)): The joint pmf (distribution) of two discrete random variables X and Y is the collection of values {(a, b, p_{X,Y}(a, b) := P(X = a, Y = b)) : (a, b) ∈ A × B}, where A and B are the sets of all possible values taken by X and Y respectively.

This notion obviously generalizes to three or more random variables. In fact, the probability assignment to the outcomes of the sample space can be viewed as the joint pmf of all the basic random variables X_1, ..., X_n defining the probability model.

Just like the distribution of a single random variable, the joint distribution is normalized, i.e.

∑_{a ∈ A, b ∈ B} p_{X,Y}(a, b) = 1.

This follows from noticing that the events X = a, Y = b (where a ranges over A and b ranges over B) partition the sample space.

The joint distribution of two random variables fully describes their statistical relationship. Moreover, the individual distributions of X and Y can be recovered from the joint distribution as follows:

p_X(a) = ∑_{b ∈ B} p_{X,Y}(a, b),  a ∈ A,  (2)
p_Y(b) = ∑_{a ∈ A} p_{X,Y}(a, b),  b ∈ B.  (3)

The first follows from the fact that the events Y = b (where b ranges over B) form a partition of the sample space Ω, and so the events X = a, Y = b (where b ranges over B) are disjoint and their union yields the event X = a. The second fact follows for similar reasons. Pictorially, one can think of the joint distribution values as entries filling a table, with the columns indexed by the values that X can take on and the rows indexed by the values Y can take on (see Figure 4).
Figure 4: A tabular representation of a joint distribution.

To get the distribution of X, all one needs to do is sum the entries in each of the columns. To get the distribution of Y, just sum the entries in each of the rows. This process is sometimes called marginalization, and the individual distributions are sometimes called marginal distributions to differentiate them from the joint distribution.

Note that in general, the individual distributions of X and Y alone do not fully specify the joint distribution. However, in the special case when X and Y are independent, we have

p_{X,Y}(a, b) = p_X(a) p_Y(b).
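Marginalization amounts to the column and row sums of the table in Figure 4. A sketch of ours, using the two-dice joint distribution, which also checks the factorization property for independent random variables:

```python
from fractions import Fraction

# Joint pmf of two independent fair die rolls: a 6x6 table of 1/36's.
joint = {(a, b): Fraction(1, 36) for a in range(1, 7) for b in range(1, 7)}

# Marginalization, equations (2) and (3): column sums give p_X, row sums give p_Y.
p_X = {a: sum(joint[(a, b)] for b in range(1, 7)) for a in range(1, 7)}
p_Y = {b: sum(joint[(a, b)] for a in range(1, 7)) for b in range(1, 7)}

# For independent X and Y, the joint pmf factors into the marginals.
assert all(joint[(a, b)] == p_X[a] * p_Y[b]
           for a in range(1, 7) for b in range(1, 7))
print(p_X[1], p_Y[6])  # 1/6 1/6
```

For a joint distribution that does not factor this way, the final assertion would fail, which is exactly the statement that the marginals alone do not determine the joint distribution.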