Probability - Lecture 4

1 Introduction

Many methods of computational physics, and the comparison of data to a mathematical representation, apply stochastic methods. These ideas were first introduced in the theory of dilute gases, and later more fully developed in statistical mechanics. The application of statistical analysis is based on the observation that processes can be described by the laws of probability, and the laws of probability are based on empirical observation. Any conclusion which results from a measurement is uncertain, and we use imprecise data to determine the value of some parameter or to establish a hypothesis. There are two possibilities when applying probabilities.

1. We wish to determine which theory is most compatible with the data. Thus the theory is conditioned by credibility and aesthetics. We have previously discussed Occam's razor, for example.

2. We wish to determine which values of a parameter set are most compatible with a theory. In this case, the measurement is conditioned by the theory.

In case 1 above, the theory itself is determined by statistical analysis, so a firm conclusion cannot be made. In case 2, there is always some uncertainty in the measurement. In general, the application of statistics to analysis introduces uncertainty, either in the theory or in the parameters.

2 Definitions

The outcome of any observation can depend on many parameters, some measured and some unknown. In any experiment, one tries to hold as many parameters constant as possible, especially those which primarily affect the result. The fluctuation in parameters leads to uncertainty in the measurement and produces a range of results over a number of identical measurements. This range of values is called a statistical distribution. In a simple example, a perfect die is cast N times. Some of the unknowns are the way the die is held, thrown, and/or various environmental parameters. The resulting frequency of the integers that are observed gives a distribution of discrete sample points. This is illustrated in Figure 1.

[Figure 1: An example of a discrete probability distribution of a die thrown N times; each face 1 through 6 occurs about N/6 times.]

[Figure 2: An example of a continuous distribution, with N = ∫ dx n(x), obtained by measuring a length.]

On the other hand, suppose one obtains a distribution of measurements of the length of a rod. This distribution is continuous, and the area under the curve equals N. A continuous distribution is illustrated in Figure 2.

The word random is used to describe the fluctuations in the results of the processes described above. However, random is difficult to define. It is easiest to think in terms of the continuous distribution of a variant as in Figure 2, although the application to a discrete example will also be obvious. Define the probability as the limit of the ratio of the number of times a particular observation occurs to the total number of observations, as the number of observations goes to infinity. In this case, a uniform distribution has equal probabilities for each interval of the variant. Thus suppose an interval, ΔX, of the variant, X, where X is either discrete or continuous. Then the outcome of any observation which falls in the interval ΔX is described as random. Although the result of an individual outcome cannot be predicted, a statistical distribution of a set of observations can be modeled and predicted. Note that in the case of the discrete distribution, if the number of observations within every integer interval ΔX = (n + 1) − n is not equal, the distribution is not random.

The above definitions are a frequentist approach to probability. This is the classical understanding of probability and statistics, and is closer to scientific reasoning since it can be determined independently of the observer. It is related to the frequency of the occurrence of an event, but it is restricted to repeatable observations.
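As a small illustration of this frequentist limit (a sketch added here, not part of the original notes), the following Python fragment casts a fair die N times and watches the observed ratio n/N for one face approach 1/6 as N grows; the function name is arbitrary.

    import random

    def face_frequency(n_throws, face=3, seed=1):
        """Fraction of n_throws of a fair die that show the given face."""
        rng = random.Random(seed)
        hits = sum(1 for _ in range(n_throws) if rng.randint(1, 6) == face)
        return hits / n_throws

    # The ratio n/N should approach P = 1/6 = 0.1667 as N becomes large.
    for n in (10, 100, 10_000, 1_000_000):
        print(f"N = {n:>9}: n/N = {face_frequency(n):.4f}")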

There is another definition of probability, called the Bayesian approach or subjective probability. The Bayesian approach is interpreted as the degree of belief that an observation will occur. For hypothesis testing and parameter estimation, the numerical results are essentially the same as in the frequentist approach. However, an exact frequentist approach requires as input the probabilities of all data, including data observed and data not observed, which is obviously an impossibility. An exact Bayesian approach requires as input all prior beliefs of the observer, which is also not possible, and is subjective. For testing the goodness of fit of a single hypothesis, one cannot obtain results in the Bayesian approach. On the other hand, decision theory is subjective, and cannot be handled by the frequentist approach.

There is also confusion in the language of statistics. We apply the language of physics here, but that of the statistician is sometimes different. See Table 1 for examples.

Table 1: Language comparison of definitions between a physicist and a statistician

    Physics      Statistics
    Determine    Estimate
    Estimate     Guess
    Data         Sample
    Data Size    Population

Finally, the following ISO definitions are useful, as these are the defined standard.

Probability - Probability is the expectation of the occurrence of an event. It is a normalized number between 0 and 1. The number 0 means there is no chance of the event occurring, and the number 1 means that it will always occur.

Event - A labeled occurrence, written as X_i where i is the event number. If X denotes the events in a set and X̄ the events outside the set, then P(X) + P(X̄) = 1.

Uncertainty - Uncertainty is defined as the parameter associated with a measurement which characterizes the dispersion of values that could reasonably be attributed to the measurement. Note the word reasonably.

Error - Error is the result of a measurement minus the true value of the parameter. As such, it is not really known, since the true value of any measurement is unknown.

True Value - The true value is the exact value of a given parameter.

Mathematical Probability - The abstract mathematical concept defined by a set of laws or axioms (the Kolmogorov axioms). For a set of events X_i which are exclusive, i.e. if one occurs then no other event is possible, the axioms state:

1. P(X_i) ≥ 0

2. P(X_i or X_j) = P(X_i) + P(X_j)

3. Σ_i P(X_i) = 1

Frequentist Probability - For a set of n events which are observed during a total of N observations,

P(X) = lim_{N→∞} n/N

Obviously, this limit cannot actually be taken, but it does show that the frequentist approach requires repeatable experiments. Thus one cannot use the frequentist approach to predict whether the sun will rise tomorrow using the fact that it has risen for the last 4 million years. The reason the sun shines today is not why it shone yesterday.

Bayesian Probability - Bayesian probability is based on the belief that an event will occur. This is defined by what is called the coherent bet. The coherent bet is a number between 0 and 1 which an observer places that an event will occur. It has the properties that:

1. The bet depends as much on the observer as on the system under observation.

2. It depends on the knowledge of the observer, and can change as knowledge is gained.

3 Examples

The Physical Review has a new policy which is intended to increase the significance of the articles that are published. In a paper, experimenters fit their results using an incorrect hypothesis, but extract a parameter with better precision than previously.

The paper is published. Statistically the error is small, but the published result is completely incorrect.

The stock market generates volumes of data, and one can always find correlations if one selectively looks for them. Such back-testing models have been used by various money managers without knowledge of statistics. To demonstrate irresponsible data mining, one statistician tested correlations of the stock market with the annual butter production in Bangladesh, and reproduced the S&P return correctly 75% of the time. If he included the production of cheese in the US and the population of sheep in Bangladesh, he was correct 99% of the time over a 10 year period. Remember: Garbage in = Garbage out.

As a final example, suppose a test for the H1N1 virus has a probability of 100% of being positive if someone has the H1N1 virus, and a probability of 0.2% of being positive if someone does not have the virus. Does this mean a person with a positive test is 99.8% likely to have the virus? The answer is no. Look at this more closely. Suppose a population of 10^8 people and a chance of infection of 0.1% in this population. If everyone were tested, one would find 10^5 infections, and [10^8 − 10^5] × 0.002 ≈ 2 × 10^5 positive tests in uninfected people. Then the probability that a positive test is a result of infection is:

P(infection/positive test) = 10^5 / (10^5 + 2 × 10^5) = 0.33

This latter example requires an understanding of conditional probability, which is discussed in the next section.
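The H1N1 arithmetic above can be checked by direct counting. This short Python sketch (an addition to the notes; the variable names are illustrative) reproduces the 0.33 result:

    population = 10**8            # total number of people
    infection_rate = 0.001        # 0.1% of the population is infected
    false_positive_rate = 0.002   # 0.2% positive tests among the uninfected

    infected = population * infection_rate                            # 10^5
    false_positives = (population - infected) * false_positive_rate   # ~2 x 10^5

    # Probability that a positive test is the result of infection.
    p = infected / (infected + false_positives)
    print(p)  # ~0.33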

4 Combining Probabilities

The best way to understand how probabilities are combined is to visualize a combination of sets of elements. A pictorial diagram is shown in Figure 3, which illustrates the intersection and union of two sets A and B.

[Figure 3: The intersection of the sets, A ∩ B, and the union of these sets, A ∪ B.]

Probability is defined as a continuous measure on the interval between 0 and 1. If an event has no chance of occurring it has probability 0; if it always occurs it has probability 1. Success is the occurrence of an observation, and a success of event X is given a probability P(X). A failure is written P(X̄), so that, as previously, P(X) + P(X̄) = 1.

Then suppose two non-exclusive sets, such as A and B in the figure. The probability of an event occurring which is contained in either A or B is given by:

P(A or B) = P(A) + P(B) − P(A ∩ B)

Obviously the subtraction of P(A ∩ B) prevents double counting of this intersection. For a number n of mutually exclusive events X_i we have:

P(X) = Σ_i P(X_i)

In set theory:

P(X_1 ∪ X_2 ∪ ··· ∪ X_n) = Σ_i P(X_i)

As an example, suppose one asks for the probability of drawing a face card from a deck of cards, with replacement and reshuffling before each card is drawn. The probability of drawing any particular card is 1/52, and as there are 4 jacks, the probability of drawing a jack is 4/52. There are 3 face card values (jack, queen, king), so the probability of drawing a face card is (3 × 4)/52. The observation here is that drawing any card has the same probability, so the outcome of any draw is random. The probability of drawing the full set of cards is normalized to 1.

The conditional probability of the occurrence of an event A, given that event B has occurred, is written P(A/B). This is the probability that an event known to belong to the set B is also in the set A. It is defined by:

P(A and B) = P(A ∩ B) = P(A/B) P(B) = P(B/A) P(A)

The sets A and B are independent if:

P(A/B) = P(A)

The occurrence of B is then irrelevant to the occurrence of A. Thus if A is independent of B:

P(A ∩ B) = P(A) P(B)

The expression that links P(A/B) to P(B/A) above is Bayes theorem. It is stated:

P(A/B) = P(B/A) P(A) / P(B)

Then if the A_i are exclusive and exhaustive sets (i.e. each event must belong to one and only one of the sets A_i), Bayes theorem becomes:

P(A_i/B) = P(B/A_i) P(A_i) / Σ_i P(B/A_i) P(A_i)

The denominator is the normalization over all possibilities for the occurrence of B.

Now as an example, suppose an experiment is designed to look for decays of a particle. It is desired to know if one counter is sufficient to detect the decay in the presence of a background. Let:

P(B) be the probability of any event in the detector;
P(A) be the probability of the occurrence of the event which is sought;
P(B/A) be the probability that a true event gives a signal in the detector;
P(A/B) be the probability of a true event given that an event is observed.

P(B) is measured by turning on the detector to observe events. P(A) is assumed known from other experiments. P(B/A) is calculated from the measured efficiencies of the detector. Bayes theorem then determines P(A/B), the probability that is desired. However, when A or B is not a set of events but a hypothesis, the interpretation is not so clear. This case will be discussed later.
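The partitioned form of Bayes theorem translates directly into a few lines of code. The sketch below (mine, with hypothetical numbers) normalizes the products P(B/A_i) P(A_i) over an exclusive and exhaustive set of hypotheses:

    def bayes_posterior(priors, likelihoods):
        """Return P(A_i/B) for exclusive, exhaustive hypotheses A_i.

        priors      : list of P(A_i), summing to 1
        likelihoods : list of P(B/A_i)
        """
        joint = [p * l for p, l in zip(priors, likelihoods)]
        norm = sum(joint)  # P(B), the denominator in Bayes theorem
        return [j / norm for j in joint]

    # Hypothetical case: two equally likely hypotheses which produce a
    # signal with probability 0.9 and 0.2 respectively.
    print(bayes_posterior([0.5, 0.5], [0.9, 0.2]))  # [0.818..., 0.181...]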

5 Probability of a sequence of events

Suppose one wants the probability of a sequence of successes in a probability set, i.e. the probability of X_1 and X_2 and ... and X_N. Begin by looking for the probability that an event X_2 occurs after event X_1. This is that part of X_2 which is included in X_1, namely X_1 ∩ X_2.

P(X_1 ∩ X_2) = P(X_2/X_1) P(X_1) = P(X_1/X_2) P(X_2)

In the above, P(X_2/X_1) is the conditional probability that X_2 will occur after X_1 has occurred. This leads to the result:

P(X_2/X_1) = P(X_1 ∩ X_2) / P(X_1)

Note that this probability is normalized by the probability P(X_1), i.e. it is the probability that both X_1 and X_2 occur divided by the probability that X_1 occurs. If the occurrences of X_1 and X_2 are independent, then:

P(X_2/X_1) = P(X_1) P(X_2) / P(X_1) = P(X_2)

In the general case, when the occurrence of each event is independent:

P(X_T) = P(X_1) P(X_2) ··· P(X_N)

Consider the following example. We wish to find the probability of drawing a red jack from a randomly shuffled deck of cards. The fact that the card is a jack and that its color is red are independent. Thus:

P_T = P(drawing a jack) P(drawing a red card) = (4/52)(1/2) = 1/26

Obviously one could obtain this result by counting possibilities, using the probability of drawing one jack from a deck of only red cards, 2/52. This is a frequentist approach to obtaining the probability.

Now suppose the events are not mutually exclusive. Previously we found:

P(X_1 ∪ X_2) = P(X_1) + P(X_2) − P(X_1) P(X_2)

We use this to obtain the probability that, of 2 cards drawn with replacement from a randomly shuffled deck, at least one of the cards is a jack. The possibilities are shown in the probability tree in Table 2.

Table 2: The probability tree of drawing a jack on 2 separate draws from a random deck with replacement

    First Draw                  Second Draw
    Jack drawn (success)        Jack drawn (success)
    Jack drawn (success)        Jack not drawn (failure)
    Jack not drawn (failure)    Jack drawn (success)
    Jack not drawn (failure)    Jack not drawn (failure)

Obviously there are 4 possibilities, and three lead to success. We cannot simply add probabilities here, because drawing one jack is not mutually exclusive over the draws, i.e. the first event is included in both X_1 and X_2. The result is most easily obtained by subtracting the double failure, which is counted only once in Table 2, but we may also use the above equation to evaluate P(X_1 ∪ X_2). The probability of a failure to draw a jack is (1 − 4/52). A double failure to draw a jack is the square of this factor, so the probability of drawing at least one jack is:

P(at least 1 jack) = 1 − [1 − 4/52]² = (2/13)[1 − 1/26] = 25/169

Then using the union of X_1 and X_2 one obtains:

P(X_1 ∪ X_2) = P(X_1) + P(X_2) − P(X_1) P(X_2)

Substituting values:

P(at least 1 jack) = [4/52] + [4/52] − ([4/52][4/52]) = (2/13)[1 − 1/26] = 25/169

The last term is the probability of drawing 2 jacks, and contains the overlap of the 2 sets, X_1 and X_2.
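Both routes to the at-least-one-jack result can be verified with exact rational arithmetic (a sketch of mine using Python's fractions module):

    from fractions import Fraction

    p_jack = Fraction(4, 52)

    # Complement method: 1 minus the probability of two failures.
    p_complement = 1 - (1 - p_jack) ** 2

    # Union method: P(X1) + P(X2) - P(X1)P(X2) for independent draws.
    p_union = p_jack + p_jack - p_jack * p_jack

    print(p_complement, p_union)  # both 25/169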

6 Bayes theorem

We previously obtained an expression for the conditional probability of event X_2 when it occurs after event X_1. This was stated in Bayes theorem:

P(X_2/X_1) P(X_1) = P(X_1/X_2) P(X_2) = P(X_1 ∩ X_2)

Note that P(X_2/X_1) can be different from P(X_1/X_2). For example, the probability of a physics professor given all people wearing sandals is not the same as the probability of wearing sandals given all physics professors. Bayes theorem can be used to update beliefs (probabilities) using the axioms of combining probability. Return to the probability for drawing a jack. Let X_1 represent success on the first try and X_2 success on the second. Then:

P(X_2/X_1) = P(X_1 ∩ X_2) / P(X_1) = (4/52)(4/52) / (4/52) = 4/52

As expected, the probability of drawing a jack on the second draw is independent of what happened on the first draw, assuming replacement and reshuffling.

7 Model Likelihood

Now consider a model M with parameters w. We want to fit data by varying the parameters of the model. First determine the likelihood; i.e. given a set of parameters w and a model M, determine the probability of a data set D. This is written P(D/wM), the probability of the data given the model and parameters. Use Bayes theorem to write:

P(w/DM) = P(D/wM) P(w/M) / P(D/M)

What we are trying to find is the probability of the parameter set given the model and the data. The first term on the right is the likelihood, P(D/wM), and P(w/M) is the probability distribution of the parameters without the data. The denominator is the normalization:

P(D/M) = Σ_k P(D/w_k M) P(w_k/M)

The above equation relates a prior probability, P(w/M), to a posterior probability, P(w/DM), by conditioning the prior probability with the likelihood. In most cases one can ignore the normalization term, as we want the shape of the distribution.

As another example, you meet a friend in a pub and agree that the one with the lowest value on a roll of a die will pay for the beer. The friend always wins. What is the probability that he cheats? There are two hypotheses: 1) cheats (C), or 2) honest (H). Begin by assuming that the probability of cheating is low. Propose that a cheater always wins, so a win w has P(w/C) = 1, while honest play wins with probability P(w/H) = 1/2. Assume the die is rolled n times, so that P(w_n/H) = (1/2)^n. Then by Bayes theorem:

P(C/w_n) = P(w_n/C) P_0(C) / [P(w_n/C) P_0(C) + P(w_n/H) P_0(H)]

In the above, the P_0 are the initial assumptions. Thus:

P(C/w_n) = P_0(C) / [P_0(C) + 2^(−n) P_0(H)]

Running the probabilities by updating the priors (P_0) produces Table 3. Thus Bayesian probability allows a probability to be updated as events are collected. Note that the result after a number of tries becomes independent of the initial guess.

Table 3: The probability of cheating when throwing dice and always winning

    P_0(C)    P(C/w_n) in percent
              n = 5    n = 10    n = 15    n = 20
    0.01      24       91        99.7      99.99
    0.05      63       98        99.94     ≈ 100
    0.50      97       99.9      ≈ 100     ≈ 100
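Table 3 follows directly from the update formula. This sketch (my addition) assumes, as above, that a cheater wins with probability 1 and an honest player with probability 1/2:

    def p_cheat(prior_cheat, n):
        """Posterior probability of cheating after n straight wins."""
        prior_honest = 1 - prior_cheat
        return prior_cheat / (prior_cheat + 2**(-n) * prior_honest)

    for p0 in (0.01, 0.05, 0.50):
        row = [100 * p_cheat(p0, n) for n in (5, 10, 15, 20)]
        print(p0, [f"{x:.2f}" for x in row])  # percentages, cf. Table 3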

Finally, look at the probability of throwing two dice such that one shows the number 1 and, in addition, the sum of the numbers showing on the faces is odd. We can do this by the frequentist approach, using counting. Table 4 gives the 6 × 6 = 36 possibilities.

Table 4: Possible outcomes when rolling 2 dice

    11  12  13  14  15  16
    21  22  23  24  25  26
    31  32  33  34  35  36
    41  42  43  44  45  46
    51  52  53  54  55  56
    61  62  63  64  65  66

In Table 4, the pairs 12, 14, 16, 21, 41, and 61 (bold-faced in the original) are the possibilities having an odd sum of the face values with at least one die showing a 1. There are 6 possibilities out of a set of 36, so the probability is 1/6.

Let A be the event that the sum of the numbers on the dice is odd, with P(A) = 1/2. Let B be the event that at least one die shows a 1. Using Bayes theorem:

P(A ∩ B) = P(A/B) P(B) = P(B/A) P(A) = (6/11)(11/36) = 1/6

Also we find that:

P(A ∪ B) = (1/2) + 11/36 − 1/6 = 23/36

By counting, there are 36 possibilities, and 6 × 3 = 18 events where the sum of the faces is odd, while 11 have at least one ace (a face showing 1). Also, 6 events have one ace and an odd sum of the faces. Then P(A) = 18/36 and P(B) = 11/36.
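The counting behind Table 4 can be confirmed by enumerating all 36 outcomes (a short sketch of mine):

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))   # all 36 rolls
    A  = [o for o in outcomes if sum(o) % 2 == 1]     # odd sum: 18 outcomes
    B  = [o for o in outcomes if 1 in o]              # at least one 1: 11
    AB = [o for o in A if 1 in o]                     # both: 6 outcomes

    print(len(A), len(B), len(AB))   # 18 11 6
    print(len(AB) / len(outcomes))   # P(A and B) = 6/36 = 1/6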

8 Combinatorial analysis

Consider a set of N events, X_i, which are labeled X_1, X_2, ..., X_N. Choose k of these to form a population of events. Make the choice with replacement, so that each selection has N possibilities. The number of possible populations is N^k. If the choice is made without replacement, there are N possible first choices, (N − 1) second choices, etc. Thus there are N(N − 1)···(N − k + 1) possible populations. Thus we also see that the number of ways N elements can be reordered is N! (N elements chosen without replacement).

What is the probability that no event appears twice in a population (i.e. no repetition of elements occurs in a set)?

P = N(N − 1)···(N − k + 1) / N^k

As the above expression will be used many times in the future, introduce the notation:

P = (N)_k / N^k, where (N)_k = N(N − 1)···(N − k + 1)

As an example, random numbers were once selected from the final digits in numbers extracted from a set of mathematical tables with many decimal places. You may test whether 10 such numbers are random by comparing with the probability that three (or any reasonable number) of digits are different when selecting a number of sequential digits from some table. One could also select numbers from a non-repeating fraction or some irrational number like π or e. The probability that one obtains 3 different digits is:

P_3 = (10 · 9 · 8) / 10^3 = 0.72

To perform a sequence test for randomness, consider the number e = 2.71828... The first 800 decimals form 160 groups of 5 digits each. Arrange these into 16 groups of ten 5-digit numbers. The number of subgroups in each of the 16 groups which have all 5 digits different is shown in Table 5.

Table 5: The number of subgroups which have all 5 digits different, as selected from the sample of 160 groups of 5 digits extracted from the first 800 decimals of e

    Group   w/o repeats    Group   w/o repeats    Group   w/o repeats    Group   w/o repeats
    1       3              5       4              9       4              13      5
    2       1              6       1              10      2              14      4
    3       3              7       4              11      3              15      6
    4       4              8       4              12      1              16      3

The total number of subgroups with all digits different (the sum of the numbers in Table 5) is 52, and there are 160 such groups, so the observed probability for a group not to have a repeated digit is:

P_T = 52/160 = 0.325

Then apply the combinatorial probability as previously developed:

P_Tc = (10)_5 / 10^5 = (10 · 9 · 8 · 7 · 6) / 10^5 = 0.302

This test, at least to its statistical accuracy, indicates the number string is random.
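The falling-factorial probability (N)_k / N^k is simple to evaluate directly; this fragment (my own naming) reproduces both the 0.72 and the 0.302 quoted above:

    def no_repeat_probability(N, k):
        """(N)_k / N^k: probability that k choices from N values all differ."""
        p = 1.0
        for i in range(k):
            p *= (N - i) / N
        return p

    print(no_repeat_probability(10, 3))  # 0.72,   three different digits
    print(no_repeat_probability(10, 5))  # 0.3024, five different digits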

Now consider the birthdays of k people, which form a sample from the number of days in a year. Assume all years have 365 days and that birthdays are equally probable on any day. The probability that in this sample all birthdays are different is:

P_D = (365)_k / 365^k

For k = 23 we find that P_D ≈ 1/2. Thus for 23 people, the probability of at least 2 people having the same birthday is 1 − P_D ≈ 1/2.

Suppose we choose k elements from a population of size N. We know that there are (N)_k ordered samples. Now the k elements can be arranged in k! ways. If X is the number of populations of size k, the number of ordered samples is X · k! (the number of samples times the number of ways to order them).

X · k! = (N)_k

X = (N)_k / k! = N(N − 1)···(N − k + 1) / k! = C(N, k)

The last term defines the binomial coefficient, C(N, k). Out of a group of N elements we can choose a group of k elements in C(N, k) different ways. The elements are indistinguishable.

As an example, there are C(52, 5) = 2,598,960 different ways to distribute cards in 5 card stud poker. What is the probability that a hand contains 5 different card values? The card values can be chosen in C(13, 5) ways, and there are 4 suits for each card. Thus:

P_C = C(13, 5) · 4^5 / C(52, 5) = 0.507
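Both results are easy to reproduce; the sketch below (mine) uses math.comb for the binomial coefficient:

    from math import comb

    # Birthday problem: probability that k birthdays are all different.
    def all_different(k, days=365):
        p = 1.0
        for i in range(k):
            p *= (days - i) / days
        return p

    print(all_different(23))   # ~0.493, so P(shared birthday) ~ 1/2

    # 5 card stud: probability of 5 different card values.
    print(comb(13, 5) * 4**5 / comb(52, 5))  # ~0.507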

As another example, suppose 5 balls (1, 2, 3, 4, 5), and pick 3. This results in the following combinations:

(123), (124), (125), (134), (135), (145), (234), (235), (245), (345)

The above also has a multiplication factor if the balls are reordered within each selection. This gives a total of 10 × 3! = 60 possibilities, but only 10 distinguishable ones if the balls are indistinguishable within each group of 3. We count the number of possibilities by recognizing that we can choose the first ball in 5 ways, the second in (5 − 1 = 4) ways, and the third in (5 − 2 = 3) ways. Thus we write the possibilities as:

N_T = 5 · 4 · 3 = 60

To generalize, write the number of distinguishable groups as:

C(N, k) = C(5, 3) = (5 · 4 · 3) / 3! = 10

There are then 3! = 6 ways to reorder the balls within each group. Using the frequency definition of probability, we develop probability distributions for the various ways to distribute a set of events into cells. In the case that the balls are distinguishable:

P = 3! / (5 · 4 · 3 / 3!) = 6/10
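The 10 distinguishable groups and the 60 ordered selections can be checked with itertools (a sketch of mine):

    from itertools import combinations, permutations

    balls = [1, 2, 3, 4, 5]
    groups  = list(combinations(balls, 3))    # 10 indistinguishable groups
    ordered = list(permutations(balls, 3))    # 60 ordered selections

    print(len(groups), len(ordered))          # 10 60
    print(len(ordered) // len(groups))        # 3! = 6 orderings per group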

9 Occupancy Problems

Use the model of placing k elements into N cells to determine the probability of various occupancy distributions, i.e. the probability of finding a particular distribution of the k elements among the N cells. There are N^k possible distributions, each with an equal probability of 1/N^k. Cell occupancy is defined by the notation k = k_1 + k_2 + ... + k_N, where k_i represents the number of elements in the i-th cell. The elements are indistinguishable. The resulting distributions are indistinguishable only if the ordered N-tuples (k_1, k_2, ..., k_N) are identical. We wish to find the number of possible distributions.

Graphically, let * represent an element, and let a cell be represented by the space between two vertical lines. An example of a distribution is:

|***|*||||*||**|

In the graphical representation above, the total number of interior lines is 7, and the total number of *'s is also k = 7. There are N = 8 cells, with occupancy (3, 1, 0, 0, 0, 1, 0, 2). Thus we obtain a particular distribution by selecting k positions out of a possible (N − 1) + k places. From previous considerations, this yields C(N + k − 1, k) distributions. In the case when only one particle per cell is allowed, then k < N, and the number of ways to put k elements in N cells with no more than 1 per cell is C(N, k).

As an example, the partial derivatives of a function do not depend on the order in which they are taken. Let each variable correspond to a cell and select k derivatives. Thus there are C(N + k − 1, k) partial derivatives of k-th order, and a function of 3 variables has 15 derivatives of 4th order:

Number = C(N + k − 1, k) = C(3 + 4 − 1, 4) = 15

Finally, consider the number of ways a population of N elements can be divided into j ordered parts having k_1, k_2, ..., k_j elements, with Σ_j k_j = N. This is given by:

Number = N! / (k_1! k_2! ··· k_j!)

If there is to be no order within the groups, choose k_1 elements out of the N, so the number of populations is C(N, k_1). When this is continued for all populations:

C(N, k_1) C(N − k_1, k_2) C(N − k_1 − k_2, k_3) ··· C(N − k_1 − ··· − k_{j−2}, k_{j−1})

As an example, 52 cards are partitioned into 4 equal groups. Thus one has:

Number = 52! / [(13!)(13!)(13!)(13!)]

There are then 4! ways to distribute one ace to each group. This leaves 48 cards which must be divided among the 4 groups as described above. The probability that each group receives one ace is then:

P = 4! [48!/(12!)^4] / [52!/(13!)^4] = 0.105
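The four-aces probability can be evaluated exactly with integer arithmetic (a sketch of mine; both quotients below are exact integers, being multinomial coefficients):

    from math import factorial

    # 4! ways to give one ace to each hand, times the partitions of the
    # remaining 48 cards into four groups of 12, over the partitions of
    # all 52 cards into four groups of 13.
    ways_aces = factorial(4)
    ways_rest = factorial(48) // factorial(12)**4
    ways_all  = factorial(52) // factorial(13)**4

    print(ways_aces * ways_rest / ways_all)  # ~0.105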

10 Moments

Suppose we have a set of N random events, X_i, occurring with frequencies f_i. The mean value of these events is defined by:

µ = (1/N) Σ_{i=1}^{N} f_i X_i

Then define the moments of this set of events as:

µ_r = (1/N) Σ_{i=1}^{N} f_i [X_i − µ]^r

Since [X_i − µ]^0 = 1, completeness gives:

µ_0 = (1/N) Σ_{i=1}^{N} f_i = 1

For the other moments:

µ_1 = (1/N) Σ_{i=1}^{N} f_i [X_i − µ] = 0

µ_2 = (1/N) Σ_{i=1}^{N} f_i [X_i − µ]^2 = σ^2

The spread of the distribution of events is measured by the variance, σ^2. The 3rd and 4th moments measure the asymmetry and the central peak relative to the distribution tails. A symmetrical distribution has µ_3 = 0; positive or negative values represent asymmetric tails to the right or left, respectively. This moment is called the skew of the distribution. The 4th moment divided by µ_2^2 is called the kurtosis, the peakedness of the central value of the distribution with respect to the tails.

As an example, suppose one has a distribution of events from the roll of a die. This produces a number from 1 to 6, each with probability 1/6, so a large number of events would equally populate all possible numbers, f_j = N/6. Thus the mean is:

µ = (1/N) Σ_{j=1}^{6} j (N/6) = 21/6 = 3.5

The variance is:

µ_2 = (1/N) Σ_{j=1}^{6} (j − 3.5)^2 (N/6) = 2.917

The moments of a discrete distribution as discussed above can be extended to a continuous distribution. In this case f_i → f(x) and X_i → g(x), where f(x) is a probability density function. Then:

∫ dx f(x) = 1

∫ dx f(x) g(x) = E(g)

E(g) is the expectation value of the function g(x). The expectation value of the random variable X is:

µ = ∫ dx x f(x)

The variance is:

σ^2 = E([x − µ]^2) = E(x^2) − µ^2

The standard deviation is the square root of the variance, σ. The variance may not always exist, as the integral may not converge.
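As a closing check (my addition), the discrete moments of the die distribution can be computed in a few lines, reproducing the mean 3.5 and variance 2.917 and showing that the third moment vanishes by symmetry:

    values = range(1, 7)
    probs  = [1/6] * 6

    mu   = sum(p * x for p, x in zip(probs, values))             # mean: 3.5
    var  = sum(p * (x - mu)**2 for p, x in zip(probs, values))   # variance: 2.917
    skew = sum(p * (x - mu)**3 for p, x in zip(probs, values))   # 0 by symmetry

    print(mu, var, skew)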