Fourier and Stats / Astro Stats and Measurement : Stats Notes


Andy Lawrence, University of Edinburgh, Autumn

1 Probabilities, distributions, and errors

Laplace once said "Probability theory is nothing but common sense reduced to calculation." This is true, but only if we keep a clear head... Stepping through the basics of probability is laborious, but it's worth it, because it's all too easy to fall into simple traps when doing probability calculations.

1.1 Probability

The concept of probability. There is an ongoing debate over the meaning of probabilities - whether they should be seen as the frequency of occurrence of something (the fraction of times a six is rolled) or as the degree of belief in something. To some extent it doesn't matter, as long as the quantities obey the same calculus of probabilities (see below), but adherents of these two views ask subtly different questions, as we shall see later. It is easier to follow the mathematical trail if we take the frequentist view for now, and consider the alternative view later.

Experiments, outcomes, and events. We can refer to any process of observation or measurement as an experiment - rolling a die, or measuring the mass of an electron. The results obtained are outcomes - a four rolled, or a mass of 509 keV. The set of all possible outcomes is called the sample space. For tossing a coin, that is just the two outcomes S = {H, T}. An event is a more general result, typically a combination of the elemental outcomes. For example, suppose we roll two dice. The sample space is the set of all outcomes

S = {(x, y) | x = 1, 2, ..., 6; y = 1, 2, ..., 6}

i.e. the set of all 36 pairs of values (x, y) such that x can be any of 1 to 6, and likewise for y. But suppose we are interested in how often we get a given total T.
The set of all ways we can achieve the event T = 7 is

S_7 = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}

This is 6 out of the 36 members of the full set of outcomes, so the probability of getting the event T = 7 is P(T = 7) = 6/36 = 1/6. We have assumed that all the outcomes have equal probability; but the various possible events T do not have equal probability. There is only one way of getting a total of 12, and no way at all of getting a total of 1.
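The counting argument above is easy to check by brute-force enumeration. A minimal sketch in Python (the function name is ours):

```python
from fractions import Fraction
from itertools import product

# Sample space for two dice: all 36 ordered pairs (x, y).
sample_space = list(product(range(1, 7), repeat=2))

def p_total(t):
    """P(T = t): favourable outcomes over equally likely elemental outcomes."""
    favourable = [pair for pair in sample_space if sum(pair) == t]
    return Fraction(len(favourable), len(sample_space))

print(p_total(7))   # 1/6
print(p_total(12))  # 1/36
print(p_total(1))   # 0
```

Summing p_total(t) over all possible totals t = 2, ..., 12 returns exactly 1, as it must for a complete set of exclusive events.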

Stats 1 : Probabilities

Figure 1: The difference between exclusive and non-exclusive events. Events A and B are mutually exclusive, whereas events C and D have elemental outcomes in common.

1.2 The calculus of probabilities

A probability is a number between 0 and 1. The combined probability of all the possible outcomes in the sample space S can be expressed as P(S) = 1. If there are N possible outcomes and each outcome x has the same probability, then P(x) = 1/N. If the outcomes are not all equally likely, then the probability of a given outcome is the fraction of times that outcome occurs. If an event E can occur in n different ways out of the N elemental outcomes, then the probability of E is P(E) = n/N.

How do we combine probabilities for multiple events? Everybody remembers that you add probabilities if you want one event or the other, and multiply them if you want both; but in fact you have to be very careful about whether the events are exclusive, and whether they are independent. As well as being important for getting probability problems right, this turns out to matter later when we look at statistical inference.

Exclusive and non-exclusive events. What is the probability of getting either event A or event B? This depends on whether the events are mutually exclusive, i.e. on whether they can both occur. Suppose we roll two dice. If event A is "a total of 7" and B is "a total of 5", then A and B cannot both occur. However, if B is the event "the two numbers include a 1", then A and B can both occur. Rolling 1 and 6 is an example of both events occurring; rolling 2 and 5 is an example of only A occurring; and rolling 1 and 4 is an example of only B occurring. The situation is illustrated graphically by the Venn diagram in Fig. 1.
If events A and B are mutually exclusive, then n(A ∪ B) = n(A) + n(B), and so

P(A or B) = P(A) + P(B)

If events C and D are not exclusive, then n(C ∪ D) = n(C) + n(D) − n(C ∩ D), and so

P(C or D) = P(C) + P(D) − P(C and D)

Of course this second version is generally true; for exclusive events we simply have P(C and D) = 0.
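Both addition rules can be verified by enumerating the two-dice sample space. A small Python check, using the events from the text (the variable names are ours):

```python
from fractions import Fraction
from itertools import product

space = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(space))

A = {r for r in space if sum(r) == 7}   # total of 7
B = {r for r in space if 1 in r}        # the two numbers include a 1
C = {r for r in space if sum(r) == 5}   # total of 5

# A and C are mutually exclusive: the simple sum applies.
assert A & C == set()
assert P(A | C) == P(A) + P(C)

# A and B are not exclusive: the overlap must be subtracted.
assert P(A | B) == P(A) + P(B) - P(A & B)
print(P(A | B))   # 5/12
```

Here P(A) = 6/36, P(B) = 11/36 and P(A and B) = 2/36 (the rolls (1,6) and (6,1)), so adding P(A) and P(B) without subtracting the overlap would double-count those two outcomes.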

Dependence and conditional probabilities. Suppose we pick balls from an urn. (The favourite hobby of statisticians.) Let us say that it contains two red balls and eight black balls. The probability of picking a red ball is 2/10 = 1/5. If we replace the ball and pick again, the probability of picking a red ball the second time is the same, 1/5. The two events are independent. Suppose however that we do not replace the first ball. Then the first pick matters - if it was black, the probability of red on the second pick is 2/9; but if the first pick was red, the probability is 1/9. The two events are dependent, or alternatively we can say that the second event is conditional on the first. In general, for two events A and B in a sample space S,

P(A and B) = P(A, B) = P(A) P(B|A)

where the expression P(B|A) is the conditional probability, and can be read as "the probability of B given A". If the two events are independent then P(B|A) = P(B), where P(B) is just the overall fraction of occurrences of B in our sample space S.

Bayes' theorem. The conditionality works both ways, so we can write

P(A, B) = P(A|B) P(B) = P(B|A) P(A)

This formula can be re-arranged to say

P(B|A) = P(A|B) P(B) / P(A)

In this form, it is known as Bayes' Theorem. As it stands, this is a fairly innocuous re-statement of the laws of probability. However, later on, when we consider parameter estimation and model fitting, it will have very important consequences.

1.3 Probability distributions

Random variables. A random variable is a number associated with the outcomes of chance experiments, whose value is different for each measurement or trial. The total T that we roll with a pair of dice is a random variable, as is the height h of a randomly chosen person. The total rolled with the dice is an example of a discrete random variable: it can only take on particular values. The height of a person is an example of a continuous random variable.
It can take an infinite number of possible values, but typically within a restricted range - the height of a person cannot be negative, for example. The set of all the possible values of the variable is the sample space we discussed above - whether that is a finite set of discrete values, a countably infinite set of discrete values, or an uncountably infinite set of real values.

Probability distributions. Every value of the random variable has an associated probability. For a discrete variable such as T this is the discrete set of values P(T) associated with T. This mapping from T to P(T) is known as a probability distribution. For several events A, B, C there will be a probability associated with all of A, B, C etc. occurring, which is then a multivariate probability distribution, P(A, B, C, ...). For a continuous variable such as h, any specific value strictly speaking has an infinitesimal probability; we have instead to define the density of probability p(h) within a small region, in the usual way, such that p(h) dh is the probability of occurrence of a value in the range h to h + dh. Any such function f(x) is then a probability density function or PDF. When we come to hypothesis testing, it is often important to know the integrated probability over some range. It is customary to refer to the integrated probability for a value less than some value x, i.e.

F(x) = ∫_{−∞}^{x} f(x′) dx′

as the distribution function for x. Just like with discrete events, sometimes we want to define a multivariate PDF. For example, if there are two random variables x and y, we can define p(x, y) such that p(x, y) dx dy is the probability that in a specific trial the two variables fall in the range x to x + dx and y to y + dy. Following the terminology for discrete probabilities, we can write the probability of x given y as p(x|y). You can visualise this if you think of p(x, y) as a two-dimensional surface, and take a slice through it along x at a fixed value of y. Likewise, if we simply write p(x), this is the probability distribution for x if we allow all values of y, and it can be calculated as

p(x) = ∫ p(x, y) dy

This is referred to as marginalisation over y. If x and y are independent then p(x, y) = p(x) p(y). Finally, we can also express the continuous version of Bayes' theorem as

p(y|x) = p(x|y) p(y) / p(x)

Parent vs sample distributions. The logic of a given physical situation implies a very definite probability distribution. If we roll two dice, we know exactly what proportion of the time we will get a total of 2, what proportion a total of 7, and so on. For the heights of people, or the masses of galaxies, the situation is much more complicated, and we may not actually know what the true probability distribution is; but our assumption in the physical sciences is that some such underlying process does exist, even if we do not know what it is. However, when we conduct a series of trials, for example rolling two dice fifty times, the actual observed frequency of 2s and 7s will usually not exactly match the expected values. The intrinsic expected distribution is known as the parent or population distribution. The discrete observed distribution is known as the sample distribution, because the observed values are sampled from the underlying true distribution.
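Marginalisation and the continuous form of Bayes' theorem can be illustrated numerically. A sketch, using an illustrative correlated two-dimensional Gaussian for p(x, y) (the grid and parameters are arbitrary choices, not from the text):

```python
import numpy as np

# Tabulate a joint density p(x, y) on a grid and normalise it.
x = np.linspace(-5, 5, 401)
y = np.linspace(-5, 5, 401)
dx = x[1] - x[0]
X, Y = np.meshgrid(x, y, indexing="ij")
p_xy = np.exp(-(X**2 - 1.2 * X * Y + Y**2))
p_xy /= p_xy.sum() * dx * dx        # so that the density integrates to 1

# Marginalisation: p(x) = integral of p(x, y) dy (a sum over the y axis here).
p_x = p_xy.sum(axis=1) * dx
p_y = p_xy.sum(axis=0) * dx

# Conditional densities are renormalised slices of the joint density.
p_y_given_x = p_xy / p_x[:, None]   # p(y|x)
p_x_given_y = p_xy / p_y[None, :]   # p(x|y)

# Bayes' theorem: p(y|x) = p(x|y) p(y) / p(x), checked point by point.
bayes_rhs = p_x_given_y * p_y[None, :] / p_x[:, None]
print(np.allclose(p_y_given_x, bayes_rhs))   # True
```

Each marginal integrates to 1, and each conditional slice p(y|x), once renormalised, integrates to 1 over y, as a density must.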
One of the key problems of physical science is that often we don't know what the underlying process is, but would like to find out. All we have is the sample distribution, and we wish to use this to try to understand what is going on. This is the problem of statistical inference, which will occupy us in later parts.

1.4 Characterising sample distributions

The parent distribution will be defined by some precise mathematical expression. By contrast, the sample distribution is simply a list of values x_i. For a situation that is intrinsically discrete, such as the rolls of a die, there can be many repeat values - five 2s, eight 3s, etc. For a situation that is intrinsically continuous, we can visualise the probability density by binning the values - that is, counting the number of values that fall within a set of discrete ranges - and plotting a histogram. We would like some standard measures to characterise the distribution of sample values without having to know the parent distribution.

Measures of location. First we might ask "what is the typical value?" There are three common ways of translating this qualitative concept into something rigorous. The first is to use the histogram and estimate the most probable value or mode, where the local density of x_i values is highest. This is intuitive, but sensitive to how you bin the histogram. Another method is the median - the value for which half the values are above and half below. This has the great virtue of being completely robust against transformations of x, but it is hard to work with mathematically. The commonest estimate is the arithmetic mean or average:

x̄ = (1/N) Σ_i x_i

For symmetrical parent distributions, the mean, mode and median will all be the same on average.

Measures of dispersion. Next we want to know the spread of values. We could start by calculating the deviation of each point from the mean, x_i − x̄. The mean value of this will typically be zero for many parent distributions, which is not helpful. We could find the average of the absolute value, but this is hard to work with mathematically. Instead it is usual to define the sample variance

s² = (1/N) Σ_i (x_i − x̄)²

The square root of the variance, s, is then the standard deviation. The ratio s/x̄ is sometimes called the coefficient of variation.

Moments of the sample distribution. A very general method of characterising a distribution is by defining its moments. The nth moment of a distribution about an arbitrary point x_a is defined as

µ^a_n = (1/N) Σ_{i=1}^{N} (x_i − x_a)^n

If we take the first moment about the point x_a = 0, we see that µ^0_1 = x̄. If the point x_a is taken to be the mean x̄, then the moments are called the central moments and written without the superscript. We then get

µ_0 = 1    µ_1 = 0    µ_2 = s²

Skewness and kurtosis. The first two moments (not counting the zeroth moment) correspond to the ideas of location and dispersion. The third moment is related to the skewness, or deviation from a symmetrical form, and the fourth moment is related to the kurtosis, or degree of peakiness. To compare different distributions with each other, it is usual to normalise these moments to the dispersion, but there are several different conventions for doing this. Two common definitions are:

skewness α_3 = µ_3 / µ_2^{3/2}    kurtosis α_4 = µ_4 / µ_2²

1.5 Expected values

The idea of moments is closely related to the more general concept of expectation. The expected value of a random variable is simply the average value of all its possible values, weighted by the probability of occurrence of each value.
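The sample statistics defined above (mean, median, central moments, skewness and kurtosis) can be computed directly from a list of values. A Python sketch, using an arbitrary right-skewed sample for illustration:

```python
import random
import statistics

random.seed(42)
# An illustrative right-skewed sample: squares of uniform deviates.
xs = [random.random() ** 2 for _ in range(10000)]
N = len(xs)

mean = sum(xs) / N
median = statistics.median(xs)

def mu(n):
    """Central moment mu_n = (1/N) * sum of (x_i - mean)^n."""
    return sum((x - mean) ** n for x in xs) / N

s2 = mu(2)                      # sample variance
s = s2 ** 0.5                   # standard deviation
alpha3 = mu(3) / mu(2) ** 1.5   # skewness
alpha4 = mu(4) / mu(2) ** 2     # kurtosis

print(f"mean={mean:.3f} median={median:.3f} s={s:.3f}")
print(f"skewness={alpha3:.2f} kurtosis={alpha4:.2f}")
```

For this sample the mean comes out above the median and the skewness is positive, as expected for a distribution with a tail to the right.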
We can define this concept for either discrete or continuous random variables:

E[X] = Σ_X X P(X)    or    E[x] = ∫ x p(x) dx

where P(X) is the set of probabilities of the discrete random variable X, and p(x) is the PDF of the continuous random variable x. You will often see expected values written as ⟨x⟩ rather than E[x]. Because a function of a random variable is also a random variable, we can calculate the expected value of any function of x:

E[f(x)] = ∫ f(x) p(x) dx

Parent distribution moments. The moments of a random variable x can now be defined as the expected values of x^n for various n: m_n = E[x^n]. (Comparing with the discussion of sample distributions above, this corresponds to the moment about x = 0.) The zeroth moment is

m_0 = ∫_{−∞}^{+∞} p(x) dx = 1

as long as p(x) is properly normalised. The first moment m_1 = E[x] is the mean, the expectation value of x. The standard second moment is m_2 = E[x²], but it is more useful to define the centred moments obtained by shifting the origin of x to the mean:

µ_n ≡ E[(x − E[x])^n]

Then the second centred moment is

µ_2 ≡ E[(x − E[x])²] = σ²

Conventionally we use the symbol σ² for the variance of the (theoretical) parent distribution, and s² for the variance of the (observed) sample distribution. We can define higher moments in the same fashion.

The algebra of expectations. When manipulating moments of distributions, it's handy to use some simple rules for how expectation values combine:

E[X + Y] = E[X] + E[Y]
E[X − Y] = E[X] − E[Y]
E[aX + b] = aE[X] + b
E[aX + bY] = aE[X] + bE[Y]

Note also that if b is a constant, E[b] = b; and because an expected value is itself a constant, having integrated over x, E[E[x]] is just E[x]. A useful result which follows is

σ² = E[(x − E[x])²] = E[x² − 2xE[x] + E[x]²] = E[x²] − E[x]²

In other words, the variance can be obtained as the mean of the square minus the square of the mean.

Relation between sample and parent moments.
For sample distributions, statistics such as the mean and variance are useful objective summaries of the properties of the observed sample, but they can also be seen as estimates of the corresponding quantities in the parent distribution. (We will look at this more closely in part 3.) However, there are in principle different ways of using the sample values to estimate the properties of the parent distribution, so we need to be a little careful. For example, according to the maximum likelihood method which we will discuss later, the best estimate of the parent variance is not s² but s²N/(N − 1). In parts 3 and 5 we will look more closely at how to estimate the parameters of a model.
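The N/(N − 1) correction can be seen empirically by drawing many small samples from a parent with known variance and averaging the naive sample variance. A Monte Carlo sketch (the parent, sample size and trial count are illustrative choices):

```python
import random

random.seed(1)

# Parent: a standard normal, so the true variance is 1.
N = 5             # a small sample, where the bias is obvious
trials = 100000

total = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(N)]
    xbar = sum(xs) / N
    total += sum((x - xbar) ** 2 for x in xs) / N   # s^2, dividing by N

mean_s2 = total / trials
print(mean_s2)                  # about (N-1)/N = 0.8: s^2 is biased low
print(mean_s2 * N / (N - 1))    # about 1.0: the corrected estimate
```

The naive s² is biased low because the deviations are measured from the sample mean, which has itself been pulled towards the data; multiplying by N/(N − 1) removes the bias.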

1.6 Error analysis

From high school physics onwards, we are taught that we should associate uncertainties or errors with our measurements. What does this actually mean?

Mistakes and biases. In normal English, an error is some kind of actual mistake. This can certainly apply to physical experiments. If we believe we are measuring a lump of Sodium, and it is actually a lump of Potassium, we will certainly get an incorrect value. There might also be some kind of fixed bias. We measure some effect with a voltmeter, but a problem with our circuitry means that all our measurements have 0.03 V added to them.

Random errors. Mistakes and biases can in principle be corrected. However, central to the idea of error analysis is that a range of measured values is unavoidable, because the act of measurement is itself a random process. We assume that some property X has a true value, but the measured value x is a random variable. Each time we run the experiment and produce a value of x, it is drawn from some probability distribution p(x) related to the true value X. We should carefully distinguish the idea of randomness in the measurement process from the idea of randomness in actual physical properties. People have a variety of heights, and so the height h of a person selected at random from a large population has a spread. If we measure the height of one specific person, we should in principle get a single definite value. However, when we measure the height of that specific person repeatedly, we actually get a small spread of results, because of randomness in the measurement process. The standard deviation of this error distribution is what we normally use to quantify the size of the error. Later, in part 3, we will sharpen this idea and talk about confidence regions.

Systematic vs random errors.
If a sample is drawn from the underlying parent error distribution, we can use the sample mean as an estimate of the mean of the parent distribution, which in turn we take to represent the true value. The more measurements we make, the more precise our estimate of the parent mean will be. The error on 30 measurements is smaller than the error on 3 measurements. (We will sharpen that statement in parts 3 and 5.) However, if we are dominated by systematic errors, the precision does not improve - the error stays just as big, however many measurements we make. Sometimes both a random error and a systematic error apply, and it is healthy to quote each separately, rather than a combined error. We can also question the assumption that the mean of the parent error distribution is the same as the true value of the physical quantity concerned. This captures the normal English concept of bias, but in a statistical sense. Because of the possibility of bias, accuracy - the difference between true and measured values - is not the same as precision. As we make more measurements and the random error reduces, the accuracy may not improve by much.

Evaluating errors. Sometimes we know what is causing the randomness in the measurement process. For example, when an experiment involves counting things, such as the number of photons detected from a star, the number counted is subject to Poisson statistics, which we will look at in part 2. Often, however, we do not know the source of error; we then need to estimate the error empirically, by making a sequence of measurements and calculating the sample variance. Of course, as Donald Rumsfeld might have said, our problem is not just the known known errors, or the known unknown errors, but the unknown unknown errors.

Propagation of errors.
Once we have estimated an error on some measured quantity x, or perhaps on two quantities x and y, we would like to know what this implies for some derived quantity z - for example, in our equipment we actually measure a voltage and a time delay, but use some formula to calculate the value of the electron mass that these measurements imply. We can use the usual rules of calculus to transform from one variable to another, and the definitions of expected values to calculate the second moment of the desired variable. The result depends on whether x and y are dependent or independent variables. We will defer a detailed discussion of the issue of dependence to part 4. Meanwhile, without proof, we can state the result for two independent variables.

If z = f(x, y) is some function f of the random variables x and y, and we know the variance of each of these, what is the variance of z? If the variables are independent, then

σ_z² = (∂f/∂x)² σ_x² + (∂f/∂y)² σ_y²

This formula is useful in many places, but in particular it shows us how to propagate errors. We can use the propagation formula to see how this works for a variety of different mathematical relationships. Some examples are:

f = ax ± by        σ_f² = a² σ_x² + b² σ_y²
f = xy or x/y      (σ_f/f)² = (σ_x/x)² + (σ_y/y)²
f = a x^(±b)       σ_f/f = b (σ_x/x)
f = a ln(±bx)      σ_f = a (σ_x/x)
f = a e^(±bx)      σ_f/f = b σ_x
f = a^(±bx)        σ_f/f = b ln(a) σ_x
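The propagation formula can be checked against a direct Monte Carlo simulation. A sketch for the product rule f = xy, with invented central values and independent Gaussian errors:

```python
import random

random.seed(0)

# Illustrative measured values and their errors.
x0, sx = 4.0, 0.04
y0, sy = 9.0, 0.18
n = 200000

# Simulate many repeats of the experiment and form f = x * y each time.
fs = [random.gauss(x0, sx) * random.gauss(y0, sy) for _ in range(n)]
m = sum(fs) / n
sf_mc = (sum((f - m) ** 2 for f in fs) / n) ** 0.5

# Propagation formula for a product: (sigma_f/f)^2 = (sigma_x/x)^2 + (sigma_y/y)^2
sf_formula = x0 * y0 * ((sx / x0) ** 2 + (sy / y0) ** 2) ** 0.5

print(sf_mc, sf_formula)   # both close to 0.80
```

The agreement is good here because the fractional errors are small; the first-order propagation formula neglects higher-order terms, which matter when the fractional errors become large.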


More information

IAM 530 ELEMENTS OF PROBABILITY AND STATISTICS LECTURE 3-RANDOM VARIABLES

IAM 530 ELEMENTS OF PROBABILITY AND STATISTICS LECTURE 3-RANDOM VARIABLES IAM 530 ELEMENTS OF PROBABILITY AND STATISTICS LECTURE 3-RANDOM VARIABLES VARIABLE Studying the behavior of random variables, and more importantly functions of random variables is essential for both the

More information

Chapter 4: An Introduction to Probability and Statistics

Chapter 4: An Introduction to Probability and Statistics Chapter 4: An Introduction to Probability and Statistics 4. Probability The simplest kinds of probabilities to understand are reflected in everyday ideas like these: (i) if you toss a coin, the probability

More information

Probability and distributions. Francesco Corona

Probability and distributions. Francesco Corona Probability Probability and distributions Francesco Corona Department of Computer Science Federal University of Ceará, Fortaleza Probability Many kinds of studies can be characterised as (repeated) experiments

More information

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics)

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Probability quantifies randomness and uncertainty How do I estimate the normalization and logarithmic slope of a X ray continuum, assuming

More information

Probability - Lecture 4

Probability - Lecture 4 1 Introduction Probability - Lecture 4 Many methods of computation physics and the comparison of data to a mathematical representation, apply stochastic methods. These ideas were first introduced in the

More information

2) There should be uncertainty as to which outcome will occur before the procedure takes place.

2) There should be uncertainty as to which outcome will occur before the procedure takes place. robability Numbers For many statisticians the concept of the probability that an event occurs is ultimately rooted in the interpretation of an event as an outcome of an experiment, others would interpret

More information

Probability and Independence Terri Bittner, Ph.D.

Probability and Independence Terri Bittner, Ph.D. Probability and Independence Terri Bittner, Ph.D. The concept of independence is often confusing for students. This brief paper will cover the basics, and will explain the difference between independent

More information

review session gov 2000 gov 2000 () review session 1 / 38

review session gov 2000 gov 2000 () review session 1 / 38 review session gov 2000 gov 2000 () review session 1 / 38 Overview Random Variables and Probability Univariate Statistics Bivariate Statistics Multivariate Statistics Causal Inference gov 2000 () review

More information

Probability and Estimation. Alan Moses

Probability and Estimation. Alan Moses Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.

More information

Chapter 8: An Introduction to Probability and Statistics

Chapter 8: An Introduction to Probability and Statistics Course S3, 200 07 Chapter 8: An Introduction to Probability and Statistics This material is covered in the book: Erwin Kreyszig, Advanced Engineering Mathematics (9th edition) Chapter 24 (not including

More information

Notes 12 Autumn 2005

Notes 12 Autumn 2005 MAS 08 Probability I Notes Autumn 005 Conditional random variables Remember that the conditional probability of event A given event B is P(A B) P(A B)/P(B). Suppose that X is a discrete random variable.

More information

Probability and Probability Distributions. Dr. Mohammed Alahmed

Probability and Probability Distributions. Dr. Mohammed Alahmed Probability and Probability Distributions 1 Probability and Probability Distributions Usually we want to do more with data than just describing them! We might want to test certain specific inferences about

More information

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 CS 70 Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 Introduction One of the key properties of coin flips is independence: if you flip a fair coin ten times and get ten

More information

1 Probability and Random Variables

1 Probability and Random Variables 1 Probability and Random Variables The models that you have seen thus far are deterministic models. For any time t, there is a unique solution X(t). On the other hand, stochastic models will result in

More information

Probability. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh. August 2014

Probability. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh. August 2014 Probability Machine Learning and Pattern Recognition Chris Williams School of Informatics, University of Edinburgh August 2014 (All of the slides in this course have been adapted from previous versions

More information

Probability Theory. Introduction to Probability Theory. Principles of Counting Examples. Principles of Counting. Probability spaces.

Probability Theory. Introduction to Probability Theory. Principles of Counting Examples. Principles of Counting. Probability spaces. Probability Theory To start out the course, we need to know something about statistics and probability Introduction to Probability Theory L645 Advanced NLP Autumn 2009 This is only an introduction; for

More information

RVs and their probability distributions

RVs and their probability distributions RVs and their probability distributions RVs and their probability distributions In these notes, I will use the following notation: The probability distribution (function) on a sample space will be denoted

More information

Bayesian Inference. Introduction

Bayesian Inference. Introduction Bayesian Inference Introduction The frequentist approach to inference holds that probabilities are intrinsicially tied (unsurprisingly) to frequencies. This interpretation is actually quite natural. What,

More information

01 Probability Theory and Statistics Review

01 Probability Theory and Statistics Review NAVARCH/EECS 568, ROB 530 - Winter 2018 01 Probability Theory and Statistics Review Maani Ghaffari January 08, 2018 Last Time: Bayes Filters Given: Stream of observations z 1:t and action data u 1:t Sensor/measurement

More information

Probability Rules. MATH 130, Elements of Statistics I. J. Robert Buchanan. Fall Department of Mathematics

Probability Rules. MATH 130, Elements of Statistics I. J. Robert Buchanan. Fall Department of Mathematics Probability Rules MATH 130, Elements of Statistics I J. Robert Buchanan Department of Mathematics Fall 2018 Introduction Probability is a measure of the likelihood of the occurrence of a certain behavior

More information

Example A. Define X = number of heads in ten tosses of a coin. What are the values that X may assume?

Example A. Define X = number of heads in ten tosses of a coin. What are the values that X may assume? Stat 400, section.1-.2 Random Variables & Probability Distributions notes by Tim Pilachowski For a given situation, or experiment, observations are made and data is recorded. A sample space S must contain

More information

Discrete Probability Refresher

Discrete Probability Refresher ECE 1502 Information Theory Discrete Probability Refresher F. R. Kschischang Dept. of Electrical and Computer Engineering University of Toronto January 13, 1999 revised January 11, 2006 Probability theory

More information

Recitation 2: Probability

Recitation 2: Probability Recitation 2: Probability Colin White, Kenny Marino January 23, 2018 Outline Facts about sets Definitions and facts about probability Random Variables and Joint Distributions Characteristics of distributions

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

CS 246 Review of Proof Techniques and Probability 01/14/19

CS 246 Review of Proof Techniques and Probability 01/14/19 Note: This document has been adapted from a similar review session for CS224W (Autumn 2018). It was originally compiled by Jessica Su, with minor edits by Jayadev Bhaskaran. 1 Proof techniques Here we

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

Probability calculus and statistics

Probability calculus and statistics A Probability calculus and statistics A.1 The meaning of a probability A probability can be interpreted in different ways. In this book, we understand a probability to be an expression of how likely it

More information

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007

HST.582J / 6.555J / J Biomedical Signal and Image Processing Spring 2007 MIT OpenCourseWare http://ocw.mit.edu HST.582J / 6.555J / 16.456J Biomedical Signal and Image Processing Spring 2007 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Random variables (discrete)

Random variables (discrete) Random variables (discrete) Saad Mneimneh 1 Introducing random variables A random variable is a mapping from the sample space to the real line. We usually denote the random variable by X, and a value that

More information

Statistics notes. A clear statistical framework formulates the logic of what we are doing and why. It allows us to make precise statements.

Statistics notes. A clear statistical framework formulates the logic of what we are doing and why. It allows us to make precise statements. Statistics notes Introductory comments These notes provide a summary or cheat sheet covering some basic statistical recipes and methods. These will be discussed in more detail in the lectures! What is

More information

PhysicsAndMathsTutor.com. Advanced/Advanced Subsidiary. You must have: Mathematical Formulae and Statistical Tables (Blue)

PhysicsAndMathsTutor.com. Advanced/Advanced Subsidiary. You must have: Mathematical Formulae and Statistical Tables (Blue) Write your name here Surname Other names Pearson Edexcel International Advanced Level Centre Number Statistics S1 Advanced/Advanced Subsidiary Candidate Number Friday 20 January 2017 Afternoon Time: 1

More information

Basics on Probability. Jingrui He 09/11/2007

Basics on Probability. Jingrui He 09/11/2007 Basics on Probability Jingrui He 09/11/2007 Coin Flips You flip a coin Head with probability 0.5 You flip 100 coins How many heads would you expect Coin Flips cont. You flip a coin Head with probability

More information

1 What are probabilities? 2 Sample Spaces. 3 Events and probability spaces

1 What are probabilities? 2 Sample Spaces. 3 Events and probability spaces 1 What are probabilities? There are two basic schools of thought as to the philosophical status of probabilities. One school of thought, the frequentist school, considers the probability of an event to

More information

Introduction to Statistics and Data Analysis

Introduction to Statistics and Data Analysis Introduction to Statistics and Data Analysis RSI 2005 Staff July 15, 2005 Variation and Statistics Good experimental technique often requires repeated measurements of the same quantity These repeatedly

More information

2.3 Estimating PDFs and PDF Parameters

2.3 Estimating PDFs and PDF Parameters .3 Estimating PDFs and PDF Parameters estimating means - discrete and continuous estimating variance using a known mean estimating variance with an estimated mean estimating a discrete pdf estimating a

More information

Examples of frequentist probability include games of chance, sample surveys, and randomized experiments. We will focus on frequentist probability sinc

Examples of frequentist probability include games of chance, sample surveys, and randomized experiments. We will focus on frequentist probability sinc FPPA-Chapters 13,14 and parts of 16,17, and 18 STATISTICS 50 Richard A. Berk Spring, 1997 May 30, 1997 1 Thinking about Chance People talk about \chance" and \probability" all the time. There are many

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Brief Review of Probability

Brief Review of Probability Brief Review of Probability Nuno Vasconcelos (Ken Kreutz-Delgado) ECE Department, UCSD Probability Probability theory is a mathematical language to deal with processes or experiments that are non-deterministic

More information

2007 Winton. Empirical Distributions

2007 Winton. Empirical Distributions 1 Empirical Distributions 2 Distributions In the discrete case, a probability distribution is just a set of values, each with some probability of occurrence Probabilities don t change as values occur Example,

More information

Statistical Concepts

Statistical Concepts Statistical Concepts Ad Feelders May 19, 2015 1 Introduction Statistics is the science of collecting, organizing and drawing conclusions from data. How to properly produce and collect data is studied in

More information

ACE 562 Fall Lecture 2: Probability, Random Variables and Distributions. by Professor Scott H. Irwin

ACE 562 Fall Lecture 2: Probability, Random Variables and Distributions. by Professor Scott H. Irwin ACE 562 Fall 2005 Lecture 2: Probability, Random Variables and Distributions Required Readings: by Professor Scott H. Irwin Griffiths, Hill and Judge. Some Basic Ideas: Statistical Concepts for Economists,

More information

1. When applied to an affected person, the test comes up positive in 90% of cases, and negative in 10% (these are called false negatives ).

1. When applied to an affected person, the test comes up positive in 90% of cases, and negative in 10% (these are called false negatives ). CS 70 Discrete Mathematics for CS Spring 2006 Vazirani Lecture 8 Conditional Probability A pharmaceutical company is marketing a new test for a certain medical condition. According to clinical trials,

More information

Probability and statistics; Rehearsal for pattern recognition

Probability and statistics; Rehearsal for pattern recognition Probability and statistics; Rehearsal for pattern recognition Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics 166 36 Prague 6, Jugoslávských

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

7.1 What is it and why should we care?

7.1 What is it and why should we care? Chapter 7 Probability In this section, we go over some simple concepts from probability theory. We integrate these with ideas from formal language theory in the next chapter. 7.1 What is it and why should

More information

PROBABILITY THEORY REVIEW

PROBABILITY THEORY REVIEW PROBABILITY THEORY REVIEW CMPUT 466/551 Martha White Fall, 2017 REMINDERS Assignment 1 is due on September 28 Thought questions 1 are due on September 21 Chapters 1-4, about 40 pages If you are printing,

More information

1 INFO Sep 05

1 INFO Sep 05 Events A 1,...A n are said to be mutually independent if for all subsets S {1,..., n}, p( i S A i ) = p(a i ). (For example, flip a coin N times, then the events {A i = i th flip is heads} are mutually

More information

STAT2201. Analysis of Engineering & Scientific Data. Unit 3

STAT2201. Analysis of Engineering & Scientific Data. Unit 3 STAT2201 Analysis of Engineering & Scientific Data Unit 3 Slava Vaisman The University of Queensland School of Mathematics and Physics What we learned in Unit 2 (1) We defined a sample space of a random

More information

Unit 4 Probability. Dr Mahmoud Alhussami

Unit 4 Probability. Dr Mahmoud Alhussami Unit 4 Probability Dr Mahmoud Alhussami Probability Probability theory developed from the study of games of chance like dice and cards. A process like flipping a coin, rolling a die or drawing a card from

More information

Topic 3: The Expectation of a Random Variable

Topic 3: The Expectation of a Random Variable Topic 3: The Expectation of a Random Variable Course 003, 2017 Page 0 Expectation of a discrete random variable Definition (Expectation of a discrete r.v.): The expected value (also called the expectation

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 34 To start out the course, we need to know something about statistics and This is only an introduction; for a fuller understanding, you would

More information

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:

More information

02 Background Minimum background on probability. Random process

02 Background Minimum background on probability. Random process 0 Background 0.03 Minimum background on probability Random processes Probability Conditional probability Bayes theorem Random variables Sampling and estimation Variance, covariance and correlation Probability

More information

Basic Probability Reference Sheet

Basic Probability Reference Sheet February 27, 2001 Basic Probability Reference Sheet 17.846, 2001 This is intended to be used in addition to, not as a substitute for, a textbook. X is a random variable. This means that X is a variable

More information

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

CONTENTS OF DAY 2. II. Why Random Sampling is Important 10 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE 1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 4 Problems with small populations 9 II. Why Random Sampling is Important 10 A myth,

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

Bivariate distributions

Bivariate distributions Bivariate distributions 3 th October 017 lecture based on Hogg Tanis Zimmerman: Probability and Statistical Inference (9th ed.) Bivariate Distributions of the Discrete Type The Correlation Coefficient

More information

Introduction to Bayesian Statistics

Introduction to Bayesian Statistics School of Computing & Communication, UTS January, 207 Random variables Pre-university: A number is just a fixed value. When we talk about probabilities: When X is a continuous random variable, it has a

More information

This does not cover everything on the final. Look at the posted practice problems for other topics.

This does not cover everything on the final. Look at the posted practice problems for other topics. Class 7: Review Problems for Final Exam 8.5 Spring 7 This does not cover everything on the final. Look at the posted practice problems for other topics. To save time in class: set up, but do not carry

More information

Conditional Probability

Conditional Probability Conditional Probability Idea have performed a chance experiment but don t know the outcome (ω), but have some partial information (event A) about ω. Question: given this partial information what s the

More information

Single Maths B: Introduction to Probability

Single Maths B: Introduction to Probability Single Maths B: Introduction to Probability Overview Lecturer Email Office Homework Webpage Dr Jonathan Cumming j.a.cumming@durham.ac.uk CM233 None! http://maths.dur.ac.uk/stats/people/jac/singleb/ 1 Introduction

More information

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations EECS 70 Discrete Mathematics and Probability Theory Fall 204 Anant Sahai Note 5 Random Variables: Distributions, Independence, and Expectations In the last note, we saw how useful it is to have a way of

More information