Probability, CLT, CLT counterexamples, Bayes. The PDF file of this lecture contains a full reference document on probability and random variables.


1 Lecture 5, A6523 Signal Modeling, Statistical Inference and Data Mining in Astrophysics, Spring. Probability, CLT, CLT counterexamples, Bayes. The PDF file of this lecture contains a full reference document on probability and random variables.

2 Notions of Probability

Qualitative (knowledge): a measure of the degree to which propositions, hypotheses, or quantities are known. The measure can be an intrinsic property of an event in and of itself, or relative to other events. E.g. "It is not probable that the Sun will explode as a supernova" is an absolute statement based on our background knowledge of stellar evolution. E.g. "Horse A is more probable to win, place, or show than horse B" is a relative statement that includes the constraint that one horse wins to the exclusion of any other horse in the race. Bayesian inference makes use of this qualitative form of probability along with the quantitative aspects discussed below. In this sense, Bayesian methods attribute probabilities to more entities than do some of the more formalistic approaches.

3 Quantitative (frequentist approach: random variables and ensembles)

Classical: probabilities of events $\zeta$ in an event space $S$ are found, a priori, by considering the possible outcomes of experiments and the manner in which the outcomes may be achieved. For example, a single toss of a fair die yields probability 1/6 for each outcome. A problem with the classical definition (which often calculates probabilities from line-segment lengths, areas, volumes, etc.) is that probabilities cannot be evaluated for all cases (e.g. an unfair die).

Relative frequency: the probability of an event $\zeta$ is defined to be the limit of the frequency of occurrence of the event in $N$ repeated experiments as $N \to \infty$:

$$P\{\zeta\} = \lim_{N\to\infty} \frac{\sigma_N}{N},$$

where $\sigma_N$ is the number of occurrences of $\zeta$ in $N$ trials. The problem with frequencies is that in real experiments $N$ is always finite, so probabilities can only be estimated. Such estimates are a poor basis for deductive probabilistic theories.

4 Axiomatic: a deductive theory of probability can be based on axioms of probability for allowed events in an overall event space. For $\zeta$ an element in the event space $S$ and $P(\zeta)$ the probability of the event $\zeta$, the following hold:

i) $0 \le P(\zeta) \le 1$

ii) $P\{S\} = 1$ ($S$ = set of all events, so $P\{S\}$ is the probability that any event will be the outcome of an experiment).

iii) If $\zeta_1$ and $\zeta_2$ are mutually exclusive, then the probability of $\zeta_1 + \zeta_2$ (the event that $\zeta_1$ or $\zeta_2$ occurs) is $P(\zeta_1 + \zeta_2) = P(\zeta_1) + P(\zeta_2)$.

These axioms further imply:

iv) If $\bar\zeta$ is the event that $\zeta$ does not occur, then $P(\bar\zeta) = 1 - P(\zeta)$.

v) If $\zeta_1$ is a sufficient condition for $\zeta_2$, then $P(\zeta_1) \le P(\zeta_2)$. Equality holds when $\zeta_1$ is also a necessary condition for $\zeta_2$.

vi) Let $P(\zeta_1\zeta_2)$ be the probability of the event that $\zeta_1$ and $\zeta_2$ both occur (overlap in a Venn diagram). Mutually exclusive events have $P(\zeta_1\zeta_2) = 0$, while in general $P(\zeta_1\zeta_2) \ge 0$. In general we also have

$$P(\zeta_1 + \zeta_2) = P(\zeta_1) + P(\zeta_2) - P(\zeta_1\zeta_2) \le P(\zeta_1) + P(\zeta_2).$$


6 Conditional Probabilities: conditional probabilities also satisfy the axioms. Consider the probability that an event $\zeta_2$ occurs given that the event $\zeta_1$ occurs. It may be shown that this probability is

$$P\{\zeta_2|\zeta_1\} = \frac{P\{\zeta_1\zeta_2\}}{P\{\zeta_1\}}.$$

Similarly,

$$P\{\zeta_1|\zeta_2\} = \frac{P\{\zeta_1\zeta_2\}}{P\{\zeta_2\}}.$$

Bayes Theorem: solving for $P\{\zeta_1\zeta_2\}$ in the two preceding equations and setting the solutions equal yields Bayes theorem:

$$P\{\zeta_2|\zeta_1\} = \frac{P\{\zeta_1|\zeta_2\}\, P\{\zeta_2\}}{P\{\zeta_1\}}.$$

7 Bayesian Inference: Bayesian inference is based on the preceding equation where, with relaxed definitions of the event space, hypotheses and parameters are attributed probabilities based on knowledge before and after an experiment is conducted. Bayes theorem combined with the qualitative interpretation of probability therefore allows the sequential acquisition of knowledge (i.e. learning) to be handled. The implied temporal sequence of events, by which data are accumulated and the likelihood of a hypothesis being true increases or decreases, represents the power of the Bayesian outlook. Moreover, with Bayesian inference, the assumptions behind the inference are often brought up front as conditions upon which probabilities are calculated.
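
As a concrete illustration of this sequential learning, here is a minimal numerical sketch (not from the lecture; the coin-flip data and grid resolution are invented for illustration): a posterior for a coin's heads probability is updated flip by flip, exactly as Bayes theorem prescribes.

```python
import numpy as np

# Hypothetical example: infer a coin's heads probability theta.
theta = np.linspace(0.0, 1.0, 501)          # parameter grid
posterior = np.ones_like(theta)             # flat prior P(M|I)
posterior /= np.trapz(posterior, theta)

flips = [1, 0, 1, 1, 1, 0, 1]               # made-up data (1 = heads)
for d in flips:
    likelihood = theta if d else 1.0 - theta   # P(D|M,I) for one flip
    posterior = posterior * likelihood         # Bayes: prior -> posterior
    posterior /= np.trapz(posterior, theta)    # renormalize (divides out P(D|I))

print("posterior mean of theta:", np.trapz(theta * posterior, theta))
```

Each pass through the loop uses the previous posterior as the new prior, which is the "sequential acquisition of knowledge" described above.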

8 Probability on One Page

1. PDF and CDF: $f_X(x)$ has unit area and $F_X(x) \in [0,1]$; $f_X(x) = \dfrac{dF_X(x)}{dx}$.

2. Characteristic function: $\Phi_X(\omega) = \langle e^{i\omega x} \rangle = \int dx\, f_X(x)\, e^{i\omega x}$.

3. Moments: $\langle X^n \rangle = i^{-n} \left. \dfrac{d^n \Phi_X(\omega)}{d\omega^n} \right|_{\omega=0}$ (moment generating function).

4. Change of variable: $y = g(x)$ with solutions $x_j = g^{-1}(y)$, $j = 1,\dots,n$. Probability is conserved, $f_Y(y)\,|dy| = f_X(x)\,|dx|$:

$$f_Y(y) = \sum_{j=1}^{n} \frac{f_X(x_j)}{|dg(x)/dx|_{x=x_j}}.$$

For an N×N transformation, the derivative is replaced by the determinant of the Jacobian matrix.

5. Sum of independent RVs, $Z = X + Y$ (easily extendable to a sum of N RVs): $f_Z(z) = f_X * f_Y$ (convolution), $\Phi_Z(\omega) = \Phi_X(\omega)\,\Phi_Y(\omega)$ (product), $\sigma_Z^2 = \sigma_X^2 + \sigma_Y^2$ (variances add).

6. Conditional PDFs: $f_X(x|y)\, f_Y(y) = f_{XY}(x,y)$.

7. Bayes Theorem: $f_Y(y|x) = \dfrac{f_X(x|y)\, f_Y(y)}{f_X(x)}$.

9 Generating Pseudo-Random Numbers (cf. Section 5.13 of Gregory)

Several methods:
1. Transformation method (exponential example), e.g. $X = g(Y)$.
2. CDF mapping of uniformly distributed numbers: works for an arbitrary output PDF.
3. Measurements of natural systems.

Higher-order statistics (e.g. specifying a power spectral shape) require other methods, to be discussed.

10 [Figure: CDF mapping of a uniform variate U in [0, 1] to an arbitrary output PDF.]
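
A minimal sketch of the CDF-mapping method in Python (an illustration, not the lecture's code; the target PDF, a Gaussian truncated to [0, 5], is an arbitrary choice): tabulate the CDF of the desired PDF, then map uniform variates through its inverse by interpolation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Target PDF (illustrative choice): a Gaussian truncated to [0, 5].
x = np.linspace(0.0, 5.0, 2001)
pdf = np.exp(-0.5 * (x - 2.0) ** 2)
pdf /= np.trapz(pdf, x)                      # normalize to unit area

# Build the CDF by trapezoidal cumulative integration.
cdf = np.concatenate(([0.0], np.cumsum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(x))))
cdf /= cdf[-1]                               # guard against round-off

u = rng.uniform(size=100_000)                # U in [0, 1]
samples = np.interp(u, cdf, x)               # x = F^{-1}(u) by interpolation

# Sanity check: the sample histogram should track the target PDF.
hist, edges = np.histogram(samples, bins=50, density=True)
```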


13 All three measures are localization measures. Other quantities are needed to measure the width and asymmetry of the PDF, etc.

14 Functions of a Random Variable: the function $Y = g(X)$ is a random variable that is a mapping from some event $A$ to a number $Y$ according to $Y(A) = g[X(A)]$.

Theorem: if $Y = g(X)$, then the PDF of $Y$ is

$$f_Y(y) = \sum_{j=1}^{n} \frac{f_X(x_j)}{|dg(x)/dx|_{x=x_j}},$$

where $x_j$, $j = 1,\dots,n$, are the solutions $x_j = g^{-1}(y)$. Note that the normalization property is conserved (unit area). This is one of the most important equations!

Example 1: $Y = g(X) = aX + b$.

$$\frac{dg}{dx} = a, \qquad x_1 = g^{-1}(y) = \frac{y-b}{a},$$

$$f_Y(y) = \frac{f_X(x_1)}{|dg(x_1)/dx|} = |a|^{-1}\, f_X\!\left(\frac{y-b}{a}\right).$$

To check: show that $\int dy\, f_Y(y) = 1$.

15 Example 2: suppose we want to transform from a uniform distribution to an exponential distribution, i.e. we want $f_Y(y) = \exp(-y)$. A typical random number generator gives $f_X(x)$ with

$$f_X(x) = \begin{cases} 1, & 0 \le x < 1; \\ 0, & \text{otherwise.} \end{cases}$$

Choose $y = g(x) = -\ln(x)$. Then

$$\left|\frac{dg}{dx}\right|_{x_1} = \frac{1}{x}, \qquad x_1 = g^{-1}(y) = e^{-y},$$

$$f_Y(y) = \frac{f_X[\exp(-y)]}{1/x_1} = x_1 = e^{-y}.$$

Factoid: Poisson events in time have spacings that are exponentially distributed.
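
A quick numerical check of this transformation (a sketch under the slide's assumptions): draw uniform variates, map them through $y = -\ln x$, and compare the resulting histogram with $e^{-y}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(size=200_000)          # f_X: uniform on [0, 1)
y = -np.log1p(-x)                      # -ln(1 - x): same distribution as -ln(x),
                                       # but avoids log(0) since x can equal 0
hist, edges = np.histogram(y, bins=60, range=(0.0, 6.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
# Agreement to within sampling noise (and the tiny tail beyond y = 6):
print("max |histogram - exp(-y)|:", np.abs(hist - np.exp(-centers)).max())
```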

16 Moments

We will always use angular brackets $\langle\ \rangle$ to denote an average over an ensemble (integrating over an ensemble); time averages and other sample averages will be denoted differently.

Expected value of a random variable with respect to the PDF of $x$: $E(X) \equiv \langle X \rangle = \int dx\, x\, f_X(x)$.

Arbitrary power: $\langle X^n \rangle = \int dx\, x^n f_X(x)$.

Variance: $\sigma_x^2 = \langle X^2 \rangle - \langle X \rangle^2$.

Function of a random variable: if $y = g(x)$ and $\langle Y \rangle \equiv \int dy\, y\, f_Y(y)$, then it is easy to show that

$$\langle Y \rangle = \int dx\, g(x)\, f_X(x).$$

Proof:

$$\langle Y \rangle \equiv \int dy\, y\, f_Y(y) = \int dy\, y \sum_{j=1}^{n} \frac{f_X[x_j(y)]}{|dg[x_j(y)]/dx|}.$$

A change of variable, $dy = \left|\frac{dg}{dx}\right| dx$, yields the result.

Central moments: $\mu_n = \langle (X - \langle X \rangle)^n \rangle$.

17 Moment Tests: moments are useful for testing hypotheses, such as whether a given PDF is consistent with data. E.g., consistency with a Gaussian PDF:

kurtosis: $k = \dfrac{\mu_4}{\mu_2^2} - 3 = 0$

skewness parameter: $\gamma = \dfrac{\mu_3}{\mu_2^{3/2}} = 0$

$k > 0$ means the 4th moment is proportionately larger: a larger-amplitude tail than a Gaussian and less probable values near the mean.
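
A sketch of such a moment test in Python (illustrative, not from the lecture; scipy.stats.skew and scipy.stats.kurtosis are assumed available): estimate sample skewness and excess kurtosis and compare with the Gaussian values of zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
gauss = rng.normal(size=100_000)
expo  = rng.exponential(size=100_000)   # skewed, heavy right tail

for name, data in [("gaussian", gauss), ("exponential", expo)]:
    g = stats.skew(data)                # gamma = mu_3 / mu_2^{3/2}
    k = stats.kurtosis(data)            # excess kurtosis: mu_4 / mu_2^2 - 3
    print(f"{name:12s} skew = {g:+.3f}  excess kurtosis = {k:+.3f}")
# Expect ~(0, 0) for the Gaussian and ~(2, 6) for the exponential.
```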

18 Uses of Moments: often one wants to infer the underlying PDF of an observable, perhaps because determination of the PDF is tantamount to understanding the underlying physics of some process. Two approaches are:

1. construct a histogram and compare its shape with a theoretical shape;
2. determine some of the moments (usually low-order) and compare.

Suppose the data are $\{x_j,\ j = 1,\dots,N\}$.

1. One could form bins of size $\Delta x$ and count how many $x_j$ fall into each bin. If $N$ is large enough so that $n_k$, the number of points in the $k$-th bin, is also large, then a reasonably good estimate of the PDF can be made. (But beware of the dependence of results on the choice of binning.)

2. However, often $N$ is too small, or one would like to determine only basic information about the shape of the distribution (is it symmetric?), or determine the mean and variance of the PDF, or test whether the data are consistent with a given PDF (hypothesis testing).

19 Gaussian case: some typical situations are:

i) assume the data were drawn from a Gaussian parent PDF; estimate the mean and $\sigma$ of the Gaussian [parameter estimation];

ii) test whether the data are consistent with a Gaussian PDF [moment test].

Note that if the r.v. is zero mean, the PDF is determined solely by one parameter, $\sigma$:

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/2\sigma^2}.$$

The moments are

$$\langle x^n \rangle = \begin{cases} (n-1)!!\ \sigma^n, & n \text{ even} \\ 0, & n \text{ odd.} \end{cases}$$

Therefore the $n = 2$ moment is the first non-zero moment and determines all other moments. This statement carries over to multi-dimensional Gaussian processes: any moment of order higher than 3 is redundant... or can be used as a test for Gaussianity.

20 Characteristic Function: of considerable use is the characteristic function

$$\Phi_X(\omega) \equiv \langle e^{i\omega x} \rangle \equiv \int dx\, f_X(x)\, e^{i\omega x}.$$

If we know $\Phi_X(\omega)$ then we know all there is to know about the PDF, because

$$f_X(x) = \frac{1}{2\pi} \int d\omega\, \Phi_X(\omega)\, e^{-i\omega x}$$

is the inversion formula. If we know all the moments of $f_X(x)$, then we can also completely characterize $f_X(x)$: the characteristic function is a moment-generating function,

$$\Phi_X(\omega) = \langle e^{i\omega x} \rangle = \left\langle \sum_{n=0}^{\infty} \frac{(i\omega x)^n}{n!} \right\rangle = \sum_{n=0}^{\infty} \frac{(i\omega)^n}{n!} \langle X^n \rangle,$$

because the expectation of a sum is the sum of the expectations. By taking derivatives we can show that

$$\left.\frac{\partial \Phi}{\partial \omega}\right|_{\omega=0} = i\langle X \rangle, \qquad \left.\frac{\partial^2 \Phi}{\partial \omega^2}\right|_{\omega=0} = i^2 \langle X^2 \rangle, \qquad \left.\frac{\partial^n \Phi}{\partial \omega^n}\right|_{\omega=0} = i^n \langle X^n \rangle,$$

or

$$\langle X^n \rangle = i^{-n} \left.\frac{\partial^n \Phi}{\partial \omega^n}\right|_{\omega=0} = (-i)^n \left.\frac{\partial^n \Phi}{\partial \omega^n}\right|_{\omega=0} \qquad \text{(Price's theorem).}$$

Characteristic functions are useful for deriving PDFs of combinations of r.v.'s as well as for deriving particular moments.
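
A short numerical illustration (a sketch with assumed parameters, not from the lecture): estimate $\Phi_X(\omega) = \langle e^{i\omega x} \rangle$ by Monte Carlo for a zero-mean Gaussian and compare with the analytic form $e^{-\omega^2\sigma^2/2}$.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.5
x = rng.normal(scale=sigma, size=100_000)

omega = np.linspace(-3.0, 3.0, 13)
phi_mc = np.array([np.mean(np.exp(1j * w * x)) for w in omega])  # <e^{i w x}>
phi_exact = np.exp(-0.5 * omega**2 * sigma**2)  # Gaussian characteristic function

print("max |MC - exact|:", np.abs(phi_mc - phi_exact).max())    # small sampling error
```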

21 Joint Random Variables

Let $X$ and $Y$ be two random variables with their associated sample spaces. The actual events associated with $X$ and $Y$ may or may not be independent (e.g. throwing a die may map into $X$; choosing colored marbles from a hat may map into $Y$). The relationship of the events is described by the joint distribution function of $X$ and $Y$,

$$F_{XY}(x,y) \equiv P\{X \le x,\ Y \le y\},$$

and the joint probability density function (a two-dimensional PDF),

$$f_{XY}(x,y) \equiv \frac{\partial^2 F_{XY}(x,y)}{\partial x\, \partial y}.$$

Note that the one-dimensional PDF of $X$, for example, is obtained by integrating the joint PDF over all $y$,

$$f_X(x) = \int dy\, f_{XY}(x,y),$$

which corresponds to asking what the PDF of $X$ is given that the certain event for $Y$ occurs.

Example: flip two coins $a$ and $b$. Let heads = 1, tails = 0, and define two r.v.'s $X = a + b$ and $Y = a$. With these definitions $X$ and $Y$ are statistically dependent.

Characteristic function of joint r.v.'s:

$$\Phi_{XY}(\omega_1, \omega_2) = \langle e^{i(\omega_1 X + \omega_2 Y)} \rangle = \int dx \int dy\; e^{i(\omega_1 x + \omega_2 y)}\, f_{XY}(x,y).$$

For $x$, $y$ independent,

$$\Phi_{XY}(\omega_1, \omega_2) = \int dx\, f_X(x)\, e^{i\omega_1 x} \int dy\, f_Y(y)\, e^{i\omega_2 y} \equiv \Phi_X(\omega_1)\, \Phi_Y(\omega_2).$$

Example of independent r.v.'s: flip two coins $a$ and $b$. As before, heads = 1 and tails = 0; let $X = a$, $Y = b$ ($X$ and $Y$ are independent).

22 Independent Random Variables

Two random variables are said to be independent if the events mapping into one r.v. are independent of those mapping into the other. In this case, joint probabilities are factorable:

$$F_{XY}(x,y) = F_X(x)\, F_Y(y), \qquad f_{XY}(x,y) = f_X(x)\, f_Y(y).$$

Such factorization is plausible if one considers moments of independent r.v.'s,

$$\langle X^n Y^m \rangle = \langle X^n \rangle \langle Y^m \rangle,$$

which follows from

$$\langle X^n Y^m \rangle \equiv \int dx \int dy\; x^n y^m f_{XY}(x,y) = \int dx\, x^n f_X(x) \int dy\, y^m f_Y(y).$$

23 Convolution Theorem for Sums of Independent RVs

If $Z = X + Y$, where $X$, $Y$ are independent random variables, then the PDF of $Z$ is the convolution of the PDFs of $X$ and $Y$:

$$f_Z(z) = f_X * f_Y = \int dx\, f_X(x)\, f_Y(z - x) = \int dx\, f_X(z - x)\, f_Y(x).$$

Proof: by definition,

$$f_Z(z) = \frac{d}{dz} F_Z(z), \qquad F_Z(z) = P\{Z \le z\}.$$

Consider

$$F_Z(z) = P\{X + Y \le z\} = P\{Y \le z - X\}.$$

To evaluate this, first evaluate the probability $P\{Y \le z - x\}$ where $x$ is just a number:

$$P\{Y \le z - x\} \equiv F_Y(z - x) \equiv \int^{z-x} dy\, f_Y(y).$$

But $P\{Y \le z - X\}$ is the probability that $Y \le z - x$ over all values of $x$, so we need to integrate over $x$ and weight by the probability of $x$:

$$P\{Y \le z - X\} = \int dx\, f_X(x) \int^{z-x} dy\, f_Y(y);$$

that is, $P\{Y \le z - X\}$ is the expected value of $F_Y(z - x)$. By the Leibniz integration formula,

$$\frac{d}{db} \int_a^{g(b)} d\omega\, h(\omega) = h(g(b))\, \frac{dg(b)}{db},$$

we obtain the convolution result.
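
The theorem is easy to verify numerically; below is a minimal sketch (an illustration with invented parameters): the sum of two independent uniform variates should follow the triangular PDF obtained by convolving the two box PDFs.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-0.5, 0.5, size=500_000)
y = rng.uniform(-0.5, 0.5, size=500_000)
z = x + y                                   # sum of independent RVs

# Numerical convolution of the two uniform (box) PDFs on a grid.
grid = np.linspace(-1.0, 1.0, 801)
dx = grid[1] - grid[0]
box = np.where(np.abs(grid) <= 0.5, 1.0, 0.0)
f_z = np.convolve(box, box, mode="same") * dx   # f_Z = f_X * f_Y (triangle)

hist, edges = np.histogram(z, bins=80, range=(-1, 1), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print("max |histogram - convolution|:",
      np.abs(hist - np.interp(centers, grid, f_z)).max())
```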

24 Characteristic Function of $Z = X + Y$

For $X$, $Y$ independent we have

$$f_Z = f_X * f_Y \quad\Longleftrightarrow\quad \Phi_Z(\omega) = \langle e^{i\omega Z} \rangle = \Phi_X(\omega)\, \Phi_Y(\omega).$$

Variance of $Z$: if the variances of $X$ and $Y$ are $\sigma_X^2$, $\sigma_Y^2$, then the variance of $Z$ is $\sigma_Z^2 = \sigma_X^2 + \sigma_Y^2$. Assume $X$ and $Y$ (and hence $Z$) are zero-mean r.v.'s; then

$$\sigma_X^2 = \langle x^2 \rangle = i^{-2} \left.\frac{\partial^2 \Phi_X}{\partial \omega^2}\right|_{\omega=0} = -\left.\frac{\partial^2 \Phi_X}{\partial \omega^2}\right|_{\omega=0}, \qquad \sigma_Y^2 = \langle y^2 \rangle = -\left.\frac{\partial^2 \Phi_Y}{\partial \omega^2}\right|_{\omega=0}.$$

Using Price's theorem:

$$\sigma_Z^2 = \langle Z^2 \rangle = -\left.\frac{\partial^2 \Phi_Z}{\partial \omega^2}\right|_{\omega=0} = -\left.\frac{\partial^2}{\partial \omega^2}\left[\Phi_X(\omega)\, \Phi_Y(\omega)\right]\right|_{\omega=0} = -\left[\Phi_Y \frac{\partial^2 \Phi_X}{\partial \omega^2} + 2\, \frac{\partial \Phi_X}{\partial \omega} \frac{\partial \Phi_Y}{\partial \omega} + \Phi_X \frac{\partial^2 \Phi_Y}{\partial \omega^2}\right]_{\omega=0}.$$

At $\omega = 0$ we have $\Phi_X = \Phi_Y = 1$, and the first-derivative terms vanish for zero-mean r.v.'s. We have discovered that variances add (independent variables only):

$$\sigma_Z^2 = \sigma_X^2 + \sigma_Y^2.$$

25 Multivariate Random Variables ($N$-dimensional)

The results for the bivariate case are easily extrapolated. If

$$Z = X_1 + X_2 + \cdots + X_N = \sum_{j=1}^{N} X_j,$$

where the $X_j$ are all independent r.v.'s, then

$$f_Z(z) = f_{X_1} * f_{X_2} * \cdots * f_{X_N}, \qquad \Phi_Z = \prod_{j=1}^{N} \Phi_{X_j}(\omega), \qquad \sigma_Z^2 = \sum_{j=1}^{N} \sigma_{X_j}^2.$$

26 Central Limit Theorem

The Central Limit Theorem is a powerful tool for observational science because it can be used to invoke a priori distributions for measured quantities. In simple terms, any quantity that is the sum of many independent ones will have Gaussian statistics. We need to understand what constitutes independence and how Gaussian statistics play out for stochastic processes, which may be viewed as sequences of a large number of random variables. In addition, the CLT does not always apply!

Consider the sum

$$Z_N = \frac{1}{\sqrt{N}} \sum_{j=1}^{N} X_j,$$

where the $X_j$ are independent, identically distributed (iid) random variables with means and variances

$$\mu_j \equiv \langle X_j \rangle, \qquad \sigma_j^2 = \langle X_j^2 \rangle - \langle X_j \rangle^2,$$

and the PDFs of the $X_j$ are almost arbitrary. Restrictions on the distribution of each $X_j$ are:

i) $\sigma_j^2 > m > 0$, $m$ = constant;

ii) $\langle X_j^n \rangle < M$ = constant, for $n > 2$.

In the limit $N \to \infty$, $Z_N$ becomes a Gaussian random variable with mean and variance

$$\langle Z_N \rangle = \frac{1}{\sqrt{N}} \sum_{j=1}^{N} \mu_j, \qquad \sigma_Z^2 = \frac{1}{N} \sum_{j=1}^{N} \sigma_j^2.$$

27 Example of Arithmetic Sums of Uniformly Distributed Numbers

Consider the sum

$$\bar{x} = \frac{1}{N} \sum_{j=1}^{N} x_j,$$

where the $x_j$ are drawn from a uniform PDF, $x_j \in [-1/2, 1/2]$. The figure shows counts of $\bar{x}$ for averages of length $N = 1, 2, 4, 8$, and 16 for 100 realizations.

28 Example of Arithmetic Sums of Uniformly Distributed Numbers (continued)

Histograms based on $10^5$ realizations are shown in the figure below. Two important features are:

1. the width of the PDF scales as $N^{-1/2}$;
2. the shape of the PDF tends toward a Gaussian form with $\langle \bar{x} \rangle = 0$ and $\sigma_{\bar{x}} = \langle \bar{x}^2 \rangle^{1/2} = 1/\sqrt{12N}$.
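
This experiment is easy to reproduce; here is a minimal sketch (assumed to match the slide's setup: uniform variates on [-1/2, 1/2] and $10^5$ realizations) showing the $N^{-1/2}$ scaling of the width.

```python
import numpy as np

rng = np.random.default_rng(11)
n_real = 100_000

for N in (1, 2, 4, 8, 16):
    xbar = rng.uniform(-0.5, 0.5, size=(n_real, N)).mean(axis=1)
    print(f"N={N:2d}  sample std = {xbar.std():.4f}  "
          f"1/sqrt(12N) = {1/np.sqrt(12*N):.4f}")
# The sample widths track 1/sqrt(12 N), and histograms of xbar
# (e.g. np.histogram(xbar, bins=50, density=True)) approach a Gaussian.
```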

29 Cauchy Random Numbers

A Cauchy distribution is

$$f_X(x) = \frac{\alpha}{\pi}\, \frac{1}{\alpha^2 + x^2};$$

its characteristic function is

$$\Phi_X(\omega) = e^{-\alpha|\omega|}.$$

Using uniformly distributed numbers $u \in [-1/2, 1/2]$, Cauchy random numbers $x$ can be generated using the transformation $x = \tan(\pi u)$.

Check: using the change-of-variable theorem,

$$f_X(x) = \frac{f_u(u(x))}{|dx/du|} = \frac{1}{\pi}\, \frac{1}{1 + x^2},$$

where

$$\frac{dx}{du} = \frac{d}{du} \tan(\pi u) = \frac{\pi}{\cos^2 \pi u} = \pi (1 + x^2)$$

has been used.

30 Example of Arithmetic Sums of Cauchy Distributed Numbers

Histograms based on $10^5$ realizations are shown in the figure below. Here neither the width nor the shape of the PDF changes as $N$ gets larger! What's happening?
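
The failure can be seen numerically with a short sketch (illustrative; it uses the tan(πu) transformation from the previous slide, and the interquartile range as a width measure, since the variance diverges):

```python
import numpy as np

rng = np.random.default_rng(13)
n_real = 50_000

for N in (1, 4, 16, 64):
    u = rng.uniform(-0.5, 0.5, size=(n_real, N))
    xbar = np.tan(np.pi * u).mean(axis=1)       # average of N Cauchy variates
    q25, q75 = np.percentile(xbar, [25, 75])
    print(f"N={N:3d}  interquartile range = {q75 - q25:.3f}")
# The IQR stays ~2 for every N: the average of N Cauchy RVs is itself
# Cauchy with the same scale, so the CLT does not apply.
```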

31 First let's understand the case of summing uniformly distributed numbers. Again consider $X_j$ that are all uniformly distributed between $\pm\frac{1}{2}$. The PDF of a single random variable is $f_X(x) = \Pi(x)$ (the unit rectangle function), and it is a Fourier-transform pair with its characteristic function:

$$f_X(x) = \Pi(x) \quad\Longleftrightarrow\quad \Phi_j(\omega) = \langle e^{i\omega X_j} \rangle = \frac{\sin \omega/2}{\omega/2} = \frac{\sin \pi f}{\pi f} \quad (\omega = 2\pi f).$$

[Graph: $\Pi(x)$ and its transform $\sin(\omega/2)/(\omega/2)$.]

32 Now consider sums of $N$ RVs of this type (i.e. $\sum_j X_j$) and use the convolution theorem to evaluate the characteristic function of the sum. Graphically: for $N = 2$ the characteristic function is $\left(\frac{\sin \omega/2}{\omega/2}\right)^2$, for $N = 3$ it is $\left(\frac{\sin \omega/2}{\omega/2}\right)^3$, and as $N \to \infty$ it approaches a Gaussian $\sim e^{-\omega^2}$, whose transform is a Gaussian $\sim e^{-x^2}$ in $x$.

33 We need to rescale the sum to find the characteristic function of

$$Z_N = \frac{1}{\sqrt{N}} \sum_{j=1}^{N} x_j$$

defined earlier. From the convolution results we have

$$\Phi_{\sqrt{N} Z_N}(\omega) = \left(\frac{\sin \omega/2}{\omega/2}\right)^N.$$

From the transformation of random variables we have that

$$f_{Z_N}(x) = \sqrt{N}\, f_{\sqrt{N} Z_N}(\sqrt{N}\, x),$$

and by the scaling theorem for Fourier transforms

$$\Phi_{Z_N}(\omega) = \Phi_{\sqrt{N} Z_N}\!\left(\frac{\omega}{\sqrt{N}}\right) = \left(\frac{\sin(\omega/2\sqrt{N})}{\omega/2\sqrt{N}}\right)^N.$$

34 If the CLT holds,

$$\lim_{N\to\infty} \Phi_{Z_N}(\omega) = e^{-\frac{1}{2}\omega^2 \sigma_Z^2} \quad\Longleftrightarrow\quad f_{Z_N}(x) = \frac{1}{\sqrt{2\pi\sigma_Z^2}}\, e^{-x^2/2\sigma_Z^2}.$$

Consistency with this limiting form can be seen by expanding $\Phi_{Z_N}$ for small $\omega$:

$$\Phi_{Z_N}(\omega) \approx \left[\frac{(\omega/2\sqrt{N}) - \frac{1}{3!}(\omega/2\sqrt{N})^3}{\omega/2\sqrt{N}}\right]^N = \left[1 - \frac{\omega^2}{24N}\right]^N \approx 1 - \frac{\omega^2}{24},$$

which is identical to the expansion of $\exp(-\omega^2 \sigma_Z^2/2)$ with $\sigma_Z^2 = 1/12$.
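
This convergence is quick to check numerically; a minimal sketch (using the 1/12 variance of the unit uniform, so the target is $e^{-\omega^2/24}$):

```python
import numpy as np

# Check that the rescaled-sum characteristic function approaches
# exp(-omega^2 / 24)  (sigma_Z^2 = 1/12 for uniform RVs on [-1/2, 1/2]).
omega = np.linspace(-5.0, 5.0, 201)
target = np.exp(-omega**2 / 24.0)

for N in (2, 8, 32, 128):
    t = omega / (2.0 * np.sqrt(N))
    phi = np.sinc(t / np.pi) ** N        # np.sinc(x) = sin(pi x)/(pi x)
    print(f"N={N:4d}  max |phi - target| = {np.abs(phi - target).max():.4f}")
```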

35 Why the CLT Does Not Work for a Sum of Cauchy Variables

The Cauchy distribution and its characteristic function are

$$f_X(x) = \frac{\alpha}{\pi}\, \frac{1}{\alpha^2 + x^2}, \qquad \Phi(\omega) = e^{-\alpha|\omega|}.$$

In this case

$$Z_N = \frac{1}{\sqrt{N}} \sum_{j=1}^{N} x_j$$

has characteristic function

$$\Phi_N(\omega) = e^{-N\alpha|\omega|/\sqrt{N}} = e^{-\sqrt{N}\,\alpha|\omega|}.$$

By inspection, the exponential will not converge to a Gaussian. Instead, the sum of $N$ Cauchy RVs is a Cauchy RV.

Is the Cauchy distribution a legitimate PDF for the CLT? No! The variance diverges:

$$\langle X^2 \rangle = \int dx\, \frac{x^2 \alpha}{\pi}\, \frac{1}{\alpha^2 + x^2} \to \infty.$$

The Cauchy distribution is an example of a stable distribution, defined as one with the property that a linear combination of two independent copies of the variable has the same distribution, to within a location and a scale parameter. The family of stable distributions is sometimes called the Lévy alpha-stable distributions (Lévy flights, etc.). There is a generalized CLT involving PDFs with long power-law tails.

http://en.wikipedia.org/wiki/Lévy_distribution
http://en.wikipedia.org/wiki/Stable_distribution

36 Conditional Probabilities and Bayes Theorem

We have considered $P(\zeta)$, the probability of an event $\zeta$. Also obeying the axioms of probability are conditional probabilities: $P(\psi|\zeta)$, the probability of the event $\psi$ given that the event $\zeta$ has occurred. Define

$$P(\psi|\zeta) \equiv \frac{P(\psi\zeta)}{P(\zeta)}$$

and recast the axioms as:

I. $P(\psi|\zeta) \ge 0$

II. $P(\psi|\zeta) + P(\bar\psi|\zeta) = 1$

III. $P(\psi\zeta|\eta) = P(\psi|\eta)\, P(\zeta|\psi\eta) = P(\zeta|\eta)\, P(\psi|\zeta\eta)$

37 From Bayes Theorem to Bayesian Inference

How does this relate to experiments? Use the product rule:

$$P(\zeta|\psi\eta) = \frac{P(\zeta|\eta)\, P(\psi|\zeta\eta)}{P(\psi|\eta)},$$

or, letting $M$ = model (or hypothesis), $D$ = data, and $I$ = background information (assumptions),

$$P(M|DI) = P(M|I)\, \frac{P(D|MI)}{P(D|I)}.$$

Terms:

prior: $P(M|I)$
sampling distribution for $D$: $P(D|MI)$ (also called the likelihood for $M$)
prior predictive for $D$: $P(D|I)$ (also called the global likelihood for $M$, or the evidence for $M$)

38 Particular strengths of the Bayesian method include:

1. One must often be explicit about what is assumed about $I$, the background information.
2. In assessing models, we get a PDF for parameters rather than just point estimates.
3. Occam's razor (simpler models win, all else being equal) is easily invoked when comparing models.

We may have many different models $M_i$ that we wish to compare. Form the odds ratio from the posterior PDFs $P(M_i|DI)$:

$$O_{i,j} \equiv \frac{P(M_i|DI)}{P(M_j|DI)} = \frac{P(M_i|I)}{P(M_j|I)}\, \frac{P(D|M_iI)}{P(D|M_jI)}.$$
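
A toy sketch of such a model comparison (everything here is an invented example, not the lecture's): compare a fair-coin model against a flat-prior biased-coin model for made-up flip data. With equal prior odds, the odds ratio reduces to the ratio of global likelihoods.

```python
import numpy as np
from scipy.special import betaln

# Made-up data D: k = 9 heads in n = 12 flips.
n, k = 12, 9

# Model 1: fair coin, theta = 1/2 exactly.
log_like_1 = k * np.log(0.5) + (n - k) * np.log(0.5)

# Model 2: unknown theta with a flat prior; the global likelihood P(D|M2,I)
# marginalizes theta: integral of theta^k (1-theta)^(n-k) dtheta = B(k+1, n-k+1).
log_like_2 = betaln(k + 1, n - k + 1)

# Equal prior odds, so the odds ratio is the ratio of global likelihoods.
O_12 = np.exp(log_like_1 - log_like_2)
print(f"odds ratio O_12 = {O_12:.2f}")   # < 1 favors the biased-coin model
```

Note how Occam's razor enters automatically: the flexible model spreads its prior predictive over many possible data sets, so it is penalized unless the data demand it.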

39 Full document on Probability and Random Variables

40 Probability and Random Processes

Experiments: set up certain conditions; the possible outcomes are called events. The event space $S$ is the set of all outcomes or events $\zeta_i$, $i = 1,\dots,N$. Events may or may not be quantitative; e.g. the experiment may consist of choosing colored marbles from a hat.

Detections are experiments designed to answer the question: is an effect present in this physical system?

Measurements are experiments designed to yield quantitative measures of some physical phenomenon. Measurements are simply a highly structured form of interaction with a physical system. As such, they are never precise. Estimation of physical parameters is the best one can do, even for values of fundamental constants.

Probability: the notion of probability arises when we wish to consider the likelihood of a given event occurring, or if we wish to estimate the number of times an identifiable event will occur if we repeat a given experiment $N$ times.

Event space: events $\zeta_i \in S$. Events are possible outcomes of experiments. Events can be combined to define new events. The set of all events is the event space.

41 As such, probability is a theoretical quantity and is not the same as the frequency of occurrence of an event in repeated trials of an experiment. Of course, one can estimate probabilities from repeated trials. We will consider probability to be the underpinning of experiments, and we will require it to behave according to three axioms. Let $\zeta$ be an event in $S$; then:

i) $0 \le P(\zeta) \le 1$

ii) $P(S) = 1$ ($S$ = space of all events)

iii) If two events $\zeta$ and $\psi$ are mutually exclusive [i.e. they cannot both occur], then the probability of the event $\zeta + \psi$ (the event that $\zeta$ or $\psi$ occurs) is $P(\zeta + \psi) = P(\zeta) + P(\psi)$ ("+" means "or").

From the axioms, one can construct such results as:

1. If $\bar A$ = event that $A$ does not occur, then $P(\bar A) = 1 - P(A)$.

2. If the occurrence of $A$ is a sufficient condition for $B$ occurring [$A \Rightarrow B$, but $B$ may occur when $A$ does not], then $P(A) \le P(B)$.

3. $P(A + B) = P(A) + P(B) - P(AB)$, where $P(AB)$ = probability that both $A$ and $B$ occur; $P(AB) \ge 0$, with equality when $A$, $B$ are mutually exclusive, so $P(A + B) \le P(A) + P(B)$.

42 I. Mutually exclusive events: if $a$ occurs then $b$ cannot have occurred. Let $c = a + b$ ("or", same as $a \cup b$):

$$P(c) = P\{a \text{ or } b \text{ occurred}\} = P(a) + P(b).$$

Let $d = ab$ ("and", same as $a \cap b$):

$$P(d) = P\{a \text{ and } b \text{ occurred}\} = 0 \quad \text{if mutually exclusive.}$$

II. Non-mutually exclusive events:

$$P(c) = P\{a \text{ or } b\} = P(a) + P(b) - P(ab).$$

III. Independent events:

$$P(ab) \equiv P(a)\, P(b).$$

43 Examples

I. Mutually exclusive events: toss a coin once. There are two possible outcomes, H and T. H and T are mutually exclusive. H and T are not independent, because $P(HT) = P\{\text{heads and tails}\} = 0$, so $P(HT) \ne P(H)\,P(T)$.

II. Independent events: toss a coin twice (the experiment). The outcomes of the experiment (1st toss, 2nd toss) are $H_1H_2$, $H_1T_2$, $T_1H_2$, $T_1T_2$. Events might be defined as:

$H_1H_2$ = event that H on 1st toss, H on 2nd
$H_1T_2$ = event that H on 1st toss, T on 2nd
$T_1H_2$ = event that T on 1st toss, H on 2nd
$T_1T_2$ = event that T on 1st toss, T on 2nd

Note $P(H_1H_2) = P(H_1)\,P(H_2)$ [as long as the coin is not altered between tosses].

44 Random Variables

Of interest to us is the distribution of probability along the real number axis. Random variables assign numbers to events or, more precisely, map the event space into a set of numbers:

$$a \ (\text{event}) \;\longrightarrow\; X(a) \ (\text{number}).$$

The definition of probability translates directly over to the numbers that are assigned by random variables. The following properties hold for a real random variable:

1. $\{X \le x\}$ = event that the r.v. $X$ is less than or equal to the number $x$, defined for all $x$ [this defines all intervals on the real number line to be events].

2. The events $\{X = +\infty\}$ and $\{X = -\infty\}$ have zero probability. (Otherwise, moments would generally not be finite.)

45 Distribution Function (CDF = Cumulative Distribution Function):

$$F_X(x) = P\{X \le x\} \equiv P\{\text{all events } A : X(A) \le x\}$$

Properties:
1. $F_X(x)$ is a monotonically increasing function of $x$.
2. $F(-\infty) = 0$, $F(+\infty) = 1$.
3. $P\{x_1 \le X \le x_2\} = F(x_2) - F(x_1)$.

Probability Density Function (PDF):

$$f_X(x) = \frac{dF_X(x)}{dx}$$

Properties:
1. $f_X(x)\, dx = P\{x \le X \le x + dx\}$.
2. $\int_{-\infty}^{\infty} dx\, f_X(x) = F_X(\infty) - F_X(-\infty) = 1 - 0 = 1$.

Continuous RVs: the derivative of $F_X(x)$ exists for all $x$.

Discrete random variables: use delta functions to write the PDF in pseudo-continuous form, e.g. coin flipping. Let

$$X = \begin{cases} +1 & \text{heads} \\ -1 & \text{tails}; \end{cases}$$

then

$$f_X(x) = \frac{1}{2}\left[\delta(x+1) + \delta(x-1)\right], \qquad F_X(x) = \frac{1}{2}\left[U(x+1) + U(x-1)\right],$$

where $U$ is the unit step function.



More information

STAT 418: Probability and Stochastic Processes

STAT 418: Probability and Stochastic Processes STAT 418: Probability and Stochastic Processes Spring 2016; Homework Assignments Latest updated on April 29, 2016 HW1 (Due on Jan. 21) Chapter 1 Problems 1, 8, 9, 10, 11, 18, 19, 26, 28, 30 Theoretical

More information

Order Statistics and Distributions

Order Statistics and Distributions Order Statistics and Distributions 1 Some Preliminary Comments and Ideas In this section we consider a random sample X 1, X 2,..., X n common continuous distribution function F and probability density

More information

Introduction to Probability Theory

Introduction to Probability Theory Introduction to Probability Theory Ping Yu Department of Economics University of Hong Kong Ping Yu (HKU) Probability 1 / 39 Foundations 1 Foundations 2 Random Variables 3 Expectation 4 Multivariate Random

More information

4 Lecture 4 Notes: Introduction to Probability. Probability Rules. Independence and Conditional Probability. Bayes Theorem. Risk and Odds Ratio

4 Lecture 4 Notes: Introduction to Probability. Probability Rules. Independence and Conditional Probability. Bayes Theorem. Risk and Odds Ratio 4 Lecture 4 Notes: Introduction to Probability. Probability Rules. Independence and Conditional Probability. Bayes Theorem. Risk and Odds Ratio Wrong is right. Thelonious Monk 4.1 Three Definitions of

More information

STAT Chapter 5 Continuous Distributions

STAT Chapter 5 Continuous Distributions STAT 270 - Chapter 5 Continuous Distributions June 27, 2012 Shirin Golchi () STAT270 June 27, 2012 1 / 59 Continuous rv s Definition: X is a continuous rv if it takes values in an interval, i.e., range

More information

Brief Review of Probability

Brief Review of Probability Brief Review of Probability Nuno Vasconcelos (Ken Kreutz-Delgado) ECE Department, UCSD Probability Probability theory is a mathematical language to deal with processes or experiments that are non-deterministic

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

STAT 712 MATHEMATICAL STATISTICS I

STAT 712 MATHEMATICAL STATISTICS I STAT 72 MATHEMATICAL STATISTICS I Fall 207 Lecture Notes Joshua M. Tebbs Department of Statistics University of South Carolina c by Joshua M. Tebbs TABLE OF CONTENTS Contents Probability Theory. Set Theory......................................2

More information

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416)

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) D. ARAPURA This is a summary of the essential material covered so far. The final will be cumulative. I ve also included some review problems

More information

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:

More information

Communication Theory II

Communication Theory II Communication Theory II Lecture 5: Review on Probability Theory Ahmed Elnakib, PhD Assistant Professor, Mansoura University, Egypt Febraury 22 th, 2015 1 Lecture Outlines o Review on probability theory

More information

Random Variables and Their Distributions

Random Variables and Their Distributions Chapter 3 Random Variables and Their Distributions A random variable (r.v.) is a function that assigns one and only one numerical value to each simple event in an experiment. We will denote r.vs by capital

More information

ECE353: Probability and Random Processes. Lecture 7 -Continuous Random Variable

ECE353: Probability and Random Processes. Lecture 7 -Continuous Random Variable ECE353: Probability and Random Processes Lecture 7 -Continuous Random Variable Xiao Fu School of Electrical Engineering and Computer Science Oregon State University E-mail: xiao.fu@oregonstate.edu Continuous

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

More information

Homework 2. Spring 2019 (Due Thursday February 7)

Homework 2. Spring 2019 (Due Thursday February 7) ECE 302: Probabilistic Methods in Electrical and Computer Engineering Spring 2019 Instructor: Prof. A. R. Reibman Homework 2 Spring 2019 (Due Thursday February 7) Homework is due on Thursday February 7

More information

Dynamic Programming Lecture #4

Dynamic Programming Lecture #4 Dynamic Programming Lecture #4 Outline: Probability Review Probability space Conditional probability Total probability Bayes rule Independent events Conditional independence Mutual independence Probability

More information

Lecture 1: Basics of Probability

Lecture 1: Basics of Probability Lecture 1: Basics of Probability (Luise-Vitetta, Chapter 8) Why probability in data science? Data acquisition is noisy Sampling/quantization external factors: If you record your voice saying machine learning

More information

1 Variance of a Random Variable

1 Variance of a Random Variable Indian Institute of Technology Bombay Department of Electrical Engineering Handout 14 EE 325 Probability and Random Processes Lecture Notes 9 August 28, 2014 1 Variance of a Random Variable The expectation

More information

Review: mostly probability and some statistics

Review: mostly probability and some statistics Review: mostly probability and some statistics C2 1 Content robability (should know already) Axioms and properties Conditional probability and independence Law of Total probability and Bayes theorem Random

More information

Set Theory Digression

Set Theory Digression 1 Introduction to Probability 1.1 Basic Rules of Probability Set Theory Digression A set is defined as any collection of objects, which are called points or elements. The biggest possible collection of

More information