Lecturer: Allen Caldwell, Max Planck Institute for Physics & TUM
Recitation Instructor: Oleksander (Alex) Volynets, MPP & TUM

General Information:
- Lectures will be held in English, Mondays 16:00-18:00
- Recitation: Mondays 14:00-16:00, CIP room, starting May 9, 2011
- Exercises will be given at the end of each lecture, for you to try out. They will be discussed, along with other material, in the recitation of the following week
- Course material available under:
  http://www.mpp.mpg.de/~caldwell/ss11.html
  http://www.mpp.mpg.de/~volynets/ss11/

NB: you will get the most out of this course if you can practice solving problems on your own computer!

Lecture 1
Modeling and Data

We imagine flipping a coin, rolling dice, or picking a lottery number. The initial conditions are not known, so we assume symmetry and say every outcome is equally likely:
- coin: heads and tails have equal chance; the probability of each is 1/2
- dice: each number on each die is equally likely, so each pair (6x6) is equally likely, e.g. (1,1), (3,4), ...

I.e., we make a model for the physical process. The model contains assumptions (e.g., each outcome equally likely).
- given the model, we can make predictions and compare these to the data
- from the comparison, we decide if the model is reasonable
Theory

[Diagram: the model or theory M, with parameters λ, predicts distributions g(y|λ,M) for the theoretical observables y. The modeling of the experiment turns these into distributions f(x|λ,M) for possible data outcomes x, which are compared to the measured data D from the experiment.]
Notation

g(y|λ,M) is understood as a probability density, i.e., the probability that y is in the interval (y, y+dy) given the model M and the parameter values specified by λ.

E.g., if we are considering the decay of an unstable particle, the probability density for the decay to occur at time t is

g(t|τ) ∝ e^(−t/τ)

for a single particle, assuming a constant decay probability per unit time.
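As a quick numerical sketch (in Python; the choice τ = 2 is purely illustrative), we can check that the normalized density g(t|τ) = (1/τ) e^(−t/τ) integrates to 1 over t ≥ 0:

```python
import math

def decay_density(t, tau):
    """Probability density for a single particle to decay at time t,
    assuming a constant decay probability per unit time."""
    return math.exp(-t / tau) / tau

# Numerical check: the density should integrate to 1 over t >= 0
tau = 2.0                       # illustrative lifetime
dt = 1e-3
total = sum(decay_density(i * dt, tau) * dt
            for i in range(int(50 * tau / dt)))
print(round(total, 3))          # close to 1
```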
How we Learn

[Diagram: deduction, from theory to predicted outcomes.]
How we Learn

We learn by comparing measured data with the distributions of predicted results, assuming a theory, parameters, and a modeling of the experimental process.

What we typically want to know:
- Is the theory reasonable? I.e., is the observed data a likely result from this theory (+ experiment)?
- If we have more than one potential explanation, we want to be able to quantify which theory is more likely to be correct given the observations.
- Assuming we have a reasonable theory, we want to estimate the most probable values of the parameters, and their uncertainties. This includes setting limits (>, < some value at XX% probability).
Logical Basis

Model building and making predictions from models follow deductive reasoning:

Given A → B (major premise)
Given B → C (major premise)

Then, given A you can conclude that C is true, etc.

Everything is clear: we can make frequency distributions of possible outcomes within the model, etc. This is math, so it is correct.
Logical Basis

However, in physics what we want to know is the validity of the model given the data, i.e., logic of the form:

Given A → C
Measure C; what can we say about A?

Well, maybe also A_1 → C, A_2 → C, ...

We now need inductive logic. We can never say anything absolutely conclusive about A unless we can guarantee a complete set of alternatives A_i and only one of them can give outcome C. This does not happen in science, so we can never say we have found the true model.
Logical Basis

Instead of truth, we consider knowledge:

Knowledge = justified true belief. Justification comes from the data.

- Start with some knowledge, or maybe plain belief
- Do the experiment
- Data analysis gives updated knowledge
Formulation of Data Analysis

In the following, I will formulate data analysis as a knowledge-updating scheme:

knowledge + data → updated knowledge

This leads to the usual Bayes equation, but I prefer this derivation to the usual one in the textbooks.
Formulation

We require that

P(x|λ,M) ≥ 0 and ∫ P(x|λ,M) dx = 1

although, as we will see, the normalization condition is not really needed.

The modeling of the experiment will typically add other (nuisance) parameters. E.g., there are often uncertainties, such as the energy scale of the experiment. Different assumptions on these lead to different predictions for the data. We can have

P(x|λ,ν,M)

where ν represents our nuisance parameters.
Formulation

The expected distribution (density) of the data, assuming a model M and parameters λ, is written as P(x|λ,M), where x is a possible realization of the data. There are different possible definitions of this function.

Imagine we flip a coin 10 times and get the following result:

T H T H H T H T T H

We now repeat the process with a different coin and get

T T T T T T T T T T

Which outcome has higher probability?
Take a model where H and T are equally likely. Then

outcome 1: prob = (1/2)^10
outcome 2: prob = (1/2)^10

Does something seem wrong with this result? This is because we evaluate many probabilities at once: the result above is the probability for any particular sequence of ten flips of a fair coin.

Given a fair coin, we could also calculate the chance of getting n times H:

P(n) = (10 choose n) (1/2)^10
And we find the following result (each entry divided by 2^10 = 1024):

n:          0    1    2    3    4    5    6    7    8    9   10
P(n)·2^10:  1   10   45  120  210  252  210  120   45   10    1

There are many more ways to get 5 H than 0 H, which is why the first result somehow looks more probable, even though each individual sequence has exactly the same probability in the model.

Maybe the model is wrong and one coin is not fair? How would we test this?
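These numbers are easy to reproduce; a small Python sketch:

```python
from math import comb

# Probability of n heads in 10 flips of a fair coin: C(10, n) / 2**10
N = 10
probs = {n: comb(N, n) / 2**N for n in range(N + 1)}
for n, p in probs.items():
    print(n, f"{p:.5f}")

# Each individual sequence has probability (1/2)**10, but there are
# comb(10, 5) = 252 distinct sequences with exactly 5 heads.
```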
The message: there are usually many ways to define the probability of your data. Which is better, or whether to use several, depends on what you are trying to do.

E.g., suppose we have measured times in an exponential decay. We can define the probability density as

P({t}|τ) = ∏_{i=1}^{N} (1/τ) e^(−t_i/τ)

Or we can count events in time intervals and compare to expectations:

P({n}|τ) = ∏_{j=1}^{M} e^(−ν_j) ν_j^{n_j} / n_j!

where ν_j = expected events in bin j and n_j = observed events in bin j.
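Both definitions can be sketched on the same toy dataset (the times, bin edges, and τ = 1 below are made up purely for illustration):

```python
import math

times = [0.3, 1.1, 0.7, 2.5, 0.2, 1.8]   # hypothetical measured decay times
tau = 1.0                                 # illustrative lifetime

# Definition 1: unbinned likelihood, product of densities (1/tau) exp(-t_i/tau)
L_unbinned = math.prod(math.exp(-t / tau) / tau for t in times)

# Definition 2: binned likelihood, Poisson probability in each bin
edges = [0.0, 1.0, 2.0, 3.0]
n_obs = [sum(lo <= t < hi for t in times)
         for lo, hi in zip(edges, edges[1:])]
n_tot = len(times)
# expected counts: nu_j = N_tot * (F(hi) - F(lo)), with F(t) = 1 - exp(-t/tau)
nu = [n_tot * (math.exp(-lo / tau) - math.exp(-hi / tau))
      for lo, hi in zip(edges, edges[1:])]
L_binned = math.prod(math.exp(-v) * v**n / math.factorial(n)
                     for v, n in zip(nu, n_obs))

print(L_unbinned, L_binned)
```

The two numbers differ: they are different (both legitimate) definitions of "the probability of the data".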
Formulation

For the model, we have 0 ≤ P(M) ≤ 1. For a fully Bayesian analysis, we require

Σ_i P(M_i) = 1

For the parameters, assuming a model, we have

P(λ|M_i) ≥ 0 and ∫ P(λ|M_i) dλ = 1

The joint probability distribution is

P(λ,M) = P(λ|M) P(M)

and

Σ_i P(M_i) ∫ P(λ|M_i) dλ = 1
Learning Rule

P_{i+1}(λ,M|D) ∝ P(x=D|λ,M) P_i(λ,M)

where the index i represents a state of knowledge.

We have to satisfy our normalization condition, so

P_{i+1}(λ,M|D) = P(x=D|λ,M) P_i(λ,M) / [ Σ_M ∫ P(x=D|λ,M) P_i(λ,M) dλ ]

We usually write P_0(λ,M) for the starting distribution: this is our prior information before performing the measurement.
Learning Rule

P_{i+1}(λ,M|D) = P(x=D|λ,M) P_i(λ,M) / [ Σ_M ∫ P(x=D|λ,M) P_i(λ,M) dλ ]

The denominator is the probability to get the data, summing over all possible models and all possible values of the parameters:

P(D) = Σ_M ∫ P(x=D|λ,M) P_i(λ,M) dλ

so

P_{i+1}(λ,M|D) = P(x=D|λ,M) P_i(λ,M) / P(D)

This is the Bayes equation.
Bayes-Laplace Equation

Here is the standard derivation:

P(A,B) = P(A|B) P(B)
P(A,B) = P(B|A) P(A)

so

P(B|A) = P(A|B) P(B) / P(A)

[Venn diagram: sample space S containing sets A and B with overlap A∩B.]

This is clear for logic propositions and well-defined S, A, B. In our case, B = model + parameters, A = data.
Notation-cont.

Cumulative distribution function:

F(a) = Σ_{x_i ≤ a} P(x_i|θ)        (for probabilities)
F(a) = ∫_{−∞}^{a} f(x|θ) dx        (for probability densities)

0 ≤ F(a) ≤ 1
P(a ≤ x ≤ b) = F(b) − F(a)

(Equality may not be possible in the discrete case.)

Expectation value:

E[x] = Σ_i x_i P(x_i|θ)            (for probabilities)
E[x] = ∫ x f(x|θ) dx               (for probability densities)
E[u(x)] = ∫ u(x) f(x|θ) dx

For u(x), v(x) any two functions of x, E[u+v] = E[u] + E[v]. For c, k any constants, E[cu+k] = cE[u] + k.
Notation-cont.

The n-th moment of the variable is given by

α_n ≡ E[x^n] = ∫ x^n f(x|θ) dx

(for discrete probabilities, integrals → sums in the obvious way).

µ ≡ α_1 = E[x] is known as the mean.

The n-th central moment of x:

m_n ≡ E[(x − α_1)^n] = ∫ (x − α_1)^n f(x|θ) dx

σ² ≡ V[x] ≡ m_2 = α_2 − µ² is known as the variance, and σ is known as the standard deviation.

µ and σ (or σ²) are the most commonly used measures to characterize a distribution.
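As a sketch, the relation σ² = α_2 − µ² can be checked for a fair six-sided die:

```python
# Check sigma^2 = alpha_2 - mu^2 for a fair six-sided die
faces = range(1, 7)
p = 1 / 6                                  # equal probability per face

mu = sum(x * p for x in faces)             # alpha_1 = E[x]
alpha2 = sum(x**2 * p for x in faces)      # alpha_2 = E[x^2]
var_from_moments = alpha2 - mu**2

var_direct = sum((x - mu)**2 * p for x in faces)   # m_2 = E[(x - mu)^2]
print(mu, var_from_moments)                # 3.5 and 35/12 = 2.9166...
```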
Notation-cont.

Other useful characteristics:
- the most-probable value (mode) is the value of x which maximizes f(x;θ)
- the median is a value of x such that F(x_med) = 0.5

[Figure: a skewed density f(x) with the mode, median, and mean marked at different positions.]
Examples of using Bayes Theorem

A particle detector has an efficiency of 95% for detecting particles of type A and an efficiency of 2% for detecting particles of type B. Assume the detector gives a positive signal. What can be concluded about the probability that the particle was of type A?

Answer: NOTHING. It is first necessary to know the relative flux of particles of type A and B.

Now assume that we know that 90% of particles are of type B and 10% are of type A. Then we can calculate:

P(A|signal) = P(signal|A) P(A) / [ P(signal|A) P(A) + P(signal|B) P(B) ]
            = 0.95 · 0.1 / (0.95 · 0.1 + 0.02 · 0.9) = 0.84
We are told in the problem that we know P(A), P(B), P(signal|A), and P(signal|B). This information was somehow determined separately, possibly as a frequency, and it is the job of the experimenter to determine it.

Suppose we want the signal-to-background ratio for a sample of many measurements, where signal is the number of particles of type A and background is the number of particles of type B:

P(A|signal) / P(B|signal) = P(signal|A) P(A) / [ P(signal|B) P(B) ]
                          = (0.95/0.02)(0.1/0.9) ≈ 5.3
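Both results of this example can be reproduced in a few lines (Python sketch using the numbers given above):

```python
# Numbers from the example: P(signal|A) = 0.95, P(signal|B) = 0.02,
# P(A) = 0.1, P(B) = 0.9
p_sig_A, p_sig_B = 0.95, 0.02
p_A, p_B = 0.1, 0.9

p_sig = p_sig_A * p_A + p_sig_B * p_B        # total probability of a signal
p_A_given_sig = p_sig_A * p_A / p_sig        # Bayes theorem
ratio = (p_sig_A * p_A) / (p_sig_B * p_B)    # signal-to-background ratio

print(round(p_A_given_sig, 2), round(ratio, 1))  # 0.84 and 5.3
```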
Notation-cont.

For two random variables x, y, define the joint p.d.f. f(x,y), where we leave off the parameters as shorthand. The probability that x is in the range (x, x+dx) and simultaneously y is in the range (y, y+dy) is f(x,y) dx dy.

To evaluate expectation values etc., we usually need the marginal p.d.f. The marginal p.d.f. of x (y unobserved) is

P_x(x) = ∫ f(x,y) dy

The mean of x is then

µ_x = ∫∫ x f(x,y) dx dy = ∫ x P_x(x) dx

The covariance of x and y is defined as

cov[x,y] = E[(x − µ_x)(y − µ_y)] = E[xy] − µ_x µ_y

and the correlation coefficient is

ρ_xy = cov[x,y] / (σ_x σ_y)
Examples

[Figure: equal-probability-density contours in the (x,y) plane for ρ_xy = 0, ρ_xy = −0.8, and ρ_xy = 0.2.]

The correlation coefficient is limited to −1 ≤ ρ_xy ≤ 1.
Independent Variables

Two variables are independent if and only if

f(x,y) = P_x(x) P_y(y)

Then

cov[x,y] = E[xy] − µ_x µ_y
         = ∫∫ xy f(x,y) dx dy − µ_x µ_y
         = ∫∫ xy P_x(x) P_y(y) dx dy − µ_x µ_y
         = ∫ x P_x(x) dx ∫ y P_y(y) dy − µ_x µ_y
         = 0
Notation-cont.

If x, y are independent, then E[u(x)v(y)] = E[u(x)] E[v(y)] and V[x+y] = V[x] + V[y].

If x, y are not independent:

V[x+y] = E[(x+y)²] − (E[x+y])²
       = E[x²] + E[y²] + 2E[xy] − (E[x] + E[y])²
       = V[x] + V[y] + 2(E[xy] − E[x]E[y])
       = V[x] + V[y] + 2 cov[x,y]
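The last identity can be checked numerically. The sketch below draws correlated pairs; the model y = 0.5x + noise is an arbitrary choice for illustration:

```python
import random

random.seed(1)
n = 100_000
# Correlated pair: y depends on x, so cov[x, y] != 0
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [0.5 * x + random.gauss(0, 1) for x in xs]

def mean(v):
    return sum(v) / len(v)

mx, my = mean(xs), mean(ys)
var_x = mean([(x - mx)**2 for x in xs])
var_y = mean([(y - my)**2 for y in ys])
cov_xy = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

s = [x + y for x, y in zip(xs, ys)]
ms = mean(s)
var_sum = mean([(v - ms)**2 for v in s])

# V[x+y] = V[x] + V[y] + 2 cov[x,y]
print(round(var_sum, 3), round(var_x + var_y + 2 * cov_xy, 3))
```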
Binomial Distribution

Bernoulli process: a random process with exactly two possible outcomes which occur with fixed probabilities (e.g., flip a coin: heads or tails; particle recorded / not recorded; ...). The probabilities come from a symmetry argument or other information.

Definitions:
- p is the probability of a success (heads, detection of a particle, ...), 0 ≤ p ≤ 1
- N independent trials (flips of the coin, number of particles crossing the detector, ...)
- r is the number of successes (heads, observed particles, ...), 0 ≤ r ≤ N

Then the probability of r successes in N trials is

P(r|N,p) = N! / [r!(N−r)!] · p^r q^(N−r),   where q = 1 − p

and N!/[r!(N−r)!] is the number of combinations, the binomial coefficient.
Derivation: Binomial Coefficient

The number of ways to order N distinct objects is N! = N(N−1)(N−2)···1: N choices for the first position, then (N−1) for the second, then (N−2), ...

Now suppose we don't have N distinct objects, but subsets of identical objects. E.g., in flipping a coin, there are two subsets (tails and heads). Within a subset, the objects are indistinguishable, so for the i-th subset the n_i! orderings are all equivalent. The number of distinct combinations is then

N! / (n_1! n_2! ··· n_m!),   where Σ_i n_i = N

For the binomial case there are two subclasses (success and failure, heads or tails, ...). The combinatorial coefficient is therefore

(N choose r) = N! / [r!(N−r)!]
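A minimal Python sketch of the resulting distribution, checking that the probabilities sum to 1 and that the mean is Np (the values N = 15, p = 0.8 are illustrative):

```python
from math import comb

def binomial_pmf(r, N, p):
    """P(r | N, p) = C(N, r) p^r (1-p)^(N-r)"""
    return comb(N, r) * p**r * (1 - p)**(N - r)

# Sanity checks: probabilities sum to 1, mean is N*p
N, p = 15, 0.8
total = sum(binomial_pmf(r, N, p) for r in range(N + 1))
mean = sum(r * binomial_pmf(r, N, p) for r in range(N + 1))
print(round(total, 6), round(mean, 6))  # 1.0 and 12.0
```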
Binomial Distribution-cont.

[Figure: binomial distributions for p=0.5 with N=4, 5, 15, 50; p=0.1 with N=5, 15; and p=0.8 with N=5, 15.]

E[r] = Np
V[r] = Np(1−p)

Notes:
- for large N and p near 0.5, the distribution is approximately symmetric
- for p near 0 or 1, the variance is reduced
Example

You are designing a particle tracking system and require at least three measurements of the position of the particle along its trajectory to determine the track parameters. You know that each detector element has an efficiency of 95%. How many detector elements would have to see the track to have a 99% reconstruction efficiency?

Solution: we are happy with 3 or more hits, so we need

P(r ≥ 3|N,p) = Σ_{r=3}^{N} P(r|N,p) > 0.99
Example-cont.

N = 3:  f(3; 3, 0.95) = [3!/(3!0!)] (0.95)³ (0.05)⁰ = (0.95)³ = 0.857

N = 4:  f(3; 4, 0.95) = [4!/(3!1!)] (0.95)³ (0.05)¹ = 4 (0.95)³ (0.05) = 0.171
        f(4; 4, 0.95) = [4!/(4!0!)] (0.95)⁴ = (0.95)⁴ = 0.815
        sum: 0.986

N = 5:  f(3; 5, 0.95) = [5!/(3!2!)] (0.95)³ (0.05)² = 10 (0.95)³ (0.05)² = 0.021
        f(4; 5, 0.95) = [5!/(4!1!)] (0.95)⁴ (0.05) = 5 (0.95)⁴ (0.05) = 0.204
        f(5; 5, 0.95) = [5!/(5!0!)] (0.95)⁵ = (0.95)⁵ = 0.774
        sum: 0.999

With 5 detector layers, we have a >99% chance of getting at least 3 hits.
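The same scan can be done numerically (a Python sketch):

```python
from math import comb

def p_at_least(r_min, N, p):
    """P(r >= r_min | N, p) for a binomial distribution."""
    return sum(comb(N, r) * p**r * (1 - p)**(N - r)
               for r in range(r_min, N + 1))

eff = 0.95
for N in range(3, 7):
    print(N, round(p_at_least(3, N, eff), 4))
# N = 4 gives about 0.986, N = 5 gives about 0.999:
# five layers are needed for > 99% reconstruction efficiency
```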