Modern Methods of Data Analysis - WS 07/08

Lecture V (12.11.07)
Contents:
- Central Limit Theorem
- Uncertainties: concepts, propagation and properties

Central Limit Theorem
Consider the sum $X = \sum_{i=1}^{n} x_i$ of n independent variables $x_i$, with i = 1,2,3,...,n, each taken from a different distribution with expectation value $\mu_i$ and variance $\sigma_i^2$. Then the distribution for X has the following properties:
- Its expectation value is $\langle X \rangle = \sum_i \mu_i$
- Its variance is $V(X) = \sum_i \sigma_i^2$
- It becomes Gaussian distributed for $n \to \infty$
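A minimal numerical sketch of the theorem, assuming the simplest case where all terms are drawn from the same uniform distribution on [0, 1) (so every term has mean 1/2 and variance 1/12). The function name `sum_of_uniforms`, the choice n = 12, and the seed are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def sum_of_uniforms(n, n_samples=100_000, seed=1):
    """Draw n_samples values of X = x_1 + ... + x_n, each x_i uniform on [0, 1)."""
    rng = np.random.default_rng(seed)
    return rng.random((n_samples, n)).sum(axis=1)

n = 12  # illustrative choice: gives expected variance n/12 = 1
X = sum_of_uniforms(n)

# Each uniform term has mean 1/2 and variance 1/12, so the sums add up to:
expected_mean = n * 0.5
expected_var = n / 12.0
print("mean:    ", X.mean(), "expected", expected_mean)
print("variance:", X.var(), "expected", expected_var)

# Fraction of sums within one standard deviation of the mean;
# an exact Gaussian would give about 0.683.
within_1sigma = np.mean(np.abs(X - expected_mean) < np.sqrt(expected_var))
print("within 1 sigma:", within_1sigma)
```

Changing n to small values (1, 2, 3) shows how quickly the shape approaches a Gaussian for this well-behaved input distribution.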

Uniform distribution

Uneven Distribution

Exponential Distribution

But...
For the CLT to work:
- No single source may contribute significantly to the overall variance (a known problem case: multiple scattering)
- Distributions with few events in large tails need many more statistics to converge
Unfortunately, the CLT doesn't work for many physics applications!

Application: Many repeated Measurements

m(B0) = 5279.63 ± 0.53 (stat) ± 0.33 (sys) MeV
CDF has a mass resolution of 16 MeV: the reconstructed mass of a single B meson is spread around the true B mass with σ = 16 MeV.
The B mass can nevertheless be measured with far better precision.
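A rough sketch of why many repeated measurements help, assuming the statistical error on the mean scales as σ/√N and ignoring background and systematic effects. The helper names are hypothetical, and the resulting event count is only an order-of-magnitude illustration, not the actual size of the CDF sample.

```python
import math

def error_on_mean(single_sigma, n):
    """Statistical error on the mean of n independent measurements."""
    return single_sigma / math.sqrt(n)

def events_needed(single_sigma, target_sigma):
    """Smallest n for which single_sigma / sqrt(n) <= target_sigma."""
    return math.ceil((single_sigma / target_sigma) ** 2)

# Numbers from the slide: 16 MeV resolution, 0.53 MeV statistical error.
n = events_needed(16.0, 0.53)
print(n, "events give an error on the mean of", error_on_mean(16.0, n), "MeV")
```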

Law of large numbers...

Weighted Mean (I)
Combining measurements with different uncertainties: two measurements of the speed of a car:
v1 = 67 ± 4 m/s
v2 = 62 ± 2 m/s
The uncertainty on the simple mean is larger than the smaller single uncertainty...???

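Weighted Mean (II)
To reach σ = 2 m/s, 4 single measurements with σ = 4 m/s are needed. Therefore the measurement with σ = 2 m/s should get 4 times the weight!
More general: $\bar{x} = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}$, $\sigma_{\bar{x}}^2 = \frac{1}{\sum_i 1/\sigma_i^2}$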
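The car-speed example can be checked numerically. This sketch weights each measurement with 1/σ² and compares the result with the simple unweighted mean; the function names are illustrative.

```python
import math

def simple_mean(values, sigmas):
    """Unweighted mean and its propagated uncertainty."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum(s ** 2 for s in sigmas)) / n
    return mean, sigma

def weighted_mean(values, sigmas):
    """Mean with weights 1/sigma_i^2 and its uncertainty."""
    weights = [1.0 / s ** 2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, math.sqrt(1.0 / total)

values, sigmas = [67.0, 62.0], [4.0, 2.0]  # v1, v2 in m/s
m_simple, s_simple = simple_mean(values, sigmas)
m_weighted, s_weighted = weighted_mean(values, sigmas)
print(f"simple:   {m_simple:.1f} +- {s_simple:.1f} m/s")
print(f"weighted: {m_weighted:.1f} +- {s_weighted:.1f} m/s")
```

The simple mean comes out as 64.5 ± 2.2 m/s, worse than v2 alone, while the weighted mean gives 63.0 ± 1.8 m/s.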

Weighted Mean (III) v1 = 67 ± 4 m/s v2 = 53 ± 2 m/s????

New Scientist, 31 March 1988

Neutron Lifetime (PDG 2006)

Particle Data Group: World Average
1) Calculate the weighted mean $\bar{x}$
2) Calculate $\chi^2 = \sum_i w_i (\bar{x} - x_i)^2$ with $w_i = 1/\sigma_i^2$; for N measurements there are 3 cases to consider:
- $\chi^2/(N-1) < 1$: all fine, the simple weighted mean is OK
- $\chi^2/(N-1) \gg 1$: depending on the reason, either make no average at all, or quote the calculated average and make an educated guess of the error, taking into account known problems with the data
- $\chi^2/(N-1) > 1$: errors on some or all measurements may have been underestimated; scale all of them with $S = \sqrt{\chi^2/(N-1)}$. To compute S, reject the measurements with larger uncertainties
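A simplified sketch of the scale-factor recipe, applied to the two inconsistent car-speed measurements from the Weighted Mean (III) slide. It assumes weights 1/σ² and S = sqrt(χ²/(N-1)), and it deliberately omits the refinement of rejecting measurements with larger uncertainties when computing S. The function name `scaled_average` is hypothetical.

```python
import math

def scaled_average(values, sigmas):
    """Weighted mean with its error inflated by S when chi2/(N-1) > 1.

    Needs at least two measurements. Returns (mean, error, chi2, S).
    """
    weights = [1.0 / s ** 2 for s in sigmas]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    sigma = math.sqrt(1.0 / total)
    chi2 = sum(w * (v - mean) ** 2 for w, v in zip(weights, values))
    ndf = len(values) - 1
    # Only inflate the error; never shrink it when chi2/ndf <= 1.
    scale = math.sqrt(chi2 / ndf) if chi2 > ndf else 1.0
    return mean, sigma * scale, chi2, scale

mean, error, chi2, scale = scaled_average([67.0, 53.0], [4.0, 2.0])
print(f"mean = {mean:.1f}, chi2 = {chi2:.1f}, S = {scale:.2f}, scaled error = {error:.1f}")
```

For these inputs χ² is 9.8 for one degree of freedom, so the unscaled error of 1.8 m/s is inflated to 5.6 m/s.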

Ideogram http://pdg.lbl.gov/2007/reviews/textrpp.pdf

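Reminder: Covariance
The covariance of two variables x, y is defined as: $\mathrm{cov}(x,y) = \langle (x - \mu_x)(y - \mu_y) \rangle = \langle xy \rangle - \langle x \rangle \langle y \rangle$
If two variables are uncorrelated, cov(x,y) = 0; conversely, if cov(x,y) = 0, x and y are uncorrelated.
The correlation is defined as: $\rho(x,y) = \mathrm{cov}(x,y)/(\sigma_x \sigma_y)$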

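Properties Correlations
Example of correlations for two random variables x and y.
The covariance V(x,y) or cov(x,y) can be represented by a matrix (often called the error matrix):
$V = \begin{pmatrix} \sigma_x^2 & \mathrm{cov}(x,y) \\ \mathrm{cov}(x,y) & \sigma_y^2 \end{pmatrix}$
General case of correlations for n random variables: there is a covariance between each pair of variables, $V_{ij} = \mathrm{cov}(x_i, x_j)$.
Analogously, the correlation matrix is defined as $\rho_{ij} = V_{ij}/(\sigma_i \sigma_j)$.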
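As a small illustration, a covariance (error) matrix can be converted to a correlation matrix by dividing each element by the product of the two standard deviations. The numerical matrix below is a made-up example, and the function name is hypothetical.

```python
import numpy as np

def correlation_from_covariance(V):
    """Return rho_ij = V_ij / (sigma_i * sigma_j) for a covariance matrix V."""
    V = np.asarray(V, dtype=float)
    sigma = np.sqrt(np.diag(V))  # standard deviations from the diagonal
    return V / np.outer(sigma, sigma)

# Hypothetical error matrix: sigma_x = 2, sigma_y = 1, cov(x, y) = 1.2
V = np.array([[4.0, 1.2],
              [1.2, 1.0]])
rho = correlation_from_covariance(V)
print(rho)
```

The diagonal of the result is 1 by construction, and the off-diagonal element here is 0.6.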

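Error Propagation (I)
$x = (x_1, ..., x_n)$ with $V_{ij}$ and $\mu_i$ known; y(x) is a function of x.
First-order Taylor expansion around the mean: $y(x) \approx y(\mu) + \sum_i \frac{\partial y}{\partial x_i}\Big|_{x=\mu} (x_i - \mu_i)$ ...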

Error Propagation (II)

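Gaussian error propagation
Error estimate for functions of several correlated variables:
$\sigma_y^2 = \sum_i \left(\frac{\partial y}{\partial x_i}\right)^2 \sigma_i^2 + \sum_{i \neq j} \frac{\partial y}{\partial x_i} \frac{\partial y}{\partial x_j} V_{ij}$
The first sum gives the normal errors for uncorrelated variables; the second sum contains the additional terms accounting for correlations.
Special case, uncorrelated variables: $\sigma_y^2 = \sum_i \left(\frac{\partial y}{\partial x_i}\right)^2 \sigma_i^2$
This is called Gaussian error propagation; however, it has nothing to do with Gaussian distributions.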
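A short sketch of first-order error propagation with and without the correlation terms, written as the quadratic form of the gradient with the covariance matrix. The function y = x1 * x2, the mean values, and the covariance matrix are made-up illustrations.

```python
import numpy as np

def propagate(grad, V):
    """First-order variance of y: sum_ij (dy/dx_i) (dy/dx_j) V_ij."""
    grad = np.asarray(grad, dtype=float)
    V = np.asarray(V, dtype=float)
    return float(grad @ V @ grad)

# Hypothetical example: y = x1 * x2 evaluated at the mean values mu.
mu = np.array([2.0, 3.0])
grad = np.array([mu[1], mu[0]])  # (dy/dx1, dy/dx2) = (x2, x1)
V = np.array([[0.04, 0.01],
              [0.01, 0.09]])

var_full = propagate(grad, V)                       # includes correlation terms
var_uncorr = propagate(grad, np.diag(np.diag(V)))   # diagonal terms only
print("with correlations:   ", np.sqrt(var_full))
print("without correlations:", np.sqrt(var_uncorr))
```

With these numbers the positive covariance increases the variance of y from 0.72 to 0.84; a negative covariance would reduce it.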

And the same in more dimensions (A is the Jacobi matrix): $V_y = A V_x A^T$ with $A_{ij} = \partial y_i/\partial x_j$

Example: Track Parametrization (I)
A typical tracking chamber measures in 3D and, due to its symmetry, uses cylindrical coordinates (r,φ,z).
We want to know the uncertainties in Cartesian coordinates (x,y,z): x = r cosφ; y = r sinφ
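A possible sketch of this transformation for the (r, φ) part only (z is unchanged), using the Jacobian of x = r cos φ, y = r sin φ and the matrix form V' = A V A^T. The numerical values for r, φ and the uncertainties are invented for illustration, and the function name is hypothetical.

```python
import numpy as np

def polar_to_cartesian_cov(r, phi, V_rphi):
    """Propagate a 2x2 covariance in (r, phi) to (x, y) to first order."""
    # Jacobian A_ij = d(x, y)_i / d(r, phi)_j
    A = np.array([[np.cos(phi), -r * np.sin(phi)],
                  [np.sin(phi),  r * np.cos(phi)]])
    return A @ np.asarray(V_rphi, dtype=float) @ A.T

# Hypothetical measurement: sigma_r = 0.01, sigma_phi = 0.002, uncorrelated
V_rphi = np.diag([0.01 ** 2, 0.002 ** 2])
V_xy = polar_to_cartesian_cov(r=50.0, phi=0.0, V_rphi=V_rphi)
print(V_xy)
```

At φ = 0 the x uncertainty is just σ_r and the y uncertainty is r σ_φ; for other angles the (x, y) covariance acquires off-diagonal terms even though r and φ are uncorrelated.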

Example: Track Parametrization(II)

Exercise: x,y are two random variables with the correlation matrix V: a = 3/7 x +1/7 y; b = 1/7 x - 2/7y; Give the correlation matrix of a,b (note this time it is not an approximation...)

Orthogonal Transformation
For n variables one can always find a linear transformation such that the covariance matrix of the new set of variables is diagonal.
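One way to find such a transformation numerically is an eigen-decomposition of the symmetric covariance matrix. The sketch below uses `numpy.linalg.eigh`; the covariance matrix is a made-up example.

```python
import numpy as np

# Hypothetical covariance matrix of two correlated variables
V = np.array([[4.0, 1.2],
              [1.2, 1.0]])

# For a symmetric matrix, eigh returns real eigenvalues and an orthogonal
# matrix U whose columns are the eigenvectors.
eigvals, U = np.linalg.eigh(V)

# New variables u = U^T x have the covariance matrix U^T V U.
V_new = U.T @ V @ U
print(np.round(V_new, 12))
```

The off-diagonal elements of `V_new` vanish up to rounding, and its diagonal holds the variances of the new, uncorrelated variables.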

Exercise: Some standard formulae

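Example: Measure Asymmetry (I)
Define A = (F-B)/(F+B), an asymmetry of events in the forward and backward hemispheres of the detector (e.g. at LEP). Here F (B) is the number of events in the forward (backward) hemisphere.
Case I: if the events in the forward and backward hemispheres are uncorrelated, then
$\sigma_A^2 = \left(\frac{\partial A}{\partial F}\right)^2 \sigma_F^2 + \left(\frac{\partial A}{\partial B}\right)^2 \sigma_B^2 = \frac{4}{(F+B)^4} \left(B^2 \sigma_F^2 + F^2 \sigma_B^2\right)$
If F and B are individually Poisson distributed ($\sigma_F^2 = F$, $\sigma_B^2 = B$), then
$\sigma_A^2 = \frac{4FB}{(F+B)^3}$
so the errors are dominated by the smaller counting rate.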

Example: Measure Asymmetry (II)
Case II: events in the forward and backward hemispheres are completely anti-correlated, so N = F + B is fixed.
- The distributions of F and B are then given by the binomial distribution. Let p be the probability that an event is in the forward hemisphere: $\sigma_F^2 = N p (1-p)$ and $A = 2F/N - 1$, so $\sigma_A^2 = 4\sigma_F^2/N^2 = 4p(1-p)/N = 4FB/N^3$ (estimating p by F/N)
- This is the same expression as before, which demonstrates the relationship between Poisson and binomial:
- either a Poisson probability of obtaining N events altogether times a binomial probability of having F events in the forward hemisphere
- or two independent Poisson probabilities for the numbers of backward and forward events
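The agreement between the two cases can be checked numerically. This sketch computes the error on A once by first-order propagation of two independent Poisson counts and once from the binomial variance with N fixed and p estimated by F/N. The event counts are invented for illustration.

```python
import math

def asymmetry(F, B):
    return (F - B) / (F + B)

def sigma_A_poisson(F, B):
    """Case I: F and B independent, Poisson variances F and B."""
    N = F + B
    dA_dF = 2.0 * B / N ** 2
    dA_dB = -2.0 * F / N ** 2
    return math.sqrt(dA_dF ** 2 * F + dA_dB ** 2 * B)

def sigma_A_binomial(F, B):
    """Case II: N fixed, F binomial with p estimated by F/N, A = 2F/N - 1."""
    N = F + B
    p = F / N
    sigma_F = math.sqrt(N * p * (1.0 - p))
    return 2.0 * sigma_F / N

F, B = 600, 400  # hypothetical event counts
print(asymmetry(F, B), sigma_A_poisson(F, B), sigma_A_binomial(F, B))
```

Both functions return the same value, equal to sqrt(4FB/N³).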

Repetition: Histogram
Interpretation of bins. A histogram can be equivalently regarded as:
1. A Poisson distribution in the overall number of events N and the corresponding multinomial distribution of obtaining events in each bin
2. An independent Poisson distribution of the number of entries in each bin of the histogram.

Exercise: Detector Efficiency
Compute the error on measuring the detector efficiency in two different ways:
- Binomial distribution; p: probability to detect a traversing particle, N: number of events
- Using error propagation, with the number of detected events and the number of undetected events treated as independent variables
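One way to check a solution to this exercise numerically is sketched below. It assumes the efficiency is estimated as detected / (detected + missed), that the binomial p is estimated by that efficiency, and that the two counts carry independent Poisson errors in the error-propagation case. The names `n_det` and `n_miss` and the counts are hypothetical.

```python
import math

def efficiency_error_binomial(n_det, n_miss):
    """Binomial: sigma_eps = sqrt(eps * (1 - eps) / N), with p estimated by eps."""
    N = n_det + n_miss
    eps = n_det / N
    return eps, math.sqrt(eps * (1.0 - eps) / N)

def efficiency_error_propagation(n_det, n_miss):
    """Error propagation with independent Poisson errors on both counts."""
    N = n_det + n_miss
    eps = n_det / N
    d_det = n_miss / N ** 2    # d(eps)/d(n_det)
    d_miss = -n_det / N ** 2   # d(eps)/d(n_miss)
    return eps, math.sqrt(d_det ** 2 * n_det + d_miss ** 2 * n_miss)

# Hypothetical counts: 90 detected, 10 missed
print(efficiency_error_binomial(90, 10))
print(efficiency_error_propagation(90, 10))
```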

Be aware...
The approximation using the Taylor expansion breaks down if the function is significantly non-linear in the region ± 1σ around the mean value.
Example: momentum estimate in a B field, p ~ 1/κ: 10 % momentum bias!

Failure of (1st order) error propagation
An experiment had data to look for a non-zero mass of the electron neutrino. A quantity R was measured. The details are not needed; it is sufficient to know that a, b, c, d, e are measured quantities and K is a constant.
If R < 0.42, then the neutrino must have mass. The experiment measured R = 0.165 and found with error propagation σ(R) = 0.073 -> 3σ evidence for neutrino mass!
However, the formula is highly non-linear... 1st order error propagation is not applicable!

What to do instead?
Use MC methods! Throw Gaussian distributed values for a, b, c, d, e and compute R; repeat the toy many times and check how often you are below the measured value of R. In the example this happens in 4% of the cases. This is the so-called p-value of this result.
In many physics analyses, simple Gaussian error propagation is not valid or too complicated (e.g. highly non-linear functions, many correlated variables).
The p-value MC method always works!
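A generic sketch of such a toy Monte Carlo is given below. Since the formula for R is not reproduced here, a hypothetical non-linear function `ratio(a, b) = a / b` with invented central values and uncertainties stands in for it; the function name `toy_fraction_below`, the inputs, and the seed are all assumptions for illustration. The inputs are thrown as independent Gaussians.

```python
import numpy as np

def toy_fraction_below(func, means, sigmas, threshold, n_toys=100_000, seed=1):
    """Fraction of toy experiments in which func(...) falls below threshold.

    Each toy draws every input from an independent Gaussian with the given
    mean and sigma, then evaluates func on the drawn values.
    """
    rng = np.random.default_rng(seed)
    toys = rng.normal(means, sigmas, size=(n_toys, len(means)))
    values = func(*toys.T)  # func must accept NumPy arrays
    return float(np.mean(values < threshold))

def ratio(a, b):
    """Hypothetical non-linear stand-in for the measured quantity."""
    return a / b

# Invented inputs: a = 1.0 +- 0.1, b = 2.0 +- 0.4; threshold is illustrative.
fraction = toy_fraction_below(ratio, means=[1.0, 2.0], sigmas=[0.1, 0.4], threshold=0.4)
print("fraction of toys below threshold:", fraction)
```

Because the function is evaluated exactly for every toy, no linearisation is involved, which is why this approach remains usable when first-order propagation fails.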