Data Fitting - Lecture 6


1 Central limit theorem

Recall that for a sequence of independent random variables, X_i, one can define a mean, µ_i, and a variance, σ_i². The sum of the means is then also a mean, and the sum of the variances is also a variance. This is true for any distribution, provided the X_i are independent and the means and variances are mathematically reasonable. The central limit theorem states how the sum is distributed as N → ∞:

lim_{N→∞} (S − ∑_{i=1}^{N} µ_i) / √(∑_{i=1}^{N} σ_i²)  →  P_N(0, 1)

In the above, S = ∑_{i=1}^{N} X_i, and P_N(0, 1) is the normal distribution with mean 0 and variance 1. This means that no matter what the original distribution of the variables, the mean of the means is normally distributed for a sufficiently large sample. The theorem holds in the general case, but here assume that µ_i = µ and σ_i = σ. Suppose a set of random variates, X_i, with expectation values µ and variances σ². Define:

Y_N = ∑ X_i / N
Z_N = (Y_N − µ) / (σ/√N)

The expectation value and variance of Z_N are:

E(Z_N) = 0
V(Z_N) = 1

Then we find that:

lim_{N→∞} P(Z_N < a) = (1/√(2π)) ∫_{−∞}^{a} dx e^{−x²/2}

For example, it is generally accepted that a χ² distribution with N ≳ 30 is reasonably normal. To assess the normality of a distribution, moments higher than 2 may be important.
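To make the statement concrete, the following short Python sketch (an illustration added here, not part of the original notes) draws samples from a decidedly non-normal parent distribution (exponential), forms Z_N, and compares its moments with the standard normal values.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

N = 50           # number of variables summed per trial
trials = 100000  # number of independent Z_N values generated

# Exponential parent distribution: mean mu = 1, variance sigma^2 = 1.
mu, sigma = 1.0, 1.0

X = rng.exponential(scale=1.0, size=(trials, N))
Y_N = X.mean(axis=1)                      # Y_N = sum(X_i)/N
Z_N = (Y_N - mu) / (sigma / np.sqrt(N))

# For large N, Z_N should approach P_N(0,1): mean 0, variance 1,
# and P(Z_N < 1) should approach 0.8413.
print("mean(Z_N)  =", Z_N.mean())
print("var(Z_N)   =", Z_N.var())
print("P(Z_N < 1) =", np.mean(Z_N < 1.0))
```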

2 Example

As previously mentioned, any Monte Carlo simulation requires the generation of a series of random numbers. These are usually distributed uniformly between 0 and 1, but numbers distributed according to other probability density functions are also useful. Using the central limit theorem one can produce a random number generator distributed according to the normal probability distribution. Choose the i-th number from a uniform generator to represent µ_i. Then for a sequence of N random numbers between 0 and 1, form:

g = (∑_{i=1}^{N} µ_i − N/2) / √(N/12)

In the above we have used that the average of a flat distribution is µ_i = 1/2 with a variance σ² = 1/12. (A uniform distribution for a ≤ X ≤ b has E(X) = (a + b)/2 and V(X) = E(X²) − µ² = (b − a)²/12.) It is clear that the normal distribution P_N(0, 1) is generated as N → ∞. In practice, g is close to normal when N ≳ 10, but the tails of the distribution are cut off with respect to the normal distribution.
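A minimal sketch of this generator (my own illustration; N = 12 is chosen so the denominator √(N/12) is 1, and the function name is mine):

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def clt_normal(n_samples, N=12):
    """Approximate N(0,1) deviates from sums of N uniform numbers.

    g = (sum of N uniforms - N/2) / sqrt(N/12).  With N = 12 the
    denominator is 1, and the tails are truncated at +-6.
    """
    u = rng.random(size=(n_samples, N))
    return (u.sum(axis=1) - N / 2.0) / np.sqrt(N / 12.0)

g = clt_normal(100000)
print("mean =", g.mean(), " variance =", g.var())
print("largest |g| =", np.abs(g).max())   # never exceeds 6 for N = 12
```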

3 Covariance

Suppose a normal distribution defined by the two random variables x and y. Bayes' theorem can be used to write:

P_x(x|y) = P(x ∩ y) / P_y(y)
P_y(y|x) = P(x ∩ y) / P_x(x)

where we have previously shown that if the probabilities are independent:

P_x(x|y) = P_x(x);   P_y(y|x) = P_y(y)

The meaning of the intersection of the probability sets A and B is shown in Figure 1.

Figure 1: The intersection of probability distributions

The expectation value (mean) is the 1st moment:

E(x) = ∫∫ dx dy x P(x ∩ y)

and the expectation of a general function f(x, y) is:

E(f) = ∫∫ dx dy f(x, y) P(x ∩ y)

The variance is defined as (writing x_1 = x and x_2 = y):

σ²(f) = E[(f − E(f))²] = ∫∫ dx dy [f(x, y) − E(f)]² P(x ∩ y)

If the variables are independent, P(x ∩ y) = P_x(x) P_y(y), and the double integral separates into the product of an integral over x and an integral over y:

E(x) = ∫∫ dx dy x f(x, y)
E(y) = ∫∫ dx dy y f(x, y)
σ_x² = E([x − E(x)]²)
σ_y² = E([y − E(y)]²)

Thus, for example, the probability P(x ∩ y) for independent normal variables is the product of the two normal probability distributions in x and y, which is also a normal distribution.

4 Covariance

A joint probability distribution of two random variables has a covariance defined by:

cov(x, y) = E([x − E(x)][y − E(y)]) = E(xy) − E(x)E(y)

If the random variables are mutually independent then their joint probability density factors, f(x, y) = f_1(x) f_2(y). This results in the expectation value:

E(xy) = ∫ dx f_1(x) x ∫ dy f_2(y) y = E(x)E(y)

Thus the covariance vanishes for independent variables. However, the converse is not necessarily true; that is, if the covariance vanishes the variables are not necessarily independent. The correlation coefficient is defined by:

corr(x, y) = ρ(x, y) = cov(x, y) / (σ_x σ_y)

Covariance can be defined between any two variables of a many-variable system. A correlated error would develop if, for example, the density were measured as a function of temperature and pressure. For a given density, temperature and pressure are correlated by the equation of state. Thus for the density ρ:

δρ = (∂ρ/∂T) dT + (∂ρ/∂P) dP
(δρ)² = (∂ρ/∂T)² dT² + (∂ρ/∂P)² dP² + 2 (∂ρ/∂T)(∂ρ/∂P) dT dP

Thus correlations are present, and if the equation connecting the variables is not linear then we must approximate the errors by the first-order differential terms, i.e. we assume the errors are small.

As an example, the table shown in Figure 2 finds the correlation between two selections of random numbers. These are given in the first two columns. The next columns give the deviations from the average values, and the last column gives the covariance. The average is divided by N − 1 instead of N since one degree of freedom has been used to find the mean. The correlation coefficient is obtained from the covariance by dividing by the standard deviations, as illustrated above. The results from the table are:

ρ(x, y) = [∑ (x − µ_x)(y − µ_y)/9] / (σ_x σ_y) = 0.00

As ρ ≈ 0, the data do not appear correlated. The correlation coefficient ρ is bounded by ±1. Note that the covariance of a random variable with itself equals the variance of that variable.
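A short sketch of the correlation computation just described (my own illustration: two sequences of 10 random numbers, with the sums divided by N − 1):

```python
import numpy as np

rng = np.random.default_rng(seed=3)
N = 10
x = rng.random(N)
y = rng.random(N)

dx = x - x.mean()
dy = y - y.mean()

# Sample variances and covariance, dividing by N - 1 (one degree of
# freedom is used up in finding the mean).
var_x = np.sum(dx**2) / (N - 1)
var_y = np.sum(dy**2) / (N - 1)
cov_xy = np.sum(dx * dy) / (N - 1)

rho = cov_xy / np.sqrt(var_x * var_y)    # correlation coefficient
print("cov(x, y) =", cov_xy)
print("rho(x, y) =", rho)                # near 0 for independent sequences
```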

Figure 2: The correlation between two sets of 10 random numbers

5 Error for a multi-variate system

Suppose two independent, normal distributions of the variables x and y. The probabilities about the means are:

P(x) = [1/(√(2π) σ_x)] e^{−x²/(2σ_x²)}
P(y) = [1/(√(2π) σ_y)] e^{−y²/(2σ_y²)}

Because the variables are independent:

P(x ∩ y) = [1/(2π σ_x σ_y)] exp[ −(1/2)(x/σ_x)² − (1/2)(y/σ_y)² ]

Then, as a matter of notation, write:

M = ( 1/σ_x²    0
      0         1/σ_y² )

For the e^{−1/2} error contour in the (x, y) plane:

( x  y ) M ( x  y )ᵀ = 1

Multiplication gives the equation:

(x/σ_x)² + (y/σ_y)² = 1

This generates an ellipse with axes along x and y in the (x, y) plane. The inverse of the matrix satisfies M M⁻¹ = I:

M⁻¹ = ( σ_x²   0
        0      σ_y² )

Here M⁻¹ is the error matrix. It is diagonal because x and y are independent. Off-diagonal terms indicate error correlations. In general, an element of the error matrix due to the variables x_i is connected to the previous notation by identifying x_1 = x, x_2 = y:

(M⁻¹)_ij = E([x_i − E(x_i)][x_j − E(x_j)])

The error matrix is symmetric, and the off-diagonal terms are given by the covariance:

cov(x_i, x_j) = E(x_i x_j) − E(x_i)E(x_j)
E(x_i x_j) = ∫∫ dx dy P(x_i ∩ x_j) x_i x_j

Although the above development assumed a normal distribution, the error matrix formulation is defined for any probability distribution. However, the development is based on the normal distribution, and assumes that the measured function is linear or that the errors are small.

5.1 Example

First suppose (x, y) are independent with:

σ_x² = 1/2
σ_y² = 1/8

The error matrix is diagonal, as shown in Figure 3. Suppose we rotate the axes by an angle θ. The rotation matrix is:

( cos(θ)   −sin(θ)
  sin(θ)    cos(θ) )

This produces the coordinates x′ and y′:

( x′ )   ( x cos(θ) − y sin(θ) )
( y′ ) = ( x sin(θ) + y cos(θ) )

Figure 3: An example of uncorrelated and correlated error in the variables

To continue the above example, let cos(θ) = 1/2. Substituting into the error-contour equation for the new coordinates:

x′² [cos²(θ)/σ_x² + sin²(θ)/σ_y²] + y′² [sin²(θ)/σ_x² + cos²(θ)/σ_y²] + 2 x′ y′ cos(θ) sin(θ) [1/σ_x² − 1/σ_y²] = 1

(1/2)[13 x′² − 6√3 x′ y′ + 7 y′²] = 1

The probability distribution has not changed, but the error axes are rotated and there is a correlation. The matrix equation is:

( x′  y′ ) (  13/2     −3√3/2 ) ( x′ )
           ( −3√3/2     7/2   ) ( y′ )  = 1

The error matrix is the inverse of M:

σ_x′² = 14/64
σ_y′² = 26/64
cov(x′, y′) = 6√3/64

M⁻¹ = (2/64) (  7     3√3
               3√3    13  )

This is shown in Figure 3. If the covariance were negative it would mean that increasing x′ implies decreasing y′.
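The rotated error matrix can be checked numerically by propagating the diagonal error matrix through the rotation, M′⁻¹ = R M⁻¹ Rᵀ. A small sketch (my own check of the numbers above, not part of the notes):

```python
import numpy as np

theta = np.arccos(0.5)                  # cos(theta) = 1/2
c, s = np.cos(theta), np.sin(theta)

R = np.array([[c, -s],
              [s,  c]])                 # rotation (x, y) -> (x', y')
V = np.diag([1.0 / 2.0, 1.0 / 8.0])     # diagonal error matrix M^-1

V_rot = R @ V @ R.T                     # error matrix in the rotated frame
print(V_rot)
# Expected: (2/64) * [[7, 3*sqrt(3)], [3*sqrt(3), 13]]
print((2.0 / 64.0) * np.array([[7, 3 * np.sqrt(3)],
                               [3 * np.sqrt(3), 13]]))
```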

The general form for a two-dimensional normal distribution is:

P(x, y) = [1/(2π σ_x σ_y √(1 − ρ²))] exp[ −(1/(2(1 − ρ²))) ( x²/σ_x² + y²/σ_y² − 2ρxy/(σ_x σ_y) ) ]

The correlation coefficient is ρ. Now extend this to k variables. The probability is:

P = [1/((2π)^{k/2} |M⁻¹|^{1/2})] exp[ −(1/2) xᵀ M x ]

In the above, M⁻¹ is the error matrix, x is the variable vector, and |M⁻¹| is its determinant. In the 2-D case:

M⁻¹ = ( σ_x²        cov(x, y)
        cov(x, y)   σ_y²      )

M = [1/(1 − ρ²)] (  1/σ_x²          −ρ/(σ_x σ_y)
                   −ρ/(σ_x σ_y)      1/σ_y²      )

Note that M M⁻¹ = 1 and that the factor 1/√(1 − ρ²) appears in the normalization. Correlations in multi-dimensions are difficult to handle and should be avoided if possible by working with independent variables.

6 Changing variables in the error matrix

The error matrix can be obtained by observing the measurement of a set of variables a number of times while holding all other variables fixed. If the variables are independent, the off-diagonal terms vanish. On the other hand, the error matrix can be manipulated into diagonal form, or in fact the error for any variable set can be found. Thus suppose we measure a function of the variables, y = y(x_1, x_2). The error in y is then:

δy = (∂y/∂x_1) δx_1 + (∂y/∂x_2) δx_2
δy² = (∂y/∂x_1)² δx_1² + (∂y/∂x_2)² δx_2² + 2 (∂y/∂x_1)(∂y/∂x_2) δx_1 δx_2

The expectation value E[(y − E(y))²] = ⟨δy²⟩. In matrix form:

δy² = ( ∂y/∂x_1  ∂y/∂x_2 ) ( δx_1²       δx_1 δx_2 ) ( ∂y/∂x_1 )
                           ( δx_1 δx_2   δx_2²     ) ( ∂y/∂x_2 )

The central matrix is the error matrix M⁻¹, and the vector D contains the derivatives of y with respect to the respective coordinates:

σ_y² = Dᵀ M⁻¹ D
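As a quick illustration of σ_y² = Dᵀ M⁻¹ D (my own example, using y = x_1 x_2 and an assumed correlated error matrix), the propagated variance can be compared with a direct Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(seed=4)

# Assumed means and error (covariance) matrix for (x1, x2).
mean = np.array([2.0, 3.0])
Minv = np.array([[0.04, 0.01],
                 [0.01, 0.09]])

# y = x1 * x2, so D = (dy/dx1, dy/dx2) = (x2, x1) evaluated at the means.
D = np.array([mean[1], mean[0]])
var_y_linear = D @ Minv @ D                 # sigma_y^2 = D^T M^-1 D

# Monte Carlo check: sample (x1, x2) from the correlated normal.
x = rng.multivariate_normal(mean, Minv, size=200000)
y = x[:, 0] * x[:, 1]
print("propagated sigma_y^2  =", var_y_linear)
print("Monte Carlo sigma_y^2 =", y.var())   # close, since the errors are small
```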

Note that we deal here with differential quantities approximated by differentials. If y(x_1, x_2) is not linear in the variables, the errors must be small or the development is inaccurate. Suppose we change variables from y to z. In matrix notation this is written:

( δy²      δy δz )   ( ∂y/∂x_1  ∂y/∂x_2 ) ( δx_1²       δx_1 δx_2 ) ( ∂y/∂x_1  ∂z/∂x_1 )
( δy δz    δz²   ) = ( ∂z/∂x_1  ∂z/∂x_2 ) ( δx_1 δx_2   δx_2²     ) ( ∂y/∂x_2  ∂z/∂x_2 )

That is, with T the transformation (Jacobian) matrix, the error matrix in the new variables is T M⁻¹ Tᵀ.

7 Example

Suppose we measure the coordinates (x, y) of a point and transform to polar coordinates:

r² = x² + y²
tan(θ) = y/x

Assume that (x, y) are independent. To find the (r, θ) error matrix, evaluate the transformed matrix:

( σ_r²        cov(r, θ) )   (  x/r     y/r  ) ( σ_x²   0    ) ( x/r    −y/r² )
( cov(r, θ)   σ_θ²      ) = ( −y/r²    x/r² ) ( 0      σ_y² ) ( y/r     x/r² )

The transformation matrix is not symmetric, which requires care with the ordering of the elements. Also, the transformation is not linear, so the errors must be small. The result is:

cov(r, θ) = (xy/r³)(σ_y² − σ_x²)

( σ_r²        cov(r, θ) )   ( (1/r²)(x² σ_x² + y² σ_y²)     (xy/r³)(σ_y² − σ_x²)      )
( cov(r, θ)   σ_θ²      ) = ( (xy/r³)(σ_y² − σ_x²)          (1/r⁴)(y² σ_x² + x² σ_y²) )

Although the normal distribution was used to develop this approach to error estimation, the normal distribution is not always valid. For example, the tails of the normal distribution fall rapidly, usually so rapidly that small effects can influence a fit to the data. Also, in counting experiments, particularly with few events, the binomial (or Poisson) distribution may give a better representation. However, one can introduce techniques to deal with distributions that have larger tails, or with points that lie well outside the normally expected errors (outliers). The subject of robust estimators (covered later) deals with these cases.
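A numerical sketch of the (x, y) → (r, θ) error propagation above (my own illustration with assumed values x = 3, y = 4, σ_x = 0.1, σ_y = 0.2), comparing T M⁻¹ Tᵀ with the analytic expression for the covariance:

```python
import numpy as np

x, y = 3.0, 4.0
sx2, sy2 = 0.1**2, 0.2**2          # assumed variances of x and y
r = np.hypot(x, y)

# Jacobian of (r, theta) with respect to (x, y).
T = np.array([[ x / r,     y / r    ],
              [-y / r**2,  x / r**2 ]])
Minv = np.diag([sx2, sy2])         # diagonal error matrix of (x, y)

V_rtheta = T @ Minv @ T.T          # error matrix of (r, theta)
print(V_rtheta)

# Analytic cross-check for the off-diagonal element.
print("cov(r, theta) =", x * y / r**3 * (sy2 - sx2))
```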

8 Modeling of data - General comments

Given a set of data, one usually wishes to summarize them by fitting to some model. This means adjusting the model's parameter set to agree with the data using some criterion. Modeling can be as simple as selecting a set of polynomials with appropriate coefficients that best fit the data, or by selecting parameters in a completed theory. Thus, in general:

1. One must obtain a set of optimum parameters that represent a set of functions.
2. One must obtain an error estimate for these parameters.
3. One must obtain an overall measure of the statistical goodness of fit of the model to the total data set.

To implement the fitting procedure one needs a figure of merit by which comparisons between different parameter sets can be made. This function measures the agreement between the data set and the model for a particular choice of parameters. We proceed by finding the minimum of the figure-of-merit function with respect to the parameters. Of course the data are not exact, and a model, even if correct, will not exactly fit the data points. Thus the goodness of fit is compared to a statistical standard. The most used test is based on χ², defined by:

χ² = ∑_data (observed value − expected value)² / (error in observed value)²

The calculation of χ² results in a χ² probability distribution, which allows a goodness of fit to be determined by the probability that on repetition different values would be found.

9 Least squares as a likelihood estimator

Suppose we fit N data points, X_i, to a model with M parameters, a_j. Using the model, the prediction is:

y(x) = y(x_i; a_j)

for all i and j. The figure-of-merit function is χ².

χ² = ∑_{i=1}^{N} [y_i − y(x_i; a_j)]²

It is not meaningful to ask "what is the probability that a set of parameters is correct?", because we do not draw the parameters from a statistically infinite number of models. There is only one model, and the data set is fitted to this model. We can determine, however, the probability that a data set occurs given the parameters, and then we must apply Bayes' theorem. Suppose we have a signal, F(ω, t), and a source of background noise, G(σ). We obtain a set of measurements, D_i, for a set of independent points, t_i:

D_i = F(ω, t_i) + G(σ(t_i))

The object is to obtain ω. Assume the noise is normally distributed:

P(D|ω) = (1/√(2π))^N [∏_i 1/σ_i] exp[ −∑_i Q_i ]

Q_i = (D_i − F(ω, t_i))² / (2σ_i²)

To reduce complexity, define:

(σ/σ_i) D_i → D_i
(σ/σ_i) F(ω, t_i) → F(ω, t_i)

P(D|ω, σ) = (1/√(2π))^N (1/σ)^N exp[ −∑_i Q_i ]

Q_i = (D_i − F(ω, t_i))² / (2σ²)

In this case, σ and ω are to be determined by the fit. The parameters ω may enter the probability in various forms. This probability is the likelihood, and it is the product of the separate probabilities of each data point, since they are independent:

P ∝ ∏_{i=1}^{N} exp( −[D_i − F_i(ω)]² / (2σ²) )

Maximizing the likelihood is the same as minimizing the negative of its logarithm, and this is the same as minimizing χ²:

χ² = ∑_{i=1}^{N} (D_i − F_i(ω))² / σ²

The least squares estimator is therefore a likelihood estimator if the measurement errors are independent and normally distributed.
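A small sketch (my own toy example with made-up numbers) showing this equivalence: for Gaussian errors, −2 ln L differs from χ² only by a parameter-independent constant, so scanning a parameter minimizes both at the same place.

```python
import numpy as np

# Toy data: D_i measured at points t_i with a common error sigma,
# assumed model F(omega, t) = omega * t.
t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
D = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma = 0.2

def chi2(omega):
    return np.sum((D - omega * t) ** 2 / sigma**2)

def minus2lnL(omega):
    # Gaussian likelihood: product of normal densities.
    lnL = np.sum(-0.5 * (D - omega * t) ** 2 / sigma**2
                 - np.log(np.sqrt(2 * np.pi) * sigma))
    return -2.0 * lnL

omegas = np.linspace(1.8, 2.2, 401)
best_chi2 = omegas[np.argmin([chi2(w) for w in omegas])]
best_like = omegas[np.argmin([minus2lnL(w) for w in omegas])]
print("omega from chi^2 scan      :", best_chi2)
print("omega from likelihood scan :", best_like)   # identical
```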

Recall that the normal distribution results from the sum of a large number of small deviations, as shown by the central limit theorem.

9.1 Example

Fit a straight line to data points as an example of the general technique. Each point, x_i, is associated with a data measurement, d_i, which has some error, σ_i. The model representing the data is assumed to have the functional form y = ax + b, where a and b are to be varied to find the best representation of the model to the data. At this point we begin to introduce the experimental uncertainty by applying Bayes' theorem:

P(M|D, σ) = P(D|M, σ) P(M|σ) / P(D|σ)

The denominator is a normalization, obtained by integration over all parameters, and is not so important. P(M|σ) is the prior probability of the model, M, and P(D|M, σ) is the likelihood of the data, D, given the model. We assume the likelihood is normally distributed and given by:

P(D|M, σ) ∝ (1/σ)^N exp[ −Q²/(2σ²) ]

In this case:

Q² = ∑_{i=1}^{N} [y_i − a x_i − b]²

where the errors, σ_i = σ, are constant. There are N data points, and the model parameters are a and b. Bayes' theorem then gives the probability of the model given the data and the error. Note that we can calculate the data if we have the model, P(D|M, σ), so Bayes' theorem inverts the probabilities, giving the result we seek. Now assume a uniform prior probability, P(M|σ) = constant, i.e. all models are equally probable. This means that P(M|D, σ) is proportional to the likelihood, so maximizing the likelihood maximizes the probability. Since P(D|M, σ) ∝ exp[−Q²/(2σ²)], the maximum with respect to each parameter requires:

∂Q²/∂a = 0,   ∂Q²/∂b = 0

This leads to the following simultaneous equations:

∑_i (y_i − a x_i − b) x_i = 0

∑_i (y_i − a x_i − b) = 0

Put in matrix form:

( ∑ x_i²   ∑ x_i ) ( a )   ( ∑ y_i x_i )
( ∑ x_i    N     ) ( b ) = ( ∑ y_i     )

As we intend to deal with a linear equation set, or will linearize the equations in the limit of small errors, the matrix formulation of the equations is the most efficient. The above equations are solved to obtain the best-fit values of the parameters a and b. In this case the equation is:

C A = Q   →   A = C⁻¹ Q
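A sketch of this solution in code (an illustration with fabricated data; the matrix C and vectors A, Q are named as in the text):

```python
import numpy as np

rng = np.random.default_rng(seed=5)

# Fabricated data scattered about a "true" line y = 2x + 1.
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)

N = x.size
C = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    N        ]])
Q = np.array([np.sum(y * x), np.sum(y)])

A = np.linalg.solve(C, Q)      # A = C^-1 Q, i.e. the pair (a, b)
print("a =", A[0], " b =", A[1])
```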

9.2 Generalization

Now suppose each data point has its own standard deviation, σ_i, and the model has the general form y = y(x_i; a_1, a_2, …, a_k), where the a_j are model parameters. This is equivalent to a least-squares ("chi-square") fit. To the extent that the data are normally distributed, the χ² function is a sum of N normally distributed functions that are properly normalized. Once fit, these functions are not statistically independent because of the constrained equation set, so that the number of independent equations is ν = N − L, where L is the number of fitted parameters; ν is called the number of degrees of freedom. The χ² of the fit is then distributed as:

P(χ²|a_i) ∝ (χ²)^{ν/2 − 1} exp[ −(1/2) χ² ]

9.3 Non-linear functions

We could attempt to solve non-linear equations by any mathematical technique. However, they are usually too complicated to solve directly, so one attempts to find a perturbation solution about a minimum in χ² space. Thus one looks for a linear expansion of the probability in terms of the parameters, and iterates this to convergence. The error in the fit is obtained from a calculation of the area remaining in the tail of the χ² distribution as a function of χ² for a specific number of degrees of freedom; see Figure 4.

Figure 4: Percentage error in the tail of χ² distributions

As an example, suppose a set of parameters which form the coordinates of the vector a. Then expand χ²(a) about its minimum value:

χ²(a) ≈ γ − d·a + (1/2) a·D·a

In the above, d is the vector with components d_i = −∂χ²/∂a_i evaluated at a = a_0, and D_ij = ∂²χ²/∂a_i ∂a_j. Thus:

∇χ² = D a − d

The above equation is designed so that ∇χ² = 0 at the minimum, where a = a_0. The perturbation equation is then:

a_p = a + D⁻¹ [−∇χ²(a)]

Then replace a → a_p in the above and iterate to convergence. Near the minimum the function χ² is parabolic, and a is changed so that the iteration follows the path of steepest descent along the parabola to the minimum.
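A compact sketch of this iteration (my own illustration for an assumed one-parameter exponential model, using numerical first and second derivatives of χ²; the starting value is taken near the minimum, as the expansion assumes):

```python
import numpy as np

rng = np.random.default_rng(seed=6)

# Fabricated data from an assumed model F(a, t) = exp(-a * t), a_true = 0.7.
t = np.linspace(0.1, 5.0, 25)
sigma = 0.02
D = np.exp(-0.7 * t) + rng.normal(0.0, sigma, size=t.size)

def chi2(a):
    return np.sum((D - np.exp(-a * t)) ** 2 / sigma**2)

# Newton iteration: a_p = a + D^-1 [-grad chi2], with numerical derivatives.
a, h = 0.6, 1e-4
for _ in range(20):
    grad = (chi2(a + h) - chi2(a - h)) / (2 * h)
    curv = (chi2(a + h) - 2 * chi2(a) + chi2(a - h)) / h**2
    step = -grad / curv
    a += step
    if abs(step) < 1e-8:
        break
print("fitted a =", a, " chi2 =", chi2(a), " for", t.size - 1, "dof")
```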

10 Maximum likelihood method

For some problems, the maximum likelihood method is easier to apply than the least-squares technique, but for a normal distribution the results are statistically equivalent. However, the likelihood method can be used with a general probability density function, and Bayes' theorem can be applied as a learning filter. It has the drawback of needing a normalized probability density that must be updated as the parameters change. We find that:

1. The maximum likelihood method uses events as they occur (see the Kalman filter described later).
2. The likelihood can be used with low statistics, where it is most efficient.
3. The likelihood and χ² methods are equivalent for a normal distribution.
4. The likelihood requires substantial computation, especially due to the need for renormalization of the probability density.
5. It is difficult to determine the errors when using the likelihood method.

11 Importance of Normalization of the Likelihood

Suppose a set of data consistent with the angular distribution:

y = dn/d(cos θ) = N [1 + (b/a) cos²(θ)]

In the above, N is the normalization factor which makes ∫_{−1}^{1} d(cos θ) y = 1:

N = 1 / (2[1 + b/(3a)])

Note that if y does not remain appropriately normalized, the maximum likelihood method does not work. For the i-th event we calculate:

y_i = N [1 + (b/a) cos²(θ_i)]

This is the probability density for the observation of that event. Obviously it depends on (b/a). Then apply Bayes' theorem:

P(w|D, M) = P(D|w, M) P(w|M) / P(D|M)

We identify the above as:

P(posterior) = P(likelihood) × P(prior) / P(normalization)

Thus we must maximize the likelihood based on the probability estimator (i.e. the probability density function, pdf). Note that the pdf does not need to be normally distributed. For the example here the likelihood, L, is a product of the y_i for each event of the sample. This represents the probability of independent events:

L(b/a) = ∏_i y_i

Note that the probability should not depend on the order of events in the product, so there should be a factor of N! multiplying the likelihood; however, constant factors are not important. It is only the factors of (b/a) that need to be considered. To obtain the result we maximize L (actually ln[L]) by varying the parameters. Therefore write:

l = ln[L] = ∑_{i=1}^{N} ln(y_i)

For a large number of observations (n → ∞), L tends to be normally distributed, at least near its maximum. Thus, by expansion:

l = l_max + (dl/dP) δP + (1/2) (d²l/dP²) δP² + …

At the maximum the second term, dl/dP, vanishes. If we assume

L = A exp[ −(P − P_0)²/(2c²) ]

then we identify l = ln(A) − δP²/(2c²) and d²l/dP² = −1/c². Then, if L is normally distributed, the standard deviation can be used for the error. On the other hand we could use [−∂²l/∂P²]^{−1/2}, or an average of this function over the measured range. For an element of the inverse error matrix we use:

M_ij = −∂²l/∂P_i ∂P_j

where the derivatives are evaluated at the maximum value of P.
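A sketch of the angular-distribution fit of this section (my own illustration with simulated events; the single parameter is β = b/a, and the pdf is renormalized at every trial value, as the text requires):

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Simulate cos(theta) from y = N [1 + beta cos^2(theta)], beta_true = 0.5,
# using simple rejection sampling on [-1, 1].
beta_true, n_events = 0.5, 5000
c = []
while len(c) < n_events:
    u = rng.uniform(-1.0, 1.0)
    if rng.uniform(0.0, 1.0 + beta_true) < 1.0 + beta_true * u**2:
        c.append(u)
c = np.array(c)

def log_likelihood(beta):
    norm = 2.0 * (1.0 + beta / 3.0)     # integral of 1 + beta cos^2 over [-1, 1]
    y = (1.0 + beta * c**2) / norm      # properly normalized pdf at each event
    return np.sum(np.log(y))

# Scan beta and take the maximum of ln L.
betas = np.linspace(0.0, 1.5, 301)
lnL = np.array([log_likelihood(b) for b in betas])
print("fitted b/a =", betas[np.argmax(lnL)])
```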

Then we see that:

1. The maximum likelihood method can use events as they occur, so one does not need data tables, and it can be used for a few events.
2. It is most efficient for low numbers of events.
3. The likelihood and least squares methods are equivalent when the probability and errors are normally distributed.
4. When constraints are to be imposed on the parameters, these are inserted by restricting the parameters to remain in the allowed regions. Lagrange multipliers can be used where necessary. However, problems may arise if the maximum probability is on or near a boundary.
5. It is always best to fit the data with the background included, without attempting to subtract out the background events.
6. The likelihood method requires substantial computation, especially due to the normalization requirements.
7. It is difficult to estimate errors using the likelihood method, and to determine how well the data are represented by the model.

12 Examples

We wish to determine the lifetime of a decaying particle. N events are observed, characterized by the flight path, l_i, from production to decay point. The experiment has an upper and a lower bound on the distance it can measure. The time as a function of distance is:

t_i = l_i / (β_i γ_i c)

In the above, βγc is the Lorentz boost for the particle. The limits are imposed as:

t_i(mx) = l_i(mx) / (β_i γ_i c)
t_i(mn) = l_i(mn) / (β_i γ_i c)

The observations are independent, so the likelihood is the product of the probabilities of each observation:

L = ∏_{i=1}^{N} (1/τ) e^{−t_i/τ} / ( e^{−t_i(mn)/τ} − e^{−t_i(mx)/τ} )

The value of τ is the lifetime to be determined, and the denominator of the above equation normalizes the probability. Then take the log of the likelihood:

ln[L] = ∑_{i=1}^{N} [ −t_i/τ − ln( e^{−t_i(mn)/τ} − e^{−t_i(mx)/τ} ) ] − N ln(τ)

The maximum likelihood estimate we seek, τ, is obtained from:

∂ln[L]/∂τ = (1/τ²) { ∑_{i=1}^{N} [t_i − g_i(τ)] − Nτ } = 0

g_i(τ) = [ t_i(mn) e^{−t_i(mn)/τ} − t_i(mx) e^{−t_i(mx)/τ} ] / [ e^{−t_i(mn)/τ} − e^{−t_i(mx)/τ} ]

This equation may be solved by Newton's method, which we applied in the discussion of non-linear equations. This is an iterative approach. To determine the variance, take the inverse of the 2nd derivative of the log-likelihood. If we define F = ∑_{i=1}^{N} [t_i − g_i(τ)] − Nτ, then:

∂²ln(L)/∂τ² = −2F/τ³ + F′/τ²

and at τ = τ_0, the lifetime extracted from the likelihood estimator, the first term vanishes since F(τ_0) = 0. Thus the standard deviation, using σ² = −[∂²l/∂P²]⁻¹, is:

σ² ≈ τ_0² / ( N [ ⟨g′⟩_0 − 1 + (2/τ_0)(⟨t⟩ − ⟨g⟩_0) ] ),    ⟨g⟩ = (1/N) ∑_{i=1}^{N} g_i

As another example, we wish to fit a measurement of a particle mass, M_0, to a Breit-Wigner distribution:

F(M_i) = (Γ/2) / [ (M_i − M_0)² + (Γ/2)² ] + a[ 1 + b(M_i − M_0) + c(M_i − M_0)² ]

The second term in the above is the assumed background, with parameters a, b, c. The resonance expression has parameters Γ and M_0. We allow a fit for M_mn < M_0 < M_mx, so we need to keep the likelihood normalized over this interval. There are then 5 independent parameters. The log-likelihood is defined as:

l = ∑ ln(y_i)

y_i = F(M_i) / ∫_{M_mn}^{M_mx} dM_i F(M_i)

Note that the normalization marginalizes out M_i in the denominator. Remember that at each step in an iteration one must keep the likelihood normalized. In the above case the normalization integral can be evaluated analytically. It will depend on the parameters to be fitted, so when the extremum of l is taken, the normalization contributes to the derivative. This procedure, while straightforward, requires significant computation.
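A sketch of the lifetime fit (my own illustration: simulated decay times with assumed per-event acceptance windows, maximizing the truncated-exponential log-likelihood with a simple scan rather than Newton's method):

```python
import numpy as np

rng = np.random.default_rng(seed=8)

tau_true, n_events = 2.0, 2000
# Assumed per-event measurement windows [t_mn, t_mx].
t_mn = rng.uniform(0.1, 0.5, n_events)
t_mx = t_mn + rng.uniform(3.0, 6.0, n_events)

# Generate times from the truncated exponential by inverting its CDF.
u = rng.random(n_events)
t = -tau_true * np.log(np.exp(-t_mn / tau_true)
                       - u * (np.exp(-t_mn / tau_true) - np.exp(-t_mx / tau_true)))

def log_likelihood(tau):
    norm = np.exp(-t_mn / tau) - np.exp(-t_mx / tau)
    return np.sum(-t / tau - np.log(norm)) - n_events * np.log(tau)

taus = np.linspace(1.5, 2.5, 1001)
lnL = np.array([log_likelihood(x) for x in taus])
tau_0 = taus[np.argmax(lnL)]
print("fitted lifetime tau_0 =", tau_0)
```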

As a final example, we wish to fit a circle to the motion of a charged particle moving in a magnetic field. The equations of non-relativistic motion are (the result is also relativistically correct):

m dV_x/dt = qB_z V_y
m dV_y/dt = −qB_z V_x
m dV_z/dt = 0

Ignore motion along the field direction. The coupled equations for the velocity and position are solved to obtain:

X = X_0 + R sin(ωt + φ)
Y = Y_0 + R cos(ωt + φ)
ω = qB_z/m

There are 3 unknown parameters: the center of the circular motion, (X_0, Y_0), and the radius of the circle, R. Now we set up a least-squares estimator of the form:

F = ∑_{i=1}^{n} [D_i² − R²]²
D_i² = (X_i − X_0)² + (Y_i − Y_0)²

Then minimize the estimator F:

∂F/∂X_0 ∝ ∑_i (X_i − X_0)[D_i² − R²] = 0
∂F/∂Y_0 ∝ ∑_i (Y_i − Y_0)[D_i² − R²] = 0
∂F/∂R ∝ ∑_i [D_i² − R²] = 0

Solving these equations:

R² = (1/N) ∑_i D_i²
X_0 = (FC − GB)/(AC − B²)
Y_0 = (GA − FB)/(CA − B²)

A = (∑_i X_i)² − N ∑_i X_i²
B = (∑_i Y_i)(∑_i X_i) − N ∑_i X_i Y_i
C = (∑_i Y_i)² − N ∑_i Y_i²
F = (1/2)[∑_i X_i² + ∑_i Y_i²] ∑_i X_i − (N/2)[∑_i X_i³ + ∑_i Y_i² X_i]
G = (1/2)[∑_i X_i² + ∑_i Y_i²] ∑_i Y_i − (N/2)[∑_i Y_i³ + ∑_i X_i² Y_i]

(Here A, B, C, F, G are coefficients built from the data sums; this F is not to be confused with the estimator F above.) Note that the value of R is determined by the mean value of R², which weights larger values of R more significantly. The error matrix has off-diagonal terms, so the parameters are correlated; for example:

∂²F/∂X_0 ∂R = 8R ∑_i (X_i − X_0)
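A sketch of this algebraic circle fit (my own implementation of the A, B, C, F, G formulas above, tested on fabricated points scattered about a known circle):

```python
import numpy as np

rng = np.random.default_rng(seed=9)

# Fabricated hits on a circle of radius 3 centred at (1, -2), with noise.
phi = rng.uniform(0.0, 2.0 * np.pi, 50)
X = 1.0 + 3.0 * np.cos(phi) + rng.normal(0.0, 0.02, phi.size)
Y = -2.0 + 3.0 * np.sin(phi) + rng.normal(0.0, 0.02, phi.size)
N = X.size

Sx, Sy = X.sum(), Y.sum()
Sxx, Syy, Sxy = (X**2).sum(), (Y**2).sum(), (X * Y).sum()

A = Sx**2 - N * Sxx
B = Sy * Sx - N * Sxy
C = Sy**2 - N * Syy
F = 0.5 * (Sxx + Syy) * Sx - 0.5 * N * ((X**3).sum() + (Y**2 * X).sum())
G = 0.5 * (Sxx + Syy) * Sy - 0.5 * N * ((Y**3).sum() + (X**2 * Y).sum())

X0 = (F * C - G * B) / (A * C - B**2)
Y0 = (G * A - F * B) / (C * A - B**2)
R = np.sqrt(np.mean((X - X0)**2 + (Y - Y0)**2))   # R^2 = mean of D_i^2
print("X0 =", X0, " Y0 =", Y0, " R =", R)
```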

13 Hypothesis testing

In developing a χ² (or another) test, or in the generation of simulated data to provide design and physical insight into a problem, a random selection of events from a probability distribution is required. This selection is initially begun by obtaining a set of random numbers from a computer program. Obviously, truly random numbers cannot be calculated, but pseudo-random strings of numbers which pass certain checks on randomness are available in computer routines. The simplest of these are linear congruential generators; they have a finite length before they repeat, depending on their sophistication and the number of computer bits. There can also be correlations between random numbers in the string. Therefore, one should always use a well-designed random number generator.

The selection of a random deviate from any probability distribution P(x) can be obtained using the rejection method. Two random numbers, a_1 and a_2, are selected: the first chooses the variable, x = a_1 x_max, and the second selects a probability height, a_2 f_max. If a_2 f_max ≤ P(x) then the event is accepted; otherwise the process is repeated. This technique is illustrated in Figure 5.
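A sketch of the rejection method (an illustration; the target P(x) here is an arbitrary example on [0, x_max], and f_max is an upper bound on its height):

```python
import numpy as np

rng = np.random.default_rng(seed=10)

# Example target density on [0, x_max] (normalization is irrelevant here).
x_max = 3.0

def P(x):
    return x * np.exp(-x)                 # peaks at x = 1

f_max = P(1.0)                            # bound on the density height

def rejection_sample(n):
    out = []
    while len(out) < n:
        x = rng.uniform(0.0, x_max)       # a_1 chooses the variable
        y = rng.uniform(0.0, f_max)       # a_2 chooses the height
        if y <= P(x):                     # accept if under the curve
            out.append(x)
    return np.array(out)

samples = rejection_sample(10000)
print("sample mean =", samples.mean())    # compare with the shape of P(x)
```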

Figure 5: A graphical illustration of the rejection method to select an event from a general probability distribution

The procedure outlined in the last section describes how well a model fits the data. The χ² of the fit is expected to be approximately normally distributed about the mean value of the number of degrees of freedom, ν, with a variance σ² = 2ν. Using the error function we can then determine the probability, F, that a repetition of the experiment would give a χ² exceeding the observed value. If the experiment were repeated N times, the obtained value of χ² would be exceeded approximately F·N times. After completion of the analysis, there are several possibilities to keep in mind:

1. We could reject a true hypothesis due to a statistical fluctuation.
2. We could accept a false hypothesis due to a good χ².
3. When testing a hypothesis one should test the entire distribution.
4. One could reject data points that lie outside some χ² cut. However, this is dangerous, as it may affect the statistical analysis.

Note that we have no way of testing all hypotheses, so we cannot directly compare them and test the probability of a hypothesis being correct. We can test only how well a particular model represents the data.
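A short sketch of this tail-probability calculation (illustrative, with made-up numbers; it uses scipy's exact χ² survival function alongside the normal approximation mentioned above):

```python
import numpy as np
from scipy import stats

chi2_obs = 25.0   # observed chi^2 from a fit
ndof = 17         # degrees of freedom, nu = N - L

# Exact tail probability F = P(chi^2 > chi2_obs | nu).
F_exact = stats.chi2.sf(chi2_obs, ndof)

# Normal approximation: chi^2 ~ N(nu, 2*nu) for large nu.
F_gauss = stats.norm.sf(chi2_obs, loc=ndof, scale=np.sqrt(2.0 * ndof))

print("exact tail probability F =", F_exact)
print("normal approximation   F =", F_gauss)
```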

14 Combining measurements

Suppose we have a measurement and obtain a set of variables, a_1(i), with errors, σ_1(i), representing the data at a minimum in χ². Now suppose we make a 2nd measurement, obtaining the values a_2(i) and σ_2(i). Bayes' theorem can be used to show that these may be combined to obtain a new set of parameters and errors. For statistically independent measurements of the same variable:

a(i) = [ a_1(i)/σ_1²(i) + a_2(i)/σ_2²(i) ] / [ 1/σ_1²(i) + 1/σ_2²(i) ]
1/σ² = 1/σ_1²(i) + 1/σ_2²(i)

If the measurements are not statistically independent, try to find an independent set of variables, combine the errors, and then transform the matrix of errors back to the original set of variables. If this cannot be done, then use the covariance to obtain the off-diagonal terms and develop the variance as described in an earlier section.

Near the edge of a physical region one can apply Bayes' theorem. Suppose one is measuring a mass near zero. The mass cannot be negative, so the result must have a lower bound. Using Bayes' theorem one uses a prior which excludes the possibility that the mass < 0. Using a flat prior, for example:

P(Model) = 1 for 0 ≤ M ≤ M_max, and 0 otherwise

Then the normalization must be taken over 0 ≤ M ≤ M_max as the posterior and prior assumptions are iterated. The likelihood simply contains the probability of the instrumental efficiency for the detection of the mass.

14.1 Uncertainty due to systematic error

Suppose a measurement with a normally distributed result. There are also calibration errors and other instrumental errors which are not associated with event counting. Generally, the inclusion of systematic error is difficult to handle unless this error can be modeled. If the systematic errors are normally distributed, they can be included by addition in quadrature, assuming statistical independence:

σ² = σ²(counting) + ∑_i σ²(i)

Most often the systematic error is not normal, and for other distributions the error model should be entered into a probability density function. A problem arises due to the fact that the measurements produce correlated errors, so that the likelihood function is not a product of the separate probabilities. However, one may assume that the sum and difference of the errors are normally distributed and then compute an average value of the error as explained in a section above.
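A small sketch of these combination rules (my own numbers): two independent measurements of the same quantity are averaged with 1/σ² weights, and an independent, modeled systematic error is then added in quadrature.

```python
import numpy as np

# Two independent measurements of the same quantity.
a1, s1 = 10.2, 0.4
a2, s2 = 9.7, 0.3

w1, w2 = 1.0 / s1**2, 1.0 / s2**2
a = (a1 * w1 + a2 * w2) / (w1 + w2)     # weighted average
sigma = np.sqrt(1.0 / (w1 + w2))        # 1/sigma^2 = 1/s1^2 + 1/s2^2
print("combined value =", a, "+-", sigma)

# Adding an independent (modeled) systematic error in quadrature.
sigma_sys = 0.25
sigma_total = np.sqrt(sigma**2 + sigma_sys**2)
print("with systematic:", a, "+-", sigma_total)
```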

As an example, suppose a measurement with a normally distributed signal. We also assume a normally distributed systematic offset, z, and take these to be independent variables. The prior probability is then assumed to be:

P(M) = P(µ_x) P(z) = A_1 [1/(√(2π) σ_z)] e^{−z²/(2σ_z²)}

The likelihood is:

P(x|µ_x, z) = [1/(√(2π) σ)] e^{−(x − µ_x − z)²/(2σ²)}

P(µ_x|x) = [1/(2π σ σ_z)] ∫ dz e^{−(x − µ_x − z)²/(2σ²)} e^{−z²/(2σ_z²)} / Norm

Norm = [1/(2π σ σ_z)] ∫ dz ∫ dµ_x e^{−(x − µ_x − z)²/(2σ²)} e^{−z²/(2σ_z²)}

This results in the expected:

P(µ_x|x) = [1/(√(2π) σ_T)] e^{−(x − µ_x)²/(2σ_T²)},    σ_T² = σ² + σ_z²

In this case the systematic error is included as would be any independent variable. If the error has a mean different from zero, this is handled by inserting z → (z − z_0). In the situation where several variables are measured with the same apparatus, they will be correlated through the systematic error. The inclusion of systematic error can be handled if it can be modeled.

15 Example

Suppose an instrument measuring energy. Its absolute energy calibration is assumed to lie between 1% and 10%. The statistical error of a test measurement is 18%. The measured energy is 30 units. Proceed as follows.

1. The energy is corrected using the best guess for the calibration, (10 − 1)%/2 ≈ 5%. This has a statistical uncertainty of (18/√E)%:
   30 (1 + 0.05 ± 0.18/√(30 × 1.05)) = 31.5 ± 1.0

2. Include the uncertainty in the calibration constant. This uncertainty is flat over the (1–10)% range, so the standard deviation for a flat distribution is approximately (0.1/√12)(31.5) ≈ 0.9.

3. Thus σ² = (1.0)² + (0.9)² ≈ 1.8, so σ ≈ 1.3.
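The arithmetic of this example as a quick sketch (same numbers as above):

```python
import numpy as np

E_measured = 30.0
calib_correction = 0.05                     # best-guess 5% calibration shift
E = E_measured * (1.0 + calib_correction)   # 31.5

sigma_stat = 0.18 * np.sqrt(E)              # (18/sqrt(E))% of E  -> about 1.0
sigma_calib = (0.10 / np.sqrt(12.0)) * E    # flat 10%-wide calibration band -> about 0.9

sigma_total = np.sqrt(sigma_stat**2 + sigma_calib**2)
print("E =", E, "+-", round(sigma_total, 2))   # close to the 31.5 +- 1.3 quoted above
```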

16 Robust estimators

In any experimental data there are always some points that lie well outside reasonable error. Such points certainly occur for a normal distribution, and these points, particularly when using a χ² test, can influence the result. Note that points far from the mean contribute the most to χ². Robust estimators are less sensitive to these points, and can be used in some cases to trim the data. BUT be careful: the use of these estimators is not based on statistics. One such estimator uses the likelihood function:

ρ = ∑_i |Y_i − y(x_i)| / σ_i

This weights outlying points equally with those close to the mean. Another distribution, which results in a Lorentzian probability, is:

ρ = ∑_i ln[ 1 + (1/2) ((Y_i − y(x_i))/σ_i)² ]

With this latter distribution the weighting factor increases and then decreases as |Y_i − y(x_i)|/σ_i increases. An illustration of the application of a robust estimator is shown in Figure 6.

It is also possible to remove a number, n/2, of events furthest above the mean and re-determine the estimators; then remove the n/2 events furthest below the mean and re-determine the estimators; then take the average of the two results. This assumes, of course, that the distribution has a mean. As indicated previously, the Cauchy distribution does not have a mean. Table 1 gives the best estimator of the mean for various distributions.

Table 1: The best estimator of the mean for various probability distributions

Distribution    Minimum Variance Estimator
Normal          Simple mean
Uniform         Midrange
Cauchy          Maximum likelihood estimate

Figure 6: An illustration of the application of a robust estimator to better fit a straight line to data with outliers
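A sketch comparing an ordinary least-squares line with a robust fit that minimizes the Lorentzian estimator above (my own illustration with a few artificial outliers; scipy's minimizer is used for the robust loss):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=11)

# Straight-line data y = 2x + 1 with noise, plus two gross outliers.
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, x.size)
y[5] += 8.0
y[15] -= 10.0
sigma = 0.3

def lorentzian_rho(params):
    a, b = params
    z = (y - (a * x + b)) / sigma
    return np.sum(np.log(1.0 + 0.5 * z**2))

# Ordinary least squares (sensitive to the outliers).
a_ls, b_ls = np.polyfit(x, y, 1)

# Robust fit: minimize the Lorentzian estimator.
res = minimize(lorentzian_rho, x0=[a_ls, b_ls], method="Nelder-Mead")
print("least squares : a =", a_ls, " b =", b_ls)
print("robust fit    : a =", res.x[0], " b =", res.x[1])
```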