BACKGROUND NOTES FYS 4550/FYS9550 - EXPERIMENTAL HIGH ENERGY PHYSICS, AUTUMN 2016. PROBABILITY. A. STRANDLIE, NTNU AT GJØVIK AND UNIVERSITY OF OSLO
Before embarking on the concept of probability, we first define a few other concepts. A stochastic experiment is characterized by:
- All possible elementary outcomes of the experiment are known
- Only one of the outcomes can occur in a single experiment
- The outcome of an experiment is not known a priori
Example: throwing a die
- Outcomes are S = {1, 2, 3, 4, 5, 6}
- You can only observe one of these each time you throw
- You don't know beforehand what you will observe
The set S is called the sample space of the experiment.
An event A is one or more outcomes which satisfy certain specifications. Example: A = odd number when throwing a die. An event is therefore also a subset of S; here A = {1, 3, 5}. If B = even number, which subset of S does B describe? The probability of occurrence of an event A, P(A), is a number between 0 and 1. Intuitively, a value of P(A) close to 0 means that A occurs very rarely in an experiment, whereas a value close to 1 means that A occurs very often.
There are three ways of quantifying probability:
1. Classical approach, valid when all outcomes can be assumed equally likely. Probability is defined as the number of favourable outcomes for a given event divided by the total number of outcomes. Example: throwing a die has N = 6 different outcomes. Assume the event A = observing six spots. Only n = 1 of the outcomes is favourable for A, so P(A) = n/N = 1/6 ≈ 0.167.
2. Approach based on the convergence value of the relative frequency for a very large number of repeated, identical experiments. Example: throwing a die and recording the relative frequency of occurrence of A for various numbers of trials.
3. Subjective approach, reflecting the degree of belief in the occurrence of a certain event A. A possible guideline: the convergence value of a large number of hypothetical experiments.
(Figure: relative frequency converging towards the true probability, plotted against the logarithm (base 10) of the number of trials.)
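The convergence of approach 2 towards the classical value from approach 1 can be illustrated with a small simulation. The following is a sketch (not part of the original slides); the function name, the fixed seed and the trial counts are choices made here for illustration.

```python
import random

def relative_frequency(n_throws, event=(6,), seed=1):
    """Throw a fair die n_throws times and return the relative
    frequency with which the event occurs (approach 2 above)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_throws) if rng.randint(1, 6) in event)
    return hits / n_throws

# The relative frequency approaches the classical value 1/6 = 0.167
# as the number of trials grows.
for n in (10, 1000, 100000):
    print(n, relative_frequency(n))
```

With only 10 throws the relative frequency fluctuates strongly; with 100 000 throws it is close to 1/6, as in the figure above.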
Approach 2 forms the basis of frequentist statistics, whereas approach 3 is the baseline of Bayesian statistics. These are two different schools. When estimating parameters from a set of data, the two approaches usually give the same numbers for the estimates if there is a large amount of data. If little data is available, the estimates might differ. There is no easy way of determining which approach is best, and both approaches are advocated in high-energy physics experiments. We will not enter any further into such questions in this course.
We now look at probabilities of combinations of events. We need some concepts from set theory:
- The union A ∪ B is a new event which occurs if A or B or both occur.
- Two events are disjoint if they cannot occur simultaneously.
- The intersection A ∩ B is a new event which occurs if both A and B occur.
- The complement Ā is a new event which occurs if A does not occur.
(Venn diagram: sample space S of outcomes containing events A and B, with an event C disjoint from both A and B.)
The mathematical axioms of probability:
1. Probability is never negative: P(A) ≥ 0.
2. The probability of the event corresponding to the entire sample space S, i.e. the probability of observing any of the possible outcomes of the experiment, is equal to unity: P(S) = 1.
3. Probability must comply with the addition rule for disjoint events: P(A₁ ∪ A₂ ∪ … ∪ Aₙ) = P(A₁) + P(A₂) + … + P(Aₙ).
A couple of useful formulas can be derived from the axioms:
P(Ā) = 1 − P(A)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Concept of conditional probability: what is the probability of occurrence of A given that we know B will occur, i.e. P(A|B)?
Recalling the definition of probability as the number of favourable outcomes divided by the total number of outcomes, we get:
P(A|B) = N(A∩B)/N(B) = (N(A∩B)/N_tot)/(N(B)/N_tot) = P(A∩B)/P(B)
Example: throwing a die. A = {2, 4, 6}, B = {3, 4, 5, 6}. What is P(A|B)?
A ∩ B = {4, 6}, so P(A∩B) = 2/6 = 1/3 and P(B) = 4/6 = 2/3, giving
P(A|B) = (1/3)/(2/3) = 1/2
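The worked example can be checked by exhaustive enumeration of the sample space. This is a small sketch added here (the helper name `prob` is not from the slides):

```python
# Exhaustive check of the worked example: S = {1,...,6},
# A = even number, B = at least 3 spots.
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {3, 4, 5, 6}

def prob(event, sample_space=S):
    """Classical probability: favourable outcomes / total outcomes."""
    return len(event & sample_space) / len(sample_space)

p_A_given_B = prob(A & B) / prob(B)   # P(A|B) = P(A∩B)/P(B)
print(p_A_given_B)                    # 0.5, as derived above
```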
Important observation: A ∩ B and A ∩ B̄ are disjoint, and A = (A ∩ B) ∪ (A ∩ B̄).
Therefore:
P(A) = P(A ∩ B) + P(A ∩ B̄) = P(A|B)P(B) + P(A|B̄)P(B̄)
Expressing P(A) in terms of a subdivision of S into a set of other, disjoint events is called the law of total probability. The general formulation of this law is:
P(A) = Σᵢ P(A|Bᵢ)P(Bᵢ)
where all {Bᵢ} are disjoint and span the entire sample space S.
From the definition of conditional probability it follows:
P(A ∩ B) = P(A|B)P(B) = P(B|A)P(A)
A quick manipulation gives:
P(B|A) = P(A|B)P(B) / P(A)
which is called Bayes' theorem.
By using the law of total probability, one ends up with the general formulation of Bayes' theorem:
P(Bᵢ|A) = P(A|Bᵢ)P(Bᵢ) / Σⱼ P(A|Bⱼ)P(Bⱼ)
which is an extremely important result in statistics. Particularly in Bayesian statistics this theorem is often used to update or refine the knowledge about a set of unknown parameters by the introduction of information from new data.
This can be explained by a rewrite of Bayes' theorem:
P(parameters|data) ∝ P(data|parameters) · P(parameters)
P(data|parameters) is often called the likelihood, P(parameters) denotes the prior knowledge of the parameters, whereas P(parameters|data) is the posterior probability of the parameters given the data. If P(parameters) cannot be deduced by any objective means, a subjective belief about its value is used in Bayesian statistics. Since there is no fundamental rule describing how to deduce this prior probability, Bayesian statistics is still debated, also in high-energy physics!
Definition of independence of events A and B: P(A|B) = P(A), i.e. any given information about B does not affect the probability of observing A. Physically this means that the events A and B are uncorrelated. For practical applications such independence cannot usually be derived but rather has to be assumed, given the nature of the physical problem one intends to model. General multiplication rule for independent events A₁, A₂, …, Aₙ:
P(A₁ ∩ A₂ ∩ … ∩ Aₙ) = P(A₁)P(A₂)…P(Aₙ)
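As a quick sketch of the multiplication rule (added here, not from the slides), the probability of a six on each of two independent throws can be computed both by the rule and by enumerating all 36 equally likely outcome pairs:

```python
from itertools import product

# Multiplication rule for two independent throws of a fair die:
# P(six on 1st and six on 2nd) = (1/6) * (1/6) = 1/36.
p_product = (1 / 6) * (1 / 6)

# Exhaustive check over all 36 equally likely outcome pairs.
outcomes = list(product(range(1, 7), repeat=2))
p_enumerated = sum(1 for a, b in outcomes if a == 6 and b == 6) / len(outcomes)
print(p_product, p_enumerated)  # both 1/36
```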
Stochastic or random variable: a number which can be attached to all outcomes of an experiment. Example: throwing two dice and recording the sum of the number of spots. Mathematical terminology: a real-valued function defined over the elements of the sample space S of an experiment. A capital letter is often used to denote a random variable, for instance X. Simulation experiment: throwing two dice N times, recording the sum of spots each time and calculating the relative frequency of occurrence of each of the outcomes.
(Figures: observed relative frequencies (blue columns) versus theoretically expected relative frequencies (red columns) for N = 10, 20, 100, 1000, 10 000, 100 000, 1 000 000 and 10 000 000 throws.)
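The simulation experiment behind those figures can be sketched as follows (the seed and the closed-form expression (6 − |s − 7|)/36 for the expected probabilities are additions made here):

```python
import random
from collections import Counter

def simulate_two_dice(n, seed=42):
    """Relative frequencies of the sum of spots for n throws of two dice."""
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n))
    return {s: counts[s] / n for s in range(2, 13)}

# Theoretically expected probabilities: f(s) = (6 - |s - 7|) / 36.
expected = {s: (6 - abs(s - 7)) / 36 for s in range(2, 13)}

freq = simulate_two_dice(100000)
print(freq[7], expected[7])  # relative frequency of a sum of 7 vs. 6/36
```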
The relative frequencies seem to converge towards the theoretically expected probabilities. Such a diagram is an expression of a probability distribution: a list of all the different values of a random variable together with the associated probabilities. Mathematically: a function f(x) = P(X = x) defined for all possible values x of X given by the experiment at hand. The values of X can be discrete, as in the previous example, or continuous. For continuous x, f(x) is called a probability density function. Simulation experiment: height of Norwegian men. Collecting data, calculating relative frequencies of occurrence in intervals of various widths.
(Figures: height histograms with interval widths 10 cm, 5 cm, 1 cm and 0.5 cm; in the limit of zero interval width one obtains a continuous probability distribution.)
Cumulative distribution function: F(a) = P(X ≤ a).
For discrete random variables: F(a) = Σ_{xᵢ ≤ a} f(xᵢ)
For continuous random variables: F(a) = ∫_{−∞}^{a} f(x) dx
It follows that P(a < X ≤ b) = F(b) − F(a). For continuous variables:
P(a < X ≤ b) = ∫ₐᵇ f(x) dx
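For a discrete variable the two relations above can be sketched directly; the following minimal example (added here) uses a single fair die:

```python
# Cumulative distribution function for a discrete variable: a single
# fair die, with f(x) = 1/6 for x = 1..6.
f = {x: 1 / 6 for x in range(1, 7)}

def F(a):
    """F(a) = P(X <= a) = sum of f(x_i) over all x_i <= a."""
    return sum(p for x, p in f.items() if x <= a)

# P(a < X <= b) = F(b) - F(a): probability of 3, 4 or 5 spots.
p = F(5) - F(2)
print(p)  # 3/6 = 0.5
```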
(Figures: shaded areas under the density curve illustrating P(a < X < b), P(X < b) and P(X > a).)
A function u(X) of a random variable X is also a random variable. The expectation value of such a function is:
E[u(X)] = ∫ u(x) f(x) dx
Two very important special cases are:
μ = E[X] = ∫ x f(x) dx (mean)
σ² = Var[X] = E[(X − μ)²] = ∫ (x − μ)² f(x) dx (variance)
The mean μ is the most important measure of the centre of the distribution of X. The variance σ², or its square root σ, the standard deviation, is the most important measure of the spread of the distribution of X around the mean. The mean is the first moment of X, whereas the variance is the second central moment of X. In general, the n-th moment of X is
μₙ′ = E[Xⁿ] = ∫ xⁿ f(x) dx
The n-th central moment is
mₙ = E[(X − μ)ⁿ] = ∫ (x − μ)ⁿ f(x) dx
Another measure of the centre of the distribution of X is the median, defined by
F(x_med) = 1/2
or, in words, the value of X above which half of the probability lies and below which the other half lies.
Assume now that X and Y are two random variables with a joint probability density function (pdf) f(x, y). The marginal pdf of X is
f₁(x) = ∫ f(x, y) dy
whereas the marginal pdf of Y is
f₂(y) = ∫ f(x, y) dx
The mean values of X and Y are
μ_X = ∫∫ x f(x, y) dx dy = ∫ x f₁(x) dx
μ_Y = ∫∫ y f(x, y) dx dy = ∫ y f₂(y) dy
The covariance of X and Y is
cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E[XY] − μ_X μ_Y
If several random variables are considered simultaneously, one frequently arranges the variables in a stochastic or random vector X = (X₁, X₂, …, Xₙ)ᵀ. The covariances are then naturally displayed in a covariance matrix:
cov(X) = [ cov(X₁,X₁)  cov(X₁,X₂)  …  cov(X₁,Xₙ)
           cov(X₂,X₁)  cov(X₂,X₂)  …  cov(X₂,Xₙ)
           …
           cov(Xₙ,X₁)  cov(Xₙ,X₂)  …  cov(Xₙ,Xₙ) ]
If two variables X and Y are independent, the joint pdf can be written
f(x, y) = f₁(x) f₂(y)
The covariance of X and Y vanishes in this case (why?), and the variances add: V(X+Y) = V(X) + V(Y). If X and Y are not independent, the general formula is
V(X+Y) = V(X) + V(Y) + 2 cov(X, Y)
For n mutually independent random variables the covariance matrix becomes diagonal, i.e. all off-diagonal terms are identically zero.
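The general formula can be checked by a small Monte Carlo sketch (added here; the construction Y = X + Z and the seed are arbitrary choices that deliberately make X and Y correlated):

```python
import random
from statistics import variance, mean

# Monte Carlo check of V(X+Y) = V(X) + V(Y) + 2 cov(X,Y) for two
# deliberately correlated variables: Y = X + Z, with Z independent of X.
rng = random.Random(7)
xs = [rng.gauss(0, 1) for _ in range(50000)]
zs = [rng.gauss(0, 1) for _ in range(50000)]
ys = [x + z for x, z in zip(xs, zs)]
sums = [x + y for x, y in zip(xs, ys)]

mx, my = mean(xs), mean(ys)
cov_xy = mean((x - mx) * (y - my) for x, y in zip(xs, ys))

lhs = variance(sums)
rhs = variance(xs) + variance(ys) + 2 * cov_xy
print(lhs, rhs)  # agree up to sampling fluctuations; both close to 5
```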
If a random vector Y = (Y₁, Y₂, …, Yₙ) is related to a vector X with pdf f(x) by a function Y(X), the pdf of Y is
g(y) = f(x(y)) |J|
where |J| is the absolute value of the determinant of the matrix J. This matrix is the so-called Jacobian of the transformation from Y to X:
J = [ ∂x₁/∂y₁  …  ∂x₁/∂yₙ
      …
      ∂xₙ/∂y₁  …  ∂xₙ/∂yₙ ]
The transformation of the covariance matrix is
cov(Y) = J⁻¹ cov(X) (J⁻¹)ᵀ
where the inverse of J is
J⁻¹ = [ ∂y₁/∂x₁  …  ∂y₁/∂xₙ
        …
        ∂yₙ/∂x₁  …  ∂yₙ/∂xₙ ]
The transformation from x to y must be one-to-one, such that the inverse functional relationship exists.
Obtaining cov(Y) from cov(X) as on the previous slide is a much-used technique in high-energy physics data analysis. It is called linear error propagation and is applicable any time one wants to transform from one set of estimated parameters to another:
- transformation between different sets of parameters describing a reconstructed particle track
- transport of track parameters from one location in a detector to another.
We will see examples later in the course.
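As an illustrative sketch (added here, not from the slides), consider propagating a covariance matrix through the transformation from Cartesian coordinates (x, y) to polar coordinates (r, φ). The matrix of derivatives ∂yᵢ/∂xⱼ used below is the J⁻¹ of the previous slide; the measurement point and uncertainties are invented numbers:

```python
import math

def polar_covariance(x, y, cov_xy):
    """Linear error propagation for (x, y) -> (r, phi) =
    (sqrt(x^2 + y^2), atan2(y, x)): cov(r, phi) = A cov(x, y) A^T,
    where A_ij is the matrix of derivatives of (r, phi) w.r.t. (x, y)."""
    r = math.hypot(x, y)
    A = [[x / r,      y / r],      # dr/dx,   dr/dy
         [-y / r**2,  x / r**2]]   # dphi/dx, dphi/dy
    # A * cov * A^T, written out for the 2x2 case
    AC = [[sum(A[i][k] * cov_xy[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(AC[i][k] * A[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Hypothetical measurement at (3, 4) with uncorrelated uncertainties
# sigma_x = sigma_y = 0.1, i.e. variances 0.01 on the diagonal.
cov_rphi = polar_covariance(3.0, 4.0, [[0.01, 0.0], [0.0, 0.01]])
print(cov_rphi)
```

For an isotropic input covariance the radial variance stays at σ² = 0.01 while the angular variance scales as σ²/r².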
The characteristic function Φ(u) associated with the pdf f(x) is the Fourier transform of f(x):
Φ(u) = E[e^{iuX}] = ∫ e^{iux} f(x) dx
Such functions are useful in deriving results about the moments of random variables. The relation between Φ(u) and the moments of X is
dⁿΦ/duⁿ |_{u=0} = iⁿ ∫ xⁿ f(x) dx = iⁿ E[Xⁿ]
If Φ(u) is known, all moments of f(x) can be calculated without knowledge of f(x) itself.
Some common probability distributions:
- Binomial distribution
- Poisson distribution
- Gaussian distribution
- Chi-square distribution
- Student's t distribution
- Gamma distribution
We will take a closer look at some of them.
Binomial distribution: assume that we make n identical experiments with only two possible outcomes, success or no success. The probability of success p is the same for all experiments, and the individual experiments are independent of each other. The probability of x successes out of n trials is then
P(X = x) = C(n, x) pˣ (1 − p)ⁿ⁻ˣ
where C(n, x) = n!/(x!(n − x)!) is the binomial coefficient.
Example: throwing a die n times, defining the event of success to be the occurrence of six spots in a throw. The probability of success is p = 1/6.
(Figures: probability distributions for the number of successes in 5, 15 and 50 throws. Anything familiar about the shape of the last distribution?)
Mean value and variance:
E[X] = np
Var[X] = np(1 − p)
Five throws of a die:
E[# six spots] = 5/6
Var[# six spots] = 25/36
Std[# six spots] = 5/6
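The formulas E[X] = np and Var[X] = np(1 − p) can be verified directly from the pmf for the five-throw example. A minimal sketch (added here):

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Five throws of a die, "success" = six spots, p = 1/6.
n, p = 5, 1 / 6
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = sum(x * f for x, f in enumerate(pmf))
var = sum((x - mean)**2 * f for x, f in enumerate(pmf))
print(mean, var)  # 5/6 and 25/36, matching E[X] = np and Var[X] = np(1-p)
```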
Poisson distribution: the number of occurrences of an event A per given time length, area, volume or interval is constant and equal to λ. The probability distribution of observing x occurrences in the interval is
P(X = x) = λˣ e^{−λ} / x!
Both the mean value and the variance of X equal λ. Example: the number of particles in a beam passing through a given area in a given time is Poisson distributed. If the average number λ is known, the probabilities for all x can be calculated according to the formula above.
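A quick sketch of that calculation (added here; the beam rate λ = 3 is an invented number):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) = lam^x e^(-lam) / x!"""
    return lam**x * exp(-lam) / factorial(x)

# Hypothetical beam with an average of lam = 3 particles per time
# interval: probabilities of observing 0, 1, 2, ... particles.
lam = 3.0
probs = [poisson_pmf(x, lam) for x in range(20)]
mean = sum(x * p for x, p in enumerate(probs))
print(probs[0], mean)  # P(0) = e^-3 ~ 0.0498; mean ~ lam = 3
```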
Gaussian distribution: the most frequently occurring distribution in nature. Most measurement uncertainties, disturbances of the directions of charged particles penetrating through enough matter, the number of ionizations created by a charged particle in a slab of material etc. follow a Gaussian distribution. The main reason is the CENTRAL LIMIT THEOREM, which states that a sum of n independent random variables converges to a Gaussian distribution when n is large enough, irrespective of the individual distributions of the variables. The abovementioned examples are typically of this type.
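The central limit theorem can be sketched numerically (this example, including the choice of 12 uniform terms and the seed, is an addition): sums of 12 independent uniform(0, 1) variables have mean 6 and variance 12 · (1/12) = 1, and are approximately N(6, 1) even though each term is far from Gaussian.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)
sums = [sum(rng.random() for _ in range(12)) for _ in range(20000)]
print(mean(sums), stdev(sums))  # close to 6 and 1

# Fraction within one standard deviation of the mean: close to the
# Gaussian value of about 68 %.
within = sum(1 for s in sums if abs(s - 6) < 1) / len(sums)
print(within)
```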
Gaussian probability density function with mean value μ and standard deviation σ:
f(x; μ, σ²) = (1/(√(2π) σ)) exp(−(x − μ)²/(2σ²))
For a random vector X of size n with mean value μ and covariance matrix V, the multivariate Gaussian distribution is:
f(x; μ, V) = (1/((2π)^{n/2} √(det V))) exp(−½ (x − μ)ᵀ V⁻¹ (x − μ))
Usual terminology: X ~ N(μ, σ) means X is distributed according to a Gaussian (normal) distribution with mean value μ and standard deviation σ.
- 68 % of the distribution lies within plus/minus one σ.
- 95 % of the distribution lies within plus/minus two σ.
- 99.7 % of the distribution lies within plus/minus three σ.
Standard normal variable Z ~ N(0, 1): Z = (X − μ)/σ.
Quantiles of the standard normal distribution: z_α is defined by P(Z ≤ z_α) = 1 − α. The value z_α is denoted the 100·α % quantile of the standard normal distribution. Such quantiles can be found in tables or by computer programs.
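In Python, such quantiles are available from the standard library, so the tables on the following slides can be reproduced directly (a sketch added here):

```python
from statistics import NormalDist

# Quantiles of the standard normal distribution N(0, 1).
Z = NormalDist(0, 1)

z_10 = Z.inv_cdf(0.90)    # 10 % quantile: P(Z > z) = 0.10, about 1.28
z_05 = Z.inv_cdf(0.95)    # 5 % quantile, about 1.64
z_025 = Z.inv_cdf(0.975)  # 2.5 % quantile, about 1.96
print(z_10, z_05, z_025)

# About 68 % of the distribution lies within plus/minus one sigma.
print(Z.cdf(1) - Z.cdf(-1))
```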
(Figures: standard normal density showing the 10 % quantile, the 5 % quantile (1.64), and the 2.5 % quantile (1.96); 95 % of the area lies within plus/minus the 2.5 % quantile.)
χ² distribution: if X₁, …, Xₙ are independent Gaussian random variables with means μᵢ and standard deviations σᵢ, then
χ² = Σᵢ₌₁ⁿ (Xᵢ − μᵢ)²/σᵢ²
follows a χ² distribution with n degrees of freedom. It is often used in evaluating the level of compatibility between observed data and the assumed pdf of the data. Example: is the position of a measurement in a particle detector compatible with the assumed distribution of the measurement? The mean value is n and the variance is 2n.
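The defining sum and its mean value n can be sketched with a small simulation (added here; the choice n = 10, the measurement spread σ = 2 and the seed are arbitrary):

```python
import random
from statistics import mean

def chi_square(xs, mus, sigmas):
    """chi2 = sum_i ((x_i - mu_i) / sigma_i)^2 for independent
    Gaussian measurements x_i; follows a chi-square distribution
    with n = len(xs) degrees of freedom."""
    return sum(((x - m) / s)**2 for x, m, s in zip(xs, mus, sigmas))

rng = random.Random(3)
n = 10
chi2_values = [
    chi_square([rng.gauss(0, 2) for _ in range(n)], [0] * n, [2] * n)
    for _ in range(20000)
]
print(mean(chi2_values))  # close to n = 10 (the variance is close to 2n = 20)
```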
(Figure: chi-square distribution with 10 degrees of freedom.)