LECTURE NOTES FYS 4550/FYS 9550 EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO


1 LECTURE NOTES FYS 4550/FYS 9550 EXPERIMENTAL HIGH ENERGY PHYSICS AUTUMN 2013 PART I PROBABILITY AND STATISTICS A. STRANDLIE GJØVIK UNIVERSITY COLLEGE AND UNIVERSITY OF OSLO

2 Before embarking on the concept of probability, we will first define a set of other concepts. A stochastic experiment is characterized by: all possible elementary outcomes of the experiment are known; only one of the outcomes can occur in a single experiment; the outcome of an experiment is not known a priori. Example: throwing a die. The outcomes are S = {1, 2, 3, 4, 5, 6}. One can only observe one of these each time the die is thrown, and one doesn't know beforehand which one will be observed. The set S is called the sample space of the experiment.

3 An event A is one or more outcomes which satisfy certain specifications. Example: A = odd number when throwing a die. An event is therefore also a subset of S; here A = {1, 3, 5}. If B = even number, what is the subset of S describing B? The probability of occurrence of an event A, P(A), is a number between 0 and 1. Intuitively, a value of P(A) close to 0 means that A occurs very rarely in an experiment, whereas a value close to 1 means that A occurs very often.

4 There are three ways of quantifying probability: 1. Classical approach, valid when all outcomes can be assumed equally likely. Probability is defined as the number of favourable outcomes for a given event divided by the total number of outcomes. Example: throwing a die has N = 6 different outcomes. Assume the event A = observing six spots. Only n = 1 of the outcomes is favourable for A, so P(A) = n/N = 1/6. 2. Approach based on the convergence value of the relative frequency for a very large number of repeated, identical experiments. Example: throwing a die, recording the relative frequency of occurrence of A for various numbers of trials. 3. Subjective approach, reflecting the degree of belief in the occurrence of a certain event A. Possible guideline: convergence value of a large number of hypothetical experiments.
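The frequency interpretation of approach 2 can be illustrated with a short simulation, not part of the original notes; the function name and the seed are choices made here for reproducibility:

```python
import random

def relative_frequency(n_throws, seed=0):
    """Throw a fair die n_throws times and return the relative
    frequency of the event A = "six spots"."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_throws) if rng.randint(1, 6) == 6)
    return hits / n_throws

# The relative frequency approaches P(A) = 1/6 as the number of trials grows.
for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(n))
```

Plotting these values against the logarithm of the number of trials reproduces the convergence diagram on the next slide.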

5 [Figure: convergence of the relative frequency towards the true probability; relative frequency plotted against the logarithm (base 10) of the number of trials]

6 Approach 2 forms the basis of frequentist statistics, whereas approach 3 is the baseline of Bayesian statistics Two different schools When estimating parameters from a set of data, the two approaches usually give the same numbers for the estimates if there is a large amount of data If there is little available data, estimates might differ No easy way of determining which approach is best Both approaches advocated in high-energy physics experiments Will not enter any further into such questions in this course

7 Will now look at probabilities of combinations of events. Need some concepts from set theory: The union A ∪ B is a new event which occurs if A or B or both events occur. Two events are disjoint if they cannot occur simultaneously. The intersection A ∩ B is a new event which occurs if both A and B occur. The complement Ā is a new event which occurs if A does not occur.

8 [Venn diagram: sample space S of outcomes, with overlapping events A and B, and an event C disjoint with both A and B]

9 The mathematical axioms of probability: 1. Probability is never negative, P(A) ≥ 0. 2. The probability of the event which corresponds to the entire sample space S (i.e. the probability of observing any of the possible outcomes of the experiment) is equal to one, i.e. P(S) = 1. 3. Probability must comply with the addition rule for disjoint events: P(A_1 ∪ A_2 ∪ … ∪ A_n) = P(A_1) + P(A_2) + … + P(A_n). A couple of useful formulas which can be derived from the axioms: P(Ā) = 1 − P(A), and P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
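The addition rule for non-disjoint events can be checked by direct enumeration of die outcomes; this sketch is not from the notes, and the choice of events A and B here is arbitrary:

```python
from fractions import Fraction

S = set(range(1, 7))                 # sample space of a die throw
A = {x for x in S if x % 2 == 1}     # odd number of spots
B = {x for x in S if x >= 4}         # at least four spots

def P(event):
    # classical definition: favourable outcomes / total outcomes
    return Fraction(len(event), len(S))

lhs = P(A | B)
rhs = P(A) + P(B) - P(A & B)
assert lhs == rhs
print(lhs)
```

Using exact fractions avoids any floating-point round-off in the comparison.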

10 [Venn diagram: events A and B] Concept of conditional probability: what is the probability of occurrence of A given that we know B occurs, i.e. P(A|B)?

11 Recalling the definition of probability as the number of favourable outcomes divided by the total number of outcomes, we get: P(A|B) = N_{A∩B}/N_B = (N_{A∩B}/N_tot)/(N_B/N_tot) = P(A ∩ B)/P(B). Example: throwing a die. A = {2, 4, 6}, B = {3, 4, 5, 6}. What is P(A|B)? Here A ∩ B = {4, 6}, so P(A ∩ B) = 2/6 = 1/3 and P(B) = 4/6 = 2/3, giving P(A|B) = (1/3)/(2/3) = 1/2.
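The die example above can be reproduced by counting outcomes; a small sketch, not part of the notes:

```python
from fractions import Fraction

S = set(range(1, 7))
A = {2, 4, 6}          # even number of spots
B = {3, 4, 5, 6}       # at least three spots

def P(event):
    return Fraction(len(event), len(S))

# Counting favourable outcomes inside B only:
p_direct = Fraction(len(A & B), len(B))
# Via the definition P(A|B) = P(A ∩ B) / P(B):
p_def = P(A & B) / P(B)
assert p_direct == p_def
print(p_def)
```

Both routes give 1/2, as computed on the slide.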

12 [Venn diagram: event A split by B and its complement] Important observation: A ∩ B and A ∩ B̄ are disjoint!

13 Therefore: P(A) = P((A ∩ B) ∪ (A ∩ B̄)) = P(A ∩ B) + P(A ∩ B̄) = P(A|B) P(B) + P(A|B̄) P(B̄). Expressing P(A) in terms of a subdivision of S into a set of other, disjoint events is called the law of total probability. The general formulation of this law is P(A) = Σ_i P(A|B_i) P(B_i), where all {B_i} are disjoint and span the entire sample space S.

14 From the definition of conditional probability it follows: P(A ∩ B) = P(A|B) P(B) = P(B|A) P(A). A quick manipulation gives P(B|A) = P(A|B) P(B) / P(A), which is called Bayes' theorem.

15 By using the law of total probability, one ends up with the general formulation of Bayes' theorem: P(B_j|A) = P(A|B_j) P(B_j) / Σ_i P(A|B_i) P(B_i), which is an extremely important result in statistics. Particularly in Bayesian statistics this theorem is often used to update or refine the knowledge about a set of unknown parameters by the introduction of information from new data.

16 This can be explained by a rewrite of Bayes' theorem: P(parameters|data) ∝ P(data|parameters) P(parameters). P(data|parameters) is often called the likelihood, P(parameters) denotes the prior knowledge of the parameters, whereas P(parameters|data) is the posterior probability of the parameters given the data. If P(parameters) cannot be deduced by any objective means, a subjective belief of its value is used in Bayesian statistics. Since there is no fundamental rule describing how to deduce this prior probability, Bayesian statistics is still debated (also in high-energy physics)!
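A numerical illustration of the prior/likelihood/posterior structure, not from the notes; all numbers below are hypothetical and chosen only for the example:

```python
# Hypothetical scenario: a selection flags candidate "signal" events.
p_signal = 0.01                    # prior P(signal)
p_flag_given_signal = 0.95         # likelihood P(flag | signal)
p_flag_given_background = 0.05     # likelihood P(flag | background)

# Law of total probability for the evidence P(flag):
p_flag = (p_flag_given_signal * p_signal
          + p_flag_given_background * (1 - p_signal))

# Bayes' theorem: posterior P(signal | flag)
p_signal_given_flag = p_flag_given_signal * p_signal / p_flag
print(round(p_signal_given_flag, 3))  # ≈ 0.161
```

Even with a very efficient flag, the posterior stays small here because the prior for signal is small: exactly the kind of prior dependence the slide refers to.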

17 Definition of independence of events A and B: P(A|B) = P(A), i.e. any given information about B does not affect the probability of observing A. Physically this means that the events A and B are uncorrelated. For practical applications such independence can usually not be derived but rather has to be assumed, given the nature of the physical problem one intends to model. General multiplication rule for independent events A_1, A_2, …, A_n: P(A_1 ∩ A_2 ∩ … ∩ A_n) = P(A_1) P(A_2) … P(A_n).

18 Stochastic or random variable: a number which can be attached to all outcomes of an experiment. Example: throwing two dice, the sum of the number of spots. Mathematical terminology: a real-valued function defined over the elements of the sample space S of an experiment. A capital letter is often used to denote a random variable, for instance X. Simulation experiment: throwing two dice N times, recording the sum of spots each time and calculating the relative frequency of occurrence for each of the outcomes.
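The two-dice simulation described above can be sketched as follows (not part of the notes; seed and N are arbitrary choices):

```python
import random
from collections import Counter
from itertools import product

# Exact distribution of the sum, from the 36 equally likely outcomes
exact = Counter(a + b for a, b in product(range(1, 7), repeat=2))
exact_prob = {s: n / 36 for s, n in exact.items()}

# Simulated relative frequencies
rng = random.Random(1)
N = 100_000
counts = Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(N))

for s in range(2, 13):
    print(s, exact_prob[s], counts[s] / N)
```

Re-running with N = 10, 20, 100, … reproduces the sequence of histograms on the following slides, with the relative frequencies converging to the exact probabilities.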

19 N=10 Blue columns: observed rel. freq. Red columns: theoretically expected rel. freq.

20 N=20 Probability

21 N=100 Probability

22 N=1000 Probability

23 N=10000 Probability

24 N= Probability

25 N= Probability

26 N= Probability

27 The relative frequencies seem to converge towards the theoretically expected probabilities. Such a diagram is an expression of a probability distribution: a list of all different values of a random variable together with the associated probabilities. Mathematically: a function f(x) = P(X = x) defined for all possible values x of X (given by the experiment at hand). The values of X can be discrete (as in the previous example) or continuous. For continuous x, f(x) is called a probability density function. Simulation experiment: height of Norwegian men. Collecting data, calculating relative frequencies of occurrence in intervals of various widths.

28 interval width 10 cm

29 interval width 5 cm

30 interval width 1 cm

31 interval width 0.5 cm

32 interval width → 0: continuous probability distribution

33 Cumulative distribution function: F(a) = P(X ≤ a). For discrete random variables: F(a) = Σ_{x_i ≤ a} P(X = x_i) = Σ_{x_i ≤ a} f(x_i). For continuous random variables: F(a) = ∫_{−∞}^{a} f(x) dx.

34 It follows: P(a < X ≤ b) = F(b) − F(a). For continuous variables: P(a < X ≤ b) = ∫_a^b f(x) dx.

35 [Figure: the shaded area under the pdf between a and b is P(a < X < b)]

36 [Figure: the shaded area below b is P(X < b)]

37 [Figure: the shaded area above a is P(X > a)]

38 A function u(X) of a random variable X is also a random variable. The expectation value of such a function is E[u(X)] = ∫ u(x) f(x) dx. Two very important special cases are the mean, µ = E[X] = ∫ x f(x) dx, and the variance, σ² = Var(X) = E[(X − µ)²] = ∫ (x − µ)² f(x) dx.

39 The mean µ is the most important measure of the centre of the distribution of X. The variance σ², or its square root σ, the standard deviation, is the most important measure of the spread of the distribution of X around the mean. The mean is the first moment of X, whereas the variance is the second central moment of X. In general, the n-th moment of X is α_n = E[X^n] = ∫ x^n f(x) dx.

40 The n-th central moment is m_n = E[(X − α_1)^n] = ∫ (x − α_1)^n f(x) dx. Another measure of the centre of the distribution of X is the median, defined by F(x_med) = 1/2 or, in words, the value of X above which half of the probability lies and below which half lies.
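The mean, variance and median can be computed numerically for any pdf; a sketch, not from the notes, using an exponential pdf with rate λ = 2 as a test case (the integration limits and grid sizes are pragmatic choices, assumed adequate for this smooth, fast-decaying pdf):

```python
import math

lam = 2.0
f = lambda x: lam * math.exp(-lam * x)   # exponential pdf, test case

def integrate(g, a, b, n=100_000):
    # simple midpoint rule, good enough for a smooth pdf on [a, b]
    h = (b - a) / n
    return sum(g(a + (i + 0.5) * h) for i in range(n)) * h

mean = integrate(lambda x: x * f(x), 0, 50)              # first moment
var = integrate(lambda x: (x - mean) ** 2 * f(x), 0, 50) # 2nd central moment

# median: solve F(x_med) = 1/2 by bisection, with F obtained by integration
def F(a):
    return integrate(f, 0, a, 10_000)

lo, hi = 0.0, 10.0
for _ in range(50):
    mid = 0.5 * (lo + hi)
    if F(mid) < 0.5:
        lo = mid
    else:
        hi = mid
median = 0.5 * (lo + hi)
```

For this pdf the exact values are mean 1/λ = 0.5, variance 1/λ² = 0.25 and median ln 2/λ ≈ 0.347, which the numerical results reproduce.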

41 Assume now that X and Y are two random variables with a joint probability density function (pdf) f(x,y). The marginal pdf of X is f_1(x) = ∫ f(x,y) dy, whereas the marginal pdf of Y is f_2(y) = ∫ f(x,y) dx.

42 The mean values of X and Y are µ_X = ∫∫ x f(x,y) dx dy = ∫ x f_1(x) dx and µ_Y = ∫∫ y f(x,y) dx dy = ∫ y f_2(y) dy. The covariance of X and Y is cov[X,Y] = E[(X − µ_X)(Y − µ_Y)] = E[XY] − µ_X µ_Y.

43 If several random variables are considered simultaneously, one frequently arranges the variables in a stochastic or random vector X = (X_1, X_2, …, X_n)^T. The covariances are then naturally displayed in a covariance matrix cov(X), the n × n matrix whose (i,j) element is cov(X_i, X_j), and whose diagonal elements are the variances of the individual variables.

44 If two variables X and Y are independent, the joint pdf can be written f(x,y) = f_1(x) f_2(y). The covariance of X and Y vanishes in this case (why?), and the variances add: V(X+Y) = V(X) + V(Y). If X and Y are not independent, the general formula is V(X+Y) = V(X) + V(Y) + 2 Cov(X,Y). For n mutually independent random variables the covariance matrix becomes diagonal (i.e. all off-diagonal terms are identically zero).
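The additivity of variances for independent variables is easy to check by simulation; a sketch, not from the notes, with arbitrarily chosen distributions and seed:

```python
import random
import statistics

rng = random.Random(7)
N = 200_000
# Two independent random variables: a uniform and a Gaussian
x = [rng.uniform(0, 1) for _ in range(N)]
y = [rng.gauss(0, 2) for _ in range(N)]
s = [a + b for a, b in zip(x, y)]

vx, vy, vs = (statistics.variance(v) for v in (x, y, s))
# Close, since Cov(X, Y) ≈ 0 for independently generated samples:
print(vx + vy, vs)
```

For dependent variables the difference between the two printed numbers would instead estimate 2 Cov(X,Y).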

45 If a random vector Y = (Y_1, Y_2, …, Y_n)^T is related to a vector X (with pdf f(x)) by a function Y(X), the pdf of Y is g(y) = f(x(y)) |J|, where |J| is the absolute value of the determinant of the matrix J. This matrix is the so-called Jacobian of the transformation from Y to X, with elements J_ij = ∂x_i/∂y_j.

46 The transformation of the covariance matrix is cov(Y) = J^{-1} cov(X) (J^{-1})^T, where the inverse of J has elements (J^{-1})_ij = ∂y_i/∂x_j. The transformation from x to y must be one-to-one, such that the inverse functional relationship exists.

47 Obtaining cov(y from cov(x as in the previous slide is a very much used technique in high-energy physics data analysis. It is called linear error propagation and is applicable any time one wants to transform from one set of estimated parameters to another Transformation between different sets of parameters describing a reconstructed particle track Transport of track parameters from one location in a detector to another. Will see examples later in the course

48 The characteristic function φ(u) associated with the pdf f(x) is the Fourier transform of f(x): φ(u) = E[e^{iuX}] = ∫ e^{iux} f(x) dx. Such functions are useful in deriving results about moments of random variables. The relation between φ(u) and the moments of X is i^{-n} d^n φ/du^n |_{u=0} = ∫ x^n f(x) dx = α_n. If φ(u) is known, all moments of f(x) can be calculated without knowledge of f(x) itself.

49 Some common probability distributions: binomial distribution, Poisson distribution, Gaussian distribution, chi-square distribution, Student's t distribution, gamma distribution. We will take a closer look at some of them.

50 Binomial distribution: Assume that we make n identical experiments with only two possible outcomes: success or no success. The probability of success p is the same for all experiments, and the individual experiments are independent of each other. The probability of x successes out of n trials is then P(X = x) = C(n, x) p^x (1 − p)^{n−x}, where C(n, x) = n!/(x!(n − x)!) is the binomial coefficient. Example: throwing a die n times, defining the event of success to be the occurrence of six spots in a throw; probability p = 1/6.
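The binomial probabilities on the following slides can be computed directly from this formula; a sketch, not part of the notes:

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Five throws of a die, success = six spots (p = 1/6)
probs = [binom_pmf(x, 5, 1/6) for x in range(6)]
assert abs(sum(probs) - 1) < 1e-12      # the pmf sums to one
print(round(binom_pmf(0, 5, 1/6), 4))   # ≈ 0.4019
```

Evaluating this for n = 5, 15 and 50 throws reproduces the three histograms that follow.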

51 probability distribution for number of successes in 5 throws

52 probability distribution for number of successes in 15 throws

53 probability distribution for number of successes in 50 throws. Anything familiar about the shape of this distribution?

54 Mean value and variance: E(X) = np, Var(X) = np(1 − p). Five throws of a die: E(# six spots) = 5/6, Var(# six spots) = 25/36, Std(# six spots) = 5/6.

55 Poisson distribution: The average number of occurrences of an event A per given time (length, area, volume) interval is constant and equal to λ. The probability distribution of observing x occurrences in the interval is P(X = x) = e^{−λ} λ^x / x!. Both the mean value and the variance of X equal λ. Example: the number of particles in a beam passing through a given area in a given time is Poisson distributed. If the average number λ is known, the probabilities for all x can be calculated according to the formula above.
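A small check, not from the notes, that both the mean and the variance of the Poisson pmf equal λ (the value λ = 3 is an arbitrary example, and the sum is truncated where the tail is negligible):

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """P(X = x) occurrences in the interval, given mean rate lam."""
    return exp(-lam) * lam**x / factorial(x)

lam = 3.0   # hypothetical average number of particles per time window
mean = sum(x * poisson_pmf(x, lam) for x in range(100))
var = sum((x - lam) ** 2 * poisson_pmf(x, lam) for x in range(100))
print(mean, var)   # both equal lam
```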

56 Gaussian distribution: The most frequently occurring distribution in nature. Most measurement uncertainties, disturbances of the directions of charged particles when penetrating through (enough) matter, the number of ionizations created by a charged particle in a slab of material etc. follow a Gaussian distribution. Main reason: the CENTRAL LIMIT THEOREM, which states that a sum of n independent random variables converges to a Gaussian distribution when n is large enough, irrespective of the individual distributions of the variables. The abovementioned examples are typically of this type.

57 Gaussian probability density function with mean value µ and standard deviation σ: f(x; µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)). For a random vector X of size n with mean value µ and covariance matrix V the function is (multivariate Gaussian distribution): f(x; µ, V) = (2π)^{−n/2} det(V)^{−1/2} exp(−(1/2) (x − µ)^T V^{-1} (x − µ)).

58 Usual terminology: X ~ N(µ,σ): X is distributed according to a Gaussian (normal) with mean value µ and standard deviation σ. 68 % of the distribution lies within plus/minus one σ, 95 % within plus/minus two σ, and 99.7 % within plus/minus three σ. Standard normal variable Z ~ N(0,1): Z = (X − µ)/σ. Quantiles of the standard normal distribution: z_α is defined by P(Z < z_α) = 1 − α. The value z_α is denoted the 100·α % quantile of the standard normal distribution. Such quantiles can be found in tables or by computer programs.
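The coverage fractions and the quantiles can be computed with the standard normal CDF, which is available in the standard library via the error function; a sketch, not part of the notes (the bisection quantile solver is a simple stand-in for a statistics library's inverse CDF):

```python
from math import erf, sqrt

def std_normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

# coverage of plus/minus k sigma: ≈ 0.683, 0.954, 0.997
for k in (1, 2, 3):
    print(k, std_normal_cdf(k) - std_normal_cdf(-k))

def std_normal_quantile(alpha):
    """z_alpha such that P(Z < z_alpha) = 1 - alpha, found by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if std_normal_cdf(mid) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(round(std_normal_quantile(0.05), 2))   # 1.64
print(round(std_normal_quantile(0.025), 2))  # 1.96
```

These are exactly the 5 % and 2.5 % quantiles shown on the next slides.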

59 10 % quantile

60 5 % quantile (1.64)

61 95 % of the area lies within plus/minus the 2.5 % quantile (1.96)

62 χ² distribution: If {X_1, …, X_n} are independent, Gaussian random variables, then χ² = Σ_{i=1}^{n} (X_i − µ_i)²/σ_i² follows a χ² distribution with n degrees of freedom. Often used in evaluating the level of compatibility between observed data and the assumed pdf of the data. Example: is the position of a measurement in a particle detector compatible with the assumed distribution of the measurement? The mean value is n and the variance 2n.

63 chisquare distribution with 10 degrees of freedom

64 Statistics Statistics is about making inference about a statistical model, given a set of data or measurements: parameters of a distribution; parameters describing the kinematics of a particle after a collision (position and momentum at some reference surface); parameters describing an interaction vertex (position, refined estimates of particle momenta). Will consider two issues: parameter estimation, and hypothesis tests and confidence intervals.

65 Statistics Parameter estimation: We want to estimate the unknown value of a parameter θ. An estimator θ̂ is a function of the data which aims to estimate the value of θ as closely as possible. General estimator properties: consistency, bias, efficiency, robustness. A consistent estimator is an estimator which converges to the true value of θ when the amount of data increases (formally, in the limit of an infinite amount of data).

66 Statistics The bias b of an estimator is given as b = E[θ̂] − θ. Since the estimator is a function of the data, it is itself a random variable with its own distribution. The expectation value of θ̂ can be interpreted as the mean value of the estimate for a very large number of hypothetical, identical experiments. Obviously, unbiased (i.e. b = 0) estimators are desirable.

67 Statistics The efficiency of an estimator is the inverse of the ratio of its variance to the minimum possible value. The minimum possible value is given by the Rao-Cramér-Fréchet lower bound σ²_min = (1 + ∂b/∂θ)² / I(θ), where I(θ) is the Fisher information: I(θ) = E[ (Σ_i ∂ ln f(x_i; θ)/∂θ)² ].

68 Statistics The sum is over all the data, which are assumed independent and to follow the pdf f(x; θ). The expression for the lower bound is valid for all estimators with the same bias function b(θ) (for unbiased estimators b(θ) vanishes). If the variance of the estimator happens to be equal to the Rao-Cramér-Fréchet lower bound, it is called a minimum variance lower bound estimator or a (fully) efficient estimator. Different estimators of the same parameter can also be compared by looking at the ratios of their efficiencies; one then talks about relative efficiencies. Robustness is the (qualitative) degree of insensitivity of the estimator to deviations from the assumed pdf of the data: e.g. noise in the data not properly taken into account, wrong data, etc.

69 Statistics Common estimators for the mean and variance are (often called the sample mean and the sample variance): x̄ = (1/N) Σ_i x_i and s² = (1/(N−1)) Σ_i (x_i − x̄)². The variances of these are V(x̄) = σ²/N and V(s²) = (1/N)(m_4 − σ⁴ (N−3)/(N−1)), where m_4 is the fourth central moment.

70 Statistics For variables which obey the Gaussian distribution, this yields for large N: std(s) = σ/√(2N). For Gaussian variables the sample mean is a fully efficient estimator. If the different measurements used in the calculation of the sample mean have different variances, a better estimator of the mean is the weighted sample mean: x̄ = (Σ_i x_i/σ_i²) / (Σ_i 1/σ_i²).

71 Statistics The method of maximum likelihood: Assume that we have N independent measurements all obeying the pdf f(x; θ), where θ is a parameter vector consisting of n different parameters to be estimated. The maximum likelihood estimate is the value of the parameter vector θ which maximizes the likelihood function L(θ) = Π_{i=1}^{N} f(x_i; θ). Since the natural logarithm is a monotonically increasing function, ln L and L have their maximum at the same value of θ.

72 Statistics Therefore the maximum likelihood estimate can be found by solving the likelihood equations ∂ ln L/∂θ_i = 0 for all i = 1, …, n. ML estimators are asymptotically (i.e. for large amounts of data) unbiased and fully efficient, and are therefore very popular. An estimate of the inverse of the covariance matrix of an ML estimate is (V^{-1})_ij = −∂² ln L/∂θ_i ∂θ_j, evaluated at the estimated value of θ.

73 Statistics The method of least squares. Simplest possible example: estimating the parameters of a straight line (intercept and tangent of the inclination angle) given a set of measurements. [Figure: measurements and fitted line]

74 Statistics Least-squares approach: minimizing the sum of squared distances S between the line and the N measurements, S = Σ_{i=1}^{N} (y_i − (a x_i + b))²/σ_i², with respect to the parameters of the line (i.e. a and b), where σ_i² is the variance of the measurement error of measurement i. This cost function or objective function S can be written in a more compact way by using matrix notation: S = (y − Hθ)^T V^{-1} (y − Hθ).

75 Statistics Here y is the vector of measurements, θ is the vector of the parameters a and b, V is the (diagonal) covariance matrix of the measurements (with the individual variances on the main diagonal), and H is the N × 2 matrix whose i-th row is (1, x_i). Taking the derivative of S with respect to θ, setting it to zero and solving for θ yields the least-squares solution to the problem.

76 Statistics The result is: θ̂ = (H^T V^{-1} H)^{-1} H^T V^{-1} y. The covariance matrix of the estimated parameters is cov(θ̂) = (H^T V^{-1} H)^{-1}, and the covariance matrix of the estimated positions ŷ = Hθ̂ is cov(ŷ) = H (H^T V^{-1} H)^{-1} H^T.
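The closed-form solution above can be written out explicitly for the straight-line case; a pure-Python sketch, not from the notes (a real analysis would use a linear-algebra library, and the data points below are hypothetical, with θ ordered as (intercept b, slope a)):

```python
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]        # hypothetical measurements
sigmas = [0.2] * 5                     # their standard deviations

# Accumulate the 2x2 normal matrix A = H^T V^-1 H and the vector
# c = H^T V^-1 y, with rows of H equal to (1, x_i).
A = [[0.0, 0.0], [0.0, 0.0]]
c = [0.0, 0.0]
for x, y, s in zip(xs, ys, sigmas):
    w = 1.0 / s**2
    h = (1.0, x)
    for i in range(2):
        c[i] += w * h[i] * y
        for j in range(2):
            A[i][j] += w * h[i] * h[j]

# Invert the 2x2 matrix: cov(theta) = A^-1, theta = cov(theta) c
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
cov = [[ A[1][1] / det, -A[0][1] / det],
       [-A[1][0] / det,  A[0][0] / det]]
b = cov[0][0] * c[0] + cov[0][1] * c[1]   # intercept
a = cov[1][0] * c[0] + cov[1][1] * c[1]   # slope
print(round(a, 3), round(b, 3))           # slope ≈ 1.96, intercept ≈ 1.10
```

The diagonal of cov gives the variances of the fitted intercept and slope, i.e. exactly cov(θ̂) from the slide.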

77 Statistics Simulating lines Histogram of value of estimated intercept What is true value of intercept?

78 Statistics Simulating lines Histogram of value of tangent of angle of inclination What is true value?

79 Statistics Histograms of normalized residuals of estimated parameters. This means that for each fitted line and each estimated parameter, the quantity (estimated parameter − true parameter)/(standard deviation of parameter) is put into the histogram. If everything is OK with the fitting procedure, these histograms should have mean 0 and standard deviation 1. [Histograms: the fitted means are close to 0 and the fitted standard deviations close to 1, e.g. std = 1.0011]

80 Statistics Least-squares estimation is for instance used in track fitting in high-energy physics experiments. Track fitting is basically the same task as the line fit example: estimating a set of parameters describing a particle track through a tracking detector, given a set of measurements created by the particle. In the general case the track model is not a straight line but rather a helix (homogeneous magnetic field) or some other trajectory obeying the equations of motion in an inhomogeneous magnetic field. The principles of the fitting procedure, however, are largely the same.

81 Statistics As long as there is a linear relationship between the parameters and the measurements, the least-squares method is linear. If this relationship is a non-linear function F(θ), the problem is said to be of non-linear least-squares type: S = (y − F(θ))^T V^{-1} (y − F(θ)). There exists no direct solution to this problem, and one has to resort to an iterative approach (Gauss-Newton): Start out with an initial guess of θ, linearize the function F around the initial guess by a Taylor expansion and solve the resulting linear least-squares problem. Use the estimated value of θ as a new expansion point for F and repeat the step above. Iterate until convergence (i.e. until θ changes less than a specified value from one iteration to the next).

82 Statistics Relationship between maximum likelihood and least squares: Consider a set of independent measurements y_i with mean values F(x_i; θ). If these measurements follow a Gaussian distribution, the log-likelihood function is basically −2 ln L(θ) = Σ_{i=1}^{N} (y_i − F(x_i; θ))²/σ_i², plus some terms which do not depend on θ. Maximizing the log-likelihood function is in this case equivalent to minimizing the least-squares objective function.

83 Statistics Confidence intervals and hypothesis tests. Confidence intervals: Given a set of measurements of a parameter, calculate an interval that one can be e.g. 95 % sure that the true value of the parameter lies within. Such an interval is called a 95 % confidence interval for the parameter. Example: collect N measurements believed to come from a Gaussian distribution with unknown mean value µ and known standard deviation σ. Use the sample mean to calculate a 100(1−α) % confidence interval for µ. From earlier: the sample mean is an unbiased estimator of µ with standard deviation σ/√N. For large enough N, the quantity Z = (X̄ − µ)/(σ/√N) is distributed according to a standard normal distribution (mean value 0, standard deviation 1).

84 Statistics Therefore: P(−z_{α/2} < (X̄ − µ)/(σ/√N) < z_{α/2}) = 1 − α, which after rearrangement gives P(X̄ − z_{α/2} σ/√N < µ < X̄ + z_{α/2} σ/√N) = 1 − α. In words, there is a probability 1 − α that the true mean is in the interval [X̄ − z_{α/2} σ/√N, X̄ + z_{α/2} σ/√N]. This interval is therefore a 100(1−α) % confidence interval for µ. Such intervals are highly relevant in physics analysis.
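The interval is straightforward to evaluate for a concrete sample; a sketch, not from the notes, where the data, the known σ and the quantile value 1.96 (for α = 0.05) are illustrative assumptions:

```python
import math
import statistics

# Hypothetical sample, believed Gaussian with known sigma
data = [4.9, 5.3, 5.1, 4.7, 5.0, 5.2, 4.8, 5.4, 5.0, 4.6]
sigma = 0.25          # assumed known standard deviation
z = 1.96              # z_{alpha/2} for a 95 % interval

xbar = statistics.mean(data)
half = z * sigma / math.sqrt(len(data))   # z_{alpha/2} * sigma / sqrt(N)
print(f"95% CI for mu: [{xbar - half:.3f}, {xbar + half:.3f}]")
```

Note the frequentist reading: in repeated experiments, 95 % of intervals constructed this way would cover the true µ.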

85 Statistics Hypothesis tests: A hypothesis is a statement about the distribution of a vector x of data. Similar to the previous example: given N measurements, test whether the measurements come from a normal distribution with a certain expectation value µ or not. Define a test statistic, i.e. the quantity to be used in the evaluation of the hypothesis; here, the sample mean. Define the significance level of the test, i.e. the probability that the hypothesis will be discarded even though it is true. Determine the critical region of the test statistic, i.e. the interval(s) of values of the test statistic which will lead to rejection of the hypothesis.

86 Statistics We then state two competing hypotheses: a null hypothesis, stating that the expectation value is equal to a given value, and an alternative hypothesis, stating that the expectation value is not equal to the given value. Mathematically: H_0: µ = µ_0, H_1: µ ≠ µ_0. Test statistic: Z = (X̄ − µ_0)/(σ/√N).

87 Statistics Obtain a value of the test statistic from the test data by calculating the sample mean and transforming to Z. Use the actual value of Z to determine whether the null hypothesis is rejected or not. [Figure: standard normal pdf with shaded tails beyond ±z_{α/2}; the probability of being in the shaded area is α, so the shaded area is the critical region of Z for significance level α]

88 Statistics Alternatively: perform the test by calculating the so-called p-value of the test statistic. Given the actual value of the test statistic, what is the area below the pdf for the range of values of the test statistic starting from the actual one and extending to all values further away from the value defined by the null hypothesis? This area defines the p-value. For the current example this would correspond to adding two integrals of the pdf of the test statistic (because this is a so-called two-sided test): one from minus infinity to minus the absolute value of the actual value of the test statistic, and another from the absolute value of the actual value of the test statistic to plus infinity. For a one-sided test one would stick to one integral of the abovementioned type. If the p-value is less than the significance level, discard the null hypothesis; if not, don't discard it.
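The two-sided p-value described above can be sketched as follows (not from the notes; the observed value 2.0 is a hypothetical example):

```python
from math import erf, sqrt

def two_sided_p(z_obs):
    """p-value of a two-sided test with a standard normal test statistic."""
    phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF
    # area below -|z_obs| plus area above +|z_obs|
    return 2 * (1 - phi(abs(z_obs)))

# Hypothetical observed test statistic
print(round(two_sided_p(2.0), 4))   # 0.0455, < 0.05: reject at the 5 % level
```

For a one-sided test one would keep only one of the two tail integrals, i.e. drop the factor of 2.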

89 Statistics p-values can be used in so-called goodness-of-fit tests. In such tests one frequently uses a test statistic which is assumed to be chi-square distributed. Is a measurement in a tracking detector compatible with belonging to a particle track defined by a set of other measurements? Is a histogram with a set of entries in different bins compatible with an expected histogram (defined by an underlying assumption about the distribution)? Are the residual distributions of estimated parameters compatible with the estimated covariance matrix of the parameters? If one can calculate many independent values of the test statistic, the following procedure is often applied: calculate the p-value of the test statistic each time the test statistic is calculated.

90 Statistics The p-value itself is also a random variable, and it can be shown that it is distributed according to a uniform distribution if the test statistic originates from the expected (chi-square) distribution. Create a histogram with the various p-values as entries and see whether it looks reasonably flat. NB! With only one calculated p-value, the null hypothesis can be rejected but never confirmed! With many calculated p-values (as immediately above) the null hypothesis can also (to a certain extent) be confirmed! Example: line fit (as before). For each fitted line, calculate the following chi-square: χ² = (θ̂ − θ)^T cov(θ̂)^{-1} (θ̂ − θ).

91 Statistics Here θ is the true value of the parameter vector. For each value of the chi-square, calculate the corresponding p-value: the integral of the chi-square distribution from the value of the chi-square to infinity, given in tables or by standard computer programs (CERNLIB, CLHEP, MATLAB, …). Fill up a histogram with the p-values and make a plot: a reasonably flat histogram, which seems OK. What we really test here is that the estimated parameters are unbiased estimates of the true parameters, distributed according to a Gaussian with a covariance matrix as obtained in the estimate!


Physics 403. Segev BenZvi. Parameter Estimation, Correlations, and Error Bars. Department of Physics and Astronomy University of Rochester Physics 403 Parameter Estimation, Correlations, and Error Bars Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Best Estimates and Reliability

More information

Statistical Methods for Astronomy

Statistical Methods for Astronomy Statistical Methods for Astronomy Probability (Lecture 1) Statistics (Lecture 2) Why do we need statistics? Useful Statistics Definitions Error Analysis Probability distributions Error Propagation Binomial

More information

Topic 2: Probability & Distributions. Road Map Probability & Distributions. ECO220Y5Y: Quantitative Methods in Economics. Dr.

Topic 2: Probability & Distributions. Road Map Probability & Distributions. ECO220Y5Y: Quantitative Methods in Economics. Dr. Topic 2: Probability & Distributions ECO220Y5Y: Quantitative Methods in Economics Dr. Nick Zammit University of Toronto Department of Economics Room KN3272 n.zammit utoronto.ca November 21, 2017 Dr. Nick

More information

Statistische Methoden der Datenanalyse. Kapitel 1: Fundamentale Konzepte. Professor Markus Schumacher Freiburg / Sommersemester 2009

Statistische Methoden der Datenanalyse. Kapitel 1: Fundamentale Konzepte. Professor Markus Schumacher Freiburg / Sommersemester 2009 Prof. M. Schumacher Stat Meth. der Datenanalyse Kapi,1: Fundamentale Konzepten Uni. Freiburg / SoSe09 1 Statistische Methoden der Datenanalyse Kapitel 1: Fundamentale Konzepte Professor Markus Schumacher

More information

Statistical Methods in Particle Physics

Statistical Methods in Particle Physics Statistical Methods in Particle Physics Lecture 11 January 7, 2013 Silvia Masciocchi, GSI Darmstadt s.masciocchi@gsi.de Winter Semester 2012 / 13 Outline How to communicate the statistical uncertainty

More information

2. A Basic Statistical Toolbox

2. A Basic Statistical Toolbox . A Basic Statistical Toolbo Statistics is a mathematical science pertaining to the collection, analysis, interpretation, and presentation of data. Wikipedia definition Mathematical statistics: concerned

More information

Statistical Methods for Particle Physics Lecture 1: parameter estimation, statistical tests

Statistical Methods for Particle Physics Lecture 1: parameter estimation, statistical tests Statistical Methods for Particle Physics Lecture 1: parameter estimation, statistical tests http://benasque.org/2018tae/cgi-bin/talks/allprint.pl TAE 2018 Benasque, Spain 3-15 Sept 2018 Glen Cowan Physics

More information

Multivariate Statistics

Multivariate Statistics Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2016/2017 Master in Mathematical

More information

STATISTICS OF OBSERVATIONS & SAMPLING THEORY. Parent Distributions

STATISTICS OF OBSERVATIONS & SAMPLING THEORY. Parent Distributions ASTR 511/O Connell Lec 6 1 STATISTICS OF OBSERVATIONS & SAMPLING THEORY References: Bevington Data Reduction & Error Analysis for the Physical Sciences LLM: Appendix B Warning: the introductory literature

More information

Statistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That

Statistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That Statistics Lecture 2 August 7, 2000 Frank Porter Caltech The plan for these lectures: The Fundamentals; Point Estimation Maximum Likelihood, Least Squares and All That What is a Confidence Interval? Interval

More information

Physics 403. Segev BenZvi. Propagation of Uncertainties. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Propagation of Uncertainties. Department of Physics and Astronomy University of Rochester Physics 403 Propagation of Uncertainties Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Maximum Likelihood and Minimum Least Squares Uncertainty Intervals

More information

Statistical Methods in Particle Physics

Statistical Methods in Particle Physics Statistical Methods in Particle Physics Lecture 3 October 29, 2012 Silvia Masciocchi, GSI Darmstadt s.masciocchi@gsi.de Winter Semester 2012 / 13 Outline Reminder: Probability density function Cumulative

More information

Lecture 25: Review. Statistics 104. April 23, Colin Rundel

Lecture 25: Review. Statistics 104. April 23, Colin Rundel Lecture 25: Review Statistics 104 Colin Rundel April 23, 2012 Joint CDF F (x, y) = P [X x, Y y] = P [(X, Y ) lies south-west of the point (x, y)] Y (x,y) X Statistics 104 (Colin Rundel) Lecture 25 April

More information

Bivariate distributions

Bivariate distributions Bivariate distributions 3 th October 017 lecture based on Hogg Tanis Zimmerman: Probability and Statistical Inference (9th ed.) Bivariate Distributions of the Discrete Type The Correlation Coefficient

More information

Single Maths B: Introduction to Probability

Single Maths B: Introduction to Probability Single Maths B: Introduction to Probability Overview Lecturer Email Office Homework Webpage Dr Jonathan Cumming j.a.cumming@durham.ac.uk CM233 None! http://maths.dur.ac.uk/stats/people/jac/singleb/ 1 Introduction

More information

Recitation 2: Probability

Recitation 2: Probability Recitation 2: Probability Colin White, Kenny Marino January 23, 2018 Outline Facts about sets Definitions and facts about probability Random Variables and Joint Distributions Characteristics of distributions

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2008 Prof. Gesine Reinert 1 Data x = x 1, x 2,..., x n, realisations of random variables X 1, X 2,..., X n with distribution (model)

More information

L2: Review of probability and statistics

L2: Review of probability and statistics Probability L2: Review of probability and statistics Definition of probability Axioms and properties Conditional probability Bayes theorem Random variables Definition of a random variable Cumulative distribution

More information

Space Telescope Science Institute statistics mini-course. October Inference I: Estimation, Confidence Intervals, and Tests of Hypotheses

Space Telescope Science Institute statistics mini-course. October Inference I: Estimation, Confidence Intervals, and Tests of Hypotheses Space Telescope Science Institute statistics mini-course October 2011 Inference I: Estimation, Confidence Intervals, and Tests of Hypotheses James L Rosenberger Acknowledgements: Donald Richards, William

More information

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1

Lecture 5. G. Cowan Lectures on Statistical Data Analysis Lecture 5 page 1 Lecture 5 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn

Parameter estimation and forecasting. Cristiano Porciani AIfA, Uni-Bonn Parameter estimation and forecasting Cristiano Porciani AIfA, Uni-Bonn Questions? C. Porciani Estimation & forecasting 2 Temperature fluctuations Variance at multipole l (angle ~180o/l) C. Porciani Estimation

More information

Statistical techniques for data analysis in Cosmology

Statistical techniques for data analysis in Cosmology Statistical techniques for data analysis in Cosmology arxiv:0712.3028; arxiv:0911.3105 Numerical recipes (the bible ) Licia Verde ICREA & ICC UB-IEEC http://icc.ub.edu/~liciaverde outline Lecture 1: Introduction

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Statistics, Data Analysis, and Simulation SS 2015

Statistics, Data Analysis, and Simulation SS 2015 Statistics, Data Analysis, and Simulation SS 2015 08.128.730 Statistik, Datenanalyse und Simulation Dr. Michael O. Distler Mainz, 27. April 2015 Dr. Michael O. Distler

More information

Advanced Herd Management Probabilities and distributions

Advanced Herd Management Probabilities and distributions Advanced Herd Management Probabilities and distributions Anders Ringgaard Kristensen Slide 1 Outline Probabilities Conditional probabilities Bayes theorem Distributions Discrete Continuous Distribution

More information

Statistical Methods in Particle Physics Lecture 1: Bayesian methods

Statistical Methods in Particle Physics Lecture 1: Bayesian methods Statistical Methods in Particle Physics Lecture 1: Bayesian methods SUSSP65 St Andrews 16 29 August 2009 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk www.pp.rhul.ac.uk/~cowan

More information

Statistical Methods for Particle Physics (I)

Statistical Methods for Particle Physics (I) Statistical Methods for Particle Physics (I) https://agenda.infn.it/conferencedisplay.py?confid=14407 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk www.pp.rhul.ac.uk/~cowan

More information

Review of Statistics

Review of Statistics Review of Statistics Topics Descriptive Statistics Mean, Variance Probability Union event, joint event Random Variables Discrete and Continuous Distributions, Moments Two Random Variables Covariance and

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Probability Density Functions

Probability Density Functions Statistical Methods in Particle Physics / WS 13 Lecture II Probability Density Functions Niklaus Berger Physics Institute, University of Heidelberg Recap of Lecture I: Kolmogorov Axioms Ingredients: Set

More information

Probability and Distributions

Probability and Distributions Probability and Distributions What is a statistical model? A statistical model is a set of assumptions by which the hypothetical population distribution of data is inferred. It is typically postulated

More information

Practical Statistics

Practical Statistics Practical Statistics Lecture 1 (Nov. 9): - Correlation - Hypothesis Testing Lecture 2 (Nov. 16): - Error Estimation - Bayesian Analysis - Rejecting Outliers Lecture 3 (Nov. 18) - Monte Carlo Modeling -

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn!

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Questions?! C. Porciani! Estimation & forecasting! 2! Cosmological parameters! A branch of modern cosmological research focuses

More information

Basics on Probability. Jingrui He 09/11/2007

Basics on Probability. Jingrui He 09/11/2007 Basics on Probability Jingrui He 09/11/2007 Coin Flips You flip a coin Head with probability 0.5 You flip 100 coins How many heads would you expect Coin Flips cont. You flip a coin Head with probability

More information

32. STATISTICS. 32. Statistics 1

32. STATISTICS. 32. Statistics 1 32. STATISTICS 32. Statistics 1 Revised April 1998 by F. James (CERN); February 2000 by R. Cousins (UCLA); October 2001, October 2003, and August 2005 by G. Cowan (RHUL). This chapter gives an overview

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Why study probability? Set theory. ECE 6010 Lecture 1 Introduction; Review of Random Variables

Why study probability? Set theory. ECE 6010 Lecture 1 Introduction; Review of Random Variables ECE 6010 Lecture 1 Introduction; Review of Random Variables Readings from G&S: Chapter 1. Section 2.1, Section 2.3, Section 2.4, Section 3.1, Section 3.2, Section 3.5, Section 4.1, Section 4.2, Section

More information

Human-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015

Human-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015 Probability Refresher Kai Arras, University of Freiburg Winter term 2014/2015 Probability Refresher Introduction to Probability Random variables Joint distribution Marginalization Conditional probability

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Exercises and Answers to Chapter 1

Exercises and Answers to Chapter 1 Exercises and Answers to Chapter The continuous type of random variable X has the following density function: a x, if < x < a, f (x), otherwise. Answer the following questions. () Find a. () Obtain mean

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course.

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course. Name of the course Statistical methods and data analysis Audience The course is intended for students of the first or second year of the Graduate School in Materials Engineering. The aim of the course

More information

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis

Lecture 3. G. Cowan. Lecture 3 page 1. Lectures on Statistical Data Analysis Lecture 3 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

FYST17 Lecture 8 Statistics and hypothesis testing. Thanks to T. Petersen, S. Maschiocci, G. Cowan, L. Lyons

FYST17 Lecture 8 Statistics and hypothesis testing. Thanks to T. Petersen, S. Maschiocci, G. Cowan, L. Lyons FYST17 Lecture 8 Statistics and hypothesis testing Thanks to T. Petersen, S. Maschiocci, G. Cowan, L. Lyons 1 Plan for today: Introduction to concepts The Gaussian distribution Likelihood functions Hypothesis

More information

Statistics for the LHC Lecture 1: Introduction

Statistics for the LHC Lecture 1: Introduction Statistics for the LHC Lecture 1: Introduction Academic Training Lectures CERN, 14 17 June, 2010 indico.cern.ch/conferencedisplay.py?confid=77830 Glen Cowan Physics Department Royal Holloway, University

More information

Class 26: review for final exam 18.05, Spring 2014

Class 26: review for final exam 18.05, Spring 2014 Probability Class 26: review for final eam 8.05, Spring 204 Counting Sets Inclusion-eclusion principle Rule of product (multiplication rule) Permutation and combinations Basics Outcome, sample space, event

More information

Review of Probabilities and Basic Statistics

Review of Probabilities and Basic Statistics Alex Smola Barnabas Poczos TA: Ina Fiterau 4 th year PhD student MLD Review of Probabilities and Basic Statistics 10-701 Recitations 1/25/2013 Recitation 1: Statistics Intro 1 Overview Introduction to

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Probability. Table of contents

Probability. Table of contents Probability Table of contents 1. Important definitions 2. Distributions 3. Discrete distributions 4. Continuous distributions 5. The Normal distribution 6. Multivariate random variables 7. Other continuous

More information

Chapter 1 Statistical Reasoning Why statistics? Section 1.1 Basics of Probability Theory

Chapter 1 Statistical Reasoning Why statistics? Section 1.1 Basics of Probability Theory Chapter 1 Statistical Reasoning Why statistics? Uncertainty of nature (weather, earth movement, etc. ) Uncertainty in observation/sampling/measurement Variability of human operation/error imperfection

More information

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:

More information

RWTH Aachen Graduiertenkolleg

RWTH Aachen Graduiertenkolleg RWTH Aachen Graduiertenkolleg 9-13 February, 2009 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk www.pp.rhul.ac.uk/~cowan Course web page: www.pp.rhul.ac.uk/~cowan/stat_aachen.html

More information

Parameter Estimation and Fitting to Data

Parameter Estimation and Fitting to Data Parameter Estimation and Fitting to Data Parameter estimation Maximum likelihood Least squares Goodness-of-fit Examples Elton S. Smith, Jefferson Lab 1 Parameter estimation Properties of estimators 3 An

More information

32. STATISTICS. 32. Statistics 1

32. STATISTICS. 32. Statistics 1 32. STATISTICS 32. Statistics 1 Revised September 2007 by G. Cowan (RHUL). This chapter gives an overview of statistical methods used in High Energy Physics. In statistics, we are interested in using a

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables

UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables To be provided to students with STAT2201 or CIVIL-2530 (Probability and Statistics) Exam Main exam date: Tuesday, 20 June 1

More information

Statistical Data Analysis 2017/18

Statistical Data Analysis 2017/18 Statistical Data Analysis 2017/18 London Postgraduate Lectures on Particle Physics; University of London MSci course PH4515 Glen Cowan Physics Department Royal Holloway, University of London g.cowan@rhul.ac.uk

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Probability theory basics

Probability theory basics Probability theory basics Michael Franke Basics of probability theory: axiomatic definition, interpretation, joint distributions, marginalization, conditional probability & Bayes rule. Random variables:

More information

Intro to Probability. Andrei Barbu

Intro to Probability. Andrei Barbu Intro to Probability Andrei Barbu Some problems Some problems A means to capture uncertainty Some problems A means to capture uncertainty You have data from two sources, are they different? Some problems

More information

Physics 6720 Introduction to Statistics April 4, 2017

Physics 6720 Introduction to Statistics April 4, 2017 Physics 6720 Introduction to Statistics April 4, 2017 1 Statistics of Counting Often an experiment yields a result that can be classified according to a set of discrete events, giving rise to an integer

More information

E. Santovetti lesson 4 Maximum likelihood Interval estimation

E. Santovetti lesson 4 Maximum likelihood Interval estimation E. Santovetti lesson 4 Maximum likelihood Interval estimation 1 Extended Maximum Likelihood Sometimes the number of total events measurements of the experiment n is not fixed, but, for example, is a Poisson

More information

Sample Spaces, Random Variables

Sample Spaces, Random Variables Sample Spaces, Random Variables Moulinath Banerjee University of Michigan August 3, 22 Probabilities In talking about probabilities, the fundamental object is Ω, the sample space. (elements) in Ω are denoted

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Lecture 2. G. Cowan Lectures on Statistical Data Analysis Lecture 2 page 1

Lecture 2. G. Cowan Lectures on Statistical Data Analysis Lecture 2 page 1 Lecture 2 1 Probability (90 min.) Definition, Bayes theorem, probability densities and their properties, catalogue of pdfs, Monte Carlo 2 Statistical tests (90 min.) general concepts, test statistics,

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory

Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory Statistical Inference Parametric Inference Maximum Likelihood Inference Exponential Families Expectation Maximization (EM) Bayesian Inference Statistical Decison Theory IP, José Bioucas Dias, IST, 2007

More information

CME 106: Review Probability theory

CME 106: Review Probability theory : Probability theory Sven Schmit April 3, 2015 1 Overview In the first half of the course, we covered topics from probability theory. The difference between statistics and probability theory is the following:

More information

This does not cover everything on the final. Look at the posted practice problems for other topics.

This does not cover everything on the final. Look at the posted practice problems for other topics. Class 7: Review Problems for Final Exam 8.5 Spring 7 This does not cover everything on the final. Look at the posted practice problems for other topics. To save time in class: set up, but do not carry

More information

Statistics and data analyses

Statistics and data analyses Statistics and data analyses Designing experiments Measuring time Instrumental quality Precision Standard deviation depends on Number of measurements Detection quality Systematics and methology σ tot =

More information

32. STATISTICS. 32. Statistics 1

32. STATISTICS. 32. Statistics 1 32. STATISTICS 32. Statistics 1 Revised September 2009 by G. Cowan (RHUL). This chapter gives an overview of statistical methods used in high-energy physics. In statistics, we are interested in using a

More information

Lectures on Statistics. William G. Faris

Lectures on Statistics. William G. Faris Lectures on Statistics William G. Faris December 1, 2003 ii Contents 1 Expectation 1 1.1 Random variables and expectation................. 1 1.2 The sample mean........................... 3 1.3 The sample

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables This Version: July 30, 2015 Multiple Random Variables 2 Now we consider models with more than one r.v. These are called multivariate models For instance: height and weight An

More information

Statistics notes. A clear statistical framework formulates the logic of what we are doing and why. It allows us to make precise statements.

Statistics notes. A clear statistical framework formulates the logic of what we are doing and why. It allows us to make precise statements. Statistics notes Introductory comments These notes provide a summary or cheat sheet covering some basic statistical recipes and methods. These will be discussed in more detail in the lectures! What is

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Mathematical statistics

Mathematical statistics October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter

More information

Math 416 Lecture 2 DEFINITION. Here are the multivariate versions: X, Y, Z iff P(X = x, Y = y, Z =z) = p(x, y, z) of X, Y, Z iff for all sets A, B, C,

Math 416 Lecture 2 DEFINITION. Here are the multivariate versions: X, Y, Z iff P(X = x, Y = y, Z =z) = p(x, y, z) of X, Y, Z iff for all sets A, B, C, Math 416 Lecture 2 DEFINITION. Here are the multivariate versions: PMF case: p(x, y, z) is the joint Probability Mass Function of X, Y, Z iff P(X = x, Y = y, Z =z) = p(x, y, z) PDF case: f(x, y, z) is

More information

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 8. For any two events E and F, P (E) = P (E F ) + P (E F c ). Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 Sample space. A sample space consists of a underlying

More information

p(z)

p(z) Chapter Statistics. Introduction This lecture is a quick review of basic statistical concepts; probabilities, mean, variance, covariance, correlation, linear regression, probability density functions and

More information