Statistical foundations of machine learning


1 Statistical foundations of machine learning INFO-F-422 Gianluca Bontempi Machine Learning Group Computer Science Department mlg.ulb.ac.be

2 Random experiment We define a random experiment as any action or process which generates results or observations that cannot be predicted with certainty. Examples: tossing a coin, rolling dice, measuring the commute time to go back home.

3 Probability space A random experiment is characterized by a sample space Ω, the set of all possible outcomes ω of the experiment. This space can be either finite or infinite. For example, in the die experiment Ω = {ω_1, ω_2, ..., ω_6}, and in the commute time example Ω = {ω_LOW, ω_MEDIUM, ω_HIGH}. The elements of the set Ω are called experimental outcomes. The outcome of an experiment need not be a number; for example, the outcome when a coin is tossed can be heads or tails. A subset of experimental outcomes is called an event. Examples of events are the set of even values E = {ω_2, ω_4, ω_6} or the set of non-high times E = {ω_LOW, ω_MEDIUM}. A single execution of a random experiment is called a trial. At each trial we observe one outcome ω_i. We say that an event occurred during this trial if it contains the element ω_i. For example, in the die experiment, if we observe the outcome ω_4, the event "even" took place.

4 Events and set theory Since events E are subsets, we can apply to them the terminology of set theory: E^c = {ω ∈ Ω : ω ∉ E} denotes the complement of E. E_1 ∪ E_2 = {ω ∈ Ω : ω ∈ E_1 OR ω ∈ E_2} refers to the event that occurs when E_1 or E_2 or both occur. E_1 ∩ E_2 = {ω ∈ Ω : ω ∈ E_1 AND ω ∈ E_2} refers to the event that occurs when both E_1 and E_2 occur. Two events E_1 and E_2 are mutually exclusive or disjoint if E_1 ∩ E_2 = ∅, that is, each time that E_1 occurs, E_2 does not occur. A partition of Ω is a set of disjoint sets E_j, j = 1,...,n such that ∪_{j=1}^n E_j = Ω.

5 Class of events The class {E} of events is not an arbitrary collection of subsets of Ω. We require that if E_1 and E_2 are events, then the intersection E_1 ∩ E_2 and the union E_1 ∪ E_2 are events too. We do so because we will want to know not only the probabilities of various events, but also the probabilities of their unions and intersections. In the following we will consider only classes of events that, in mathematical terms, form a Borel field.

6 Combined experiments Note that a sample space is not necessarily univariate. The most interesting uses of probability concern, however, combined random experiments whose sample space Ω = Ω_1 × Ω_2 × ... × Ω_n is the Cartesian product of several spaces Ω_i, i = 1,...,n. For instance, if we want to study the probabilistic dependence between the height and the weight of a child, we have to define a joint sample space Ω = {(w,h) : w ∈ Ω_w, h ∈ Ω_h} made of all pairs (w,h), where Ω_w is the sample space of the random experiment describing the weight and Ω_h is the sample space of the random experiment describing the height.

7 Axiomatic definition of probability Let Ω be the certain event that occurs in every trial, and let E_1 + E_2 denote the event that occurs when E_1 or E_2 or both occur. The axiomatic approach to probability consists in assigning to each event E a number Prob{E}, which is called the probability of the event E. This number is chosen so as to satisfy the following three conditions:
1. Prob{E} ≥ 0 for any E.
2. Prob{Ω} = 1.
3. Prob{E_1 + E_2} = Prob{E_1} + Prob{E_2} if Prob{E_1 ∩ E_2} = 0, that is, if E_1 and E_2 are mutually exclusive (or disjoint).
These conditions are the axioms of the theory of probability (Kolmogorov, 1933). For a discrete sample space it follows that Prob{E} = Σ_{ω ∈ E} Prob{ω}.

8 Axiomatic definition of probability All probabilistic conclusions are based, directly or indirectly, on the axioms and only the axioms. But how do we define these probability numbers?

9 Symmetrical definition of probability Consider a random experiment where the sample space is made of N symmetric outcomes, i.e. we have no reason to expect or prefer one over the others. Let N_E be the number of outcomes which are favorable to the event E (i.e. if they occur then the event E takes place). Then, according to the principle of indifference (a term popularized by J.M. Keynes in 1921), we have Prob{E} = N_E / N. Note that this number is determined without any experimentation and is based on symmetry assumptions. But what happens if symmetry does not hold?

10 Frequentist definition of probability Let us consider a random experiment and an event E. Let us repeat the experiment N times and count the number of times N_E that the event E occurs. The quantity N_E / N is the relative frequency of E. It can be empirically observed that the frequency converges to a fixed value for increasing N when the experiment is run a large number of times under exactly the same conditions and in such a way that the repetitions of the experiment are independent of each other. This observation led von Mises to use the notion of frequency as a foundation for the notion of probability.

11 Frequentist definition of probability Definition (von Mises) The probability Prob{E} of an event E is the limit Prob{E} = lim_{N→∞} N_E / N, where N is the number of observations (trials) and N_E is the number of times that E occurred. This definition appears reasonable and is compatible with the axioms. According to this approach, the probability is not a property of a specific observation, but rather a property of the entire set of observations. In practice, in any physical experiment N is finite and the limit has to be accepted as a hypothesis, not as a number that can be determined experimentally.

12 Weak law of Large Numbers A link between the axiomatic and the frequentist approach is provided by the weak law of Large Numbers. Theorem (Bernoulli) For any ε > 0, Prob{|N_E/N − p| ≤ ε} → 1 as N → ∞. In other words, the ratio N_E/N is close to p in the sense that, for any ε > 0, the probability that |N_E/N − p| ≤ ε tends to 1 as N → ∞. This is a law about long-run behavior, not about a single (or the next) experiment. In other terms, the set of outcomes for which the sequence N_E/N does not converge to p is negligibly small. The weak law essentially states that for any nonzero margin specified, no matter how small, with a sufficiently large sample there will be a very high probability that the frequency will be close to the probability, that is, within the margin.

13 Law of Large Numbers and simulation The law of large numbers is also the mathematical basis for the widespread application of computer simulation to solve practical probability problems. In simulation, the (unknown) probability of a given event in a chance experiment is estimated by the relative frequency of occurrence of the event in a large number of computer simulations of the experiment. The application of simulation is based on these elementary principles of probability. Stochastic simulation (also known as Monte Carlo) is a powerful tool with which extremely complicated probability problems can be solved.
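
As a minimal illustration (my own sketch, not one of the course scripts), the following R code estimates by relative frequency the probability that the sum of two fair dice equals 7; the exact value is 6/36 ≈ 0.167.

    # Monte Carlo estimate of Prob{sum of two fair dice = 7}
    set.seed(0)
    N <- 1e5                                # number of simulated trials
    die1 <- sample(1:6, N, replace = TRUE)  # first die
    die2 <- sample(1:6, N, replace = TRUE)  # second die
    N.E <- sum(die1 + die2 == 7)            # number of trials favorable to E
    N.E / N                                 # relative frequency, close to 6/36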

14 Gambler's fallacy Note that according to the law of large numbers N_E/N → p for N → ∞, but NOT N_E → Np for N → ∞. In fact, Prob{N_E = Np} ≈ 1/√(2πNp(1−p)) → 0 as N → ∞. In a fair coin-tossing game the law of large numbers does not imply that the absolute difference between the number of heads and tails should oscillate close to zero. On the contrary, it can be shown that the absolute difference keeps growing proportionally to √N (and thus more slowly than the number of tosses). The illusion that after a long run of "heads" it is more probable to have a "tail" is known as the gambler's fallacy. If the gambler's fallacy were true, this would mean that coins have memory!

15 Probability and frequency [Figure: coin-tossing simulation over 6 × 10^5 trials. Left panel: relative frequency of heads vs. number of trials; right panel: absolute difference between the number of heads and tails vs. number of trials.]
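
A sketch of the kind of simulation behind this figure (assuming a fair coin and 6 × 10^5 tosses; the exact settings of the original plot are not given):

    # Toss a fair coin N times; track relative frequency and |#heads - #tails|
    set.seed(1)
    N <- 6e5
    tosses <- sample(0:1, N, replace = TRUE)  # 1 = head, 0 = tail
    heads <- cumsum(tosses)                   # running number of heads
    n <- 1:N
    par(mfrow = c(1, 2))
    plot(n, heads / n, type = "l", xlab = "Number of trials",
         ylab = "Relative frequency")
    abline(h = 0.5, lty = 2)                  # frequency converges to p = 0.5
    plot(n, abs(2 * heads - n), type = "l", xlab = "Number of trials",
         ylab = "Absolute difference (no. heads and tails)")
    # the absolute difference keeps growing, roughly like sqrt(N)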

16 Some probabilistic notions Definition (Independent events) Two events E_1 and E_2 are independent if Prob{E_1 ∩ E_2} = Prob{E_1} Prob{E_2}, and we write E_1 ⊥ E_2. Examples: event E_1: your professor is Italian, event E_2: bad weather in Brussels; event E_1: commute time ≤ 10 minutes, event E_2: bad weather in Buenos Aires.

17 Some probabilistic notions Definition (Conditional probability) If Prob{E_1} > 0, then the conditional probability of E_2 given E_1 is Prob{E_2 | E_1} = Prob{E_1 ∩ E_2} / Prob{E_1}. Example: event E_2: bad weather in Brussels; event E_1: commute time smaller than average.

18 Exercises 1. Let E_1 and E_2 be two disjoint events with positive probability. Can they be independent? 2. Suppose that a fair die is rolled and that the number x appears. Let E_1 be the event that the number x is even, E_2 the event that the number x is greater than or equal to 3, and E_3 the event that the number x is a 4, 5 or 6. Are the events E_1 and E_2 independent? Are the events E_1 and E_3 independent?

19 Warnings For any fixed E_1, the quantity Prob{· | E_1} satisfies the axioms of probability. For instance, if E_2, E_3 and E_4 are disjoint events, we have Prob{E_2 ∪ E_3 ∪ E_4 | E_1} = Prob{E_2 | E_1} + Prob{E_3 | E_1} + Prob{E_4 | E_1}. However, this does NOT generally hold for Prob{E_1 | ·}, that is, when we fix the term E_1 on the left of the conditional bar: for two disjoint events E_2 and E_3, in general Prob{E_1 | E_2 ∪ E_3} ≠ Prob{E_1 | E_2} + Prob{E_1 | E_3}. Also, it is generally NOT the case that Prob{E_2 | E_1} = Prob{E_1 | E_2}.

20 Warnings The following properties hold: if E_1 ⊂ E_2, then Prob{E_2 | E_1} = Prob{E_1 ∩ E_2} / Prob{E_1} = Prob{E_1} / Prob{E_1} = 1 (since E_1 ∩ E_2 = E_1), and Prob{E_1 | E_2} = Prob{E_1} / Prob{E_2} ≥ Prob{E_1}. Examples: event E_1: your professor is Italian, event E_2: your professor is European; event E_1: commute time ≤ 10 minutes, event E_2: commute time ≤ 60 minutes.

21 Bayes theorem Let us consider a set of mutually exclusive and exhaustive events E_1, E_2, ..., E_k, i.e. they form a partition of Ω. Theorem (Law of total probability) Let Prob{E_i}, i = 1,...,k denote the probabilities of the events E_i, and Prob{E | E_i}, i = 1,...,k the conditional probabilities of a generic event E given that E_i has occurred. It can be shown that Prob{E} = Σ_{i=1}^k Prob{E | E_i} Prob{E_i}. Example: how much time will it take tomorrow to go back home by car, given the weather forecast? Event E: tomorrow's commute time by car is smaller than average; event E_1: nice weather in Brussels; event E_2: average weather in Brussels; event E_3: bad (as usual) weather in Brussels.

22 Bayes theorem Theorem (Bayes theorem) The conditional ("inverse") probability of any E_i, i = 1,...,k, given that E has occurred, is Prob{E_i | E} = Prob{E | E_i} Prob{E_i} / Σ_{j=1}^k Prob{E | E_j} Prob{E_j} = Prob{E | E_i} Prob{E_i} / Prob{E}, i = 1,...,k. Example: how probable was the bad weather last Wednesday, given that it took a long time to go back home by car? Event E: commute time by car longer than average; event E_1: nice weather in Brussels; event E_2: bad weather in Brussels.
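
To make the example concrete, here is a small R sketch; the prior and likelihood values below are illustrative assumptions, not numbers from the course.

    # Law of total probability and Bayes theorem for the weather example
    prior <- c(nice = 0.2, bad = 0.8)  # assumed Prob{E_i}: weather in Brussels
    lik   <- c(nice = 0.1, bad = 0.6)  # assumed Prob{E|E_i}: long commute given weather
    prob.E <- sum(lik * prior)         # total probability: Prob{E} = 0.5
    lik * prior / prob.E               # posteriors Prob{E_i|E}: nice 0.04, bad 0.96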

23 Transitivity in logic and probability Let us consider three boolean spaces Ω_i = {TRUE, FALSE} and three events E_1, E_2, E_3. From logic we know that if E_1 ⇒ E_2 and E_2 ⇒ E_3, then E_1 ⇒ E_3. Does this hold in probability too? In probabilistic terms we can rewrite the logical implications as Prob{E_2 = T | E_1 = T} = 1 and Prob{E_3 = T | E_2 = T} = 1. Then

Prob{E_3 = T | E_1 = T}
  = Σ_i Prob{E_3 = T | E_2 = i, E_1 = T} Prob{E_2 = i | E_1 = T}
  = Prob{E_3 = T | E_2 = F, E_1 = T} Prob{E_2 = F | E_1 = T}   (second factor is 0)
  + Prob{E_3 = T | E_2 = T, E_1 = T} Prob{E_2 = T | E_1 = T} = 1

so transitivity holds in probability as well.

24 Inverse modus ponens in logic and probability According to logic, if E_1 ⇒ E_2 then ¬E_2 ⇒ ¬E_1. Does this hold in probability too? In probabilistic terms we can rewrite the logical implication as Prob{E_2 = T | E_1 = T} = 1. It follows that

Prob{E_1 = F | E_2 = F} = 1 − Prob{E_1 = T | E_2 = F}
  = 1 − Prob{E_2 = F | E_1 = T} Prob{E_1 = T} / Prob{E_2 = F} = 1

since Prob{E_2 = F | E_1 = T} = 0. In other terms, deductive logic rules can be seen as limiting cases of probabilistic reasoning.

25 Medical study Let us consider a medical study about the relationship between the outcome of a medical test and the presence of a disease. We model this study as the combination of two random experiments: 1. the random experiment which models the state of the patient; its sample space is Ω_s = {H, S}, where H and S stand for healthy and sick patient, respectively; 2. the random experiment which models the outcome of the medical test; its sample space is Ω_o = {+, −}, where + and − stand for a positive and a negative outcome of the test, respectively. Suppose that out of 1000 patients the counts are arranged in the joint table

           E_s = S   E_s = H
E_o = +       ·         ·
E_o = −       ·         ·

What is the probability of having a positive (negative) test outcome when the patient is sick (healthy)? What is the probability of being in front of a sick (healthy) patient when a positive (negative) outcome is obtained?

26 Medical study (II) From the definition of conditional probability we derive Prob{E_o = + | E_s = S} = Prob{E_o = +, E_s = S} / Prob{E_s = S} and Prob{E_o = − | E_s = H} = Prob{E_o = −, E_s = H} / Prob{E_s = H} = 0.9. According to these figures, the test appears to be accurate. Do we have to expect a high probability of being sick when the test is positive? The answer is NO, as shown by Prob{E_s = S | E_o = +} = Prob{E_o = +, E_s = S} / Prob{E_o = +}. This example shows that humans sometimes tend to confound Prob{E_s | E_o} with Prob{E_o | E_s}, and that the most intuitive response is not always the right one.
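
Since the original counts are not reproduced above, the R sketch below uses an invented contingency table with the same character (sensitivity and specificity 0.9, 5% prevalence) to show how the computation goes:

    # Hypothetical counts over 1000 patients (rows: test outcome, columns: state)
    counts <- matrix(c(45, 95,      # E_o = + : sick, healthy
                        5, 855),    # E_o = - : sick, healthy
                     nrow = 2, byrow = TRUE,
                     dimnames = list(c("+", "-"), c("S", "H")))
    P <- counts / sum(counts)       # joint probabilities
    P["+", "S"] / sum(P[, "S"])     # Prob{+|S} = 0.9 (sensitivity)
    P["-", "H"] / sum(P[, "H"])     # Prob{-|H} = 0.9 (specificity)
    P["+", "S"] / sum(P["+", ])     # Prob{S|+} ~ 0.32: low despite an accurate test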

27 Array of joint/marginal probabilities Let us consider the combination of two random experiments whose sample spaces are Ω_A = {A_1, ..., A_n} and Ω_B = {B_1, ..., B_m} respectively. Assume that for each pair of events (A_i, B_j), i = 1,...,n, j = 1,...,m we know the joint probability value Prob{A_i, B_j}.

             B_1             B_2             ...   B_m             Marginal
A_1          Prob{A_1,B_1}   Prob{A_1,B_2}   ...   Prob{A_1,B_m}   Prob{A_1}
A_2          Prob{A_2,B_1}   Prob{A_2,B_2}   ...   Prob{A_2,B_m}   Prob{A_2}
...          ...             ...             ...   ...             ...
A_n          Prob{A_n,B_1}   Prob{A_n,B_2}   ...   Prob{A_n,B_m}   Prob{A_n}
Marginal     Prob{B_1}       Prob{B_2}       ...   Prob{B_m}       Sum=1

The joint probability array contains all the information necessary for computing all marginal and conditional probabilities. Try to fill the table for the dependent and the independent case.

28 Dependent/independent: example Let us model the commute time to go back home for an ULB student living in St. Gilles as a random experiment. Suppose that its sample space is Ω = {LOW, MEDIUM, HIGH}. Consider also an (extremely :-) random experiment representing the weather in Brussels, whose sample space is Ω = {G=GOOD, B=BAD}. Suppose that the array of joint probabilities is

          G (in Bxl)      B (in Bxl)      Marginal
LOW       0.15            0.05            Prob{LOW} = 0.2
MEDIUM    0.10            0.40            Prob{MEDIUM} = 0.5
HIGH      0.05            0.25            Prob{HIGH} = 0.3
Marginal  Prob{G} = 0.3   Prob{B} = 0.7   Sum=1

Is the commute time dependent on the weather in Bxl? Now replace the Brussels weather by the weather in Rome:

          G (in Rome)     B (in Rome)     Marginal
LOW       0.18            0.02            Prob{LOW} = 0.2
MEDIUM    0.45            0.05            Prob{MEDIUM} = 0.5
HIGH      0.27            0.03            Prob{HIGH} = 0.3
Marginal  Prob{G} = 0.9   Prob{B} = 0.1   Sum=1

Is the commute time dependent on the weather in Rome?

29 Dependent/independent: example (II) If the Brussels weather is good:

            LOW             MEDIUM          HIGH
Prob{·|G}   0.15/0.3=0.50   0.10/0.3=0.33   0.05/0.3=0.16

Else, if the Brussels weather is bad:

            LOW             MEDIUM          HIGH
Prob{·|B}   0.05/0.7=0.07   0.40/0.7=0.57   0.25/0.7=0.35

The distribution of the commute time changes according to the value of the Brussels weather.

30 Dependent/independent: example (III) If Rome's weather is good:

            LOW             MEDIUM          HIGH
Prob{·|G}   0.18/0.9=0.2    0.45/0.9=0.5    0.27/0.9=0.3

Else, if Rome's weather is bad:

            LOW             MEDIUM          HIGH
Prob{·|B}   0.02/0.1=0.2    0.05/0.1=0.5    0.03/0.1=0.3

The distribution of the commute time does NOT change according to the value of Rome's weather.
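
The check can be scripted in R (a minimal sketch using the joint values of the tables above): each column of the conditional table is compared with the marginal distribution of the commute time.

    # Joint tables: rows = commute time, columns = weather
    bxl  <- matrix(c(0.15, 0.05, 0.10, 0.40, 0.05, 0.25), nrow = 3, byrow = TRUE,
                   dimnames = list(c("LOW", "MEDIUM", "HIGH"), c("G", "B")))
    rome <- matrix(c(0.18, 0.02, 0.45, 0.05, 0.27, 0.03), nrow = 3, byrow = TRUE,
                   dimnames = list(c("LOW", "MEDIUM", "HIGH"), c("G", "B")))
    cond <- function(J) sweep(J, 2, colSums(J), "/")  # Prob{time | weather}
    cond(bxl)   # the two columns differ: time depends on the Brussels weather
    cond(rome)  # both columns equal the marginal (0.2, 0.5, 0.3): independence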

31 Marginal/conditional: example Consider a probabilistic model of the day's weather based on three random descriptors (or features), where 1. the first represents the sky condition and takes values in the finite set {CLEAR, CLOUDY}, 2. the second represents the barometer trend and takes values in the finite set {RISING, FALLING}, 3. the third represents the humidity in the afternoon and takes values in {DRY, WET}.

32 Marginal/conditional: example (II) Let the joint distribution be given by the table

E_1      E_2       E_3   P(E_1,E_2,E_3)
CLEAR    RISING    DRY   0.40
CLEAR    RISING    WET   0.07
CLEAR    FALLING   DRY   0.08
CLEAR    FALLING   WET   0.10
CLOUDY   RISING    DRY   0.09
CLOUDY   RISING    WET   0.11
CLOUDY   FALLING   DRY   0.03
CLOUDY   FALLING   WET   0.12

From the joint distribution we can calculate the marginal probabilities P(CLEAR, RISING) = 0.47 and P(CLOUDY) = 0.35, and the conditional value P(DRY | CLEAR, RISING) = P(DRY, CLEAR, RISING) / P(CLEAR, RISING) = 0.40/0.47 ≈ 0.85.
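
The same computations in a short R sketch (a hand-rolled check, not one of the course scripts):

    # Joint distribution of (sky, barometer, humidity) as a data frame
    w <- expand.grid(E1 = c("CLEAR", "CLOUDY"), E2 = c("RISING", "FALLING"),
                     E3 = c("DRY", "WET"))
    w$P <- c(0.40, 0.09, 0.08, 0.03, 0.07, 0.11, 0.10, 0.12)
    sum(w$P[w$E1 == "CLEAR" & w$E2 == "RISING"])  # P(CLEAR,RISING) = 0.47
    sum(w$P[w$E1 == "CLOUDY"])                    # P(CLOUDY) = 0.35
    0.40 / 0.47                                   # P(DRY|CLEAR,RISING) ~ 0.85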

33 Random variables Machine learning and statistics are concerned with data. What, then, is the link between the notion of a random experiment and data? The answer is provided by the concept of random variable. Consider a random experiment (Ω, {E}, Prob{·}). The outcome of an experiment need not be a number: for example, the outcome when a coin is tossed can be heads or tails. However, we often want to represent outcomes as numbers. A random variable is a function that associates a unique numerical value with every outcome of an experiment. The value of the random variable will vary from trial to trial as the experiment is repeated. Suppose that we have a mapping rule Ω → R such that we can associate with each experimental outcome ω a real value z(ω). We say that z is the value taken by the random variable z when the outcome of the random experiment is ω. Since there is a probability associated with each event E and we have a mapping from events to real values, a probability distribution can be associated with z.

34 Random variables

35 Random variables Definition Given a random experiment (Ω, {E}, Prob{·}), a random variable z is the result of a mapping that assigns a number z to every outcome ω. This mapping must satisfy the following two conditions but is otherwise arbitrary: the set {z ≤ z} is an event for every z; the probabilities Prob{z = ∞} = 0 and Prob{z = −∞} = 0. Given a random variable z ∈ Z and a subset I ⊂ Z, we define the inverse mapping z⁻¹(I) = {ω ∈ Ω : z(ω) ∈ I}, where z⁻¹(I) ∈ {E} is an event, and we let Prob{z ∈ I} = Prob{z⁻¹(I)} = Prob{ω ∈ Ω : z(ω) ∈ I}.

36 Probabilistic interpretation of uncertainty This course will assume that the variability of measurements can be represented by the probability formalism. A random variable is a numerical quantity, linked to some experiment involving some degree of randomness, that takes its value from some set of possible real values. Example: the experiment might be the rolling of two six-sided dice and the r.v. z might be the sum of the two numbers showing on the dice. In this case the set of possible values is {2,...,12}. In the example of the commute time, the random experiment is a compact (and approximate) way of modeling the disparate set of causes which lead to variability in the value of z.

37 Probability function of a discrete r.v. The probability (mass) function of a discrete r.v. z is the combination of 1. the finite set Z of values that this r.v. can take (also called range or sample space), and 2. the set of probabilities associated with each value of Z. This means that we can attach to the random variable a specific mathematical function P_z(z) that gives, for each z ∈ Z, the probability that z assumes the value z: P_z(z) = Prob{z = z}. This function must satisfy the two following conditions: P_z(z) ≥ 0 for every z, and Σ_{z ∈ Z} P_z(z) = 1.

38 Probability function of a discrete r.v.(ii) For a reduced number of possible values of z, the probability function can be presented in the form of a table. For example, if we plan to toss a fair coin twice, and the random variable z is the number of heads that eventually turn up, the probability function can be presented as follow Values of the random variable z Associated probabilities

39 Parametric probability function Suppose that 1. z is a discrete r.v. that takes its values in Z = {1, 2, 3}, and 2. the probability function of z is P_z(z) = θ^{2z} / (θ² + θ⁴ + θ⁶), where θ is some fixed nonzero real number. Whatever the value of θ, P_z(z) ≥ 0 for z = 1, 2, 3 and P_z(1) + P_z(2) + P_z(3) = 1. Therefore z is a well-defined random variable, even if the value of θ is unknown. We call θ a parameter, that is, some constant, usually unknown, involved in a probability function. The collection of all probability distributions for different values of the parameter is called a family of probability distributions.

40 Expected value of a discrete r.v. The expected value of a discrete random variable z is defined as E[z] = μ = Σ_{z ∈ Z} z P_z(z). The expected value (introduced first by Huygens in the seventeenth century) is a weighted average of the possible values that z could assume, where each value is weighted with the probability that z would assume the value in question. The expected value must not be confused with the "most probable value": the expected value is not necessarily a value that belongs to Z (e.g. the expected value of a die roll is 3.5). In English, mean is used as a synonym of "expected value", but the word average is NOT a synonym of "expected value".

41 Substitution rule For any function g of the random variable z, E[g(z)] = Σ_{z ∈ Z} g(z) P_z(z), provided that Σ_{z ∈ Z} |g(z)| P_z(z) < ∞. Note that in general E[g(z)] ≠ g(E[z]). An exception is the linear function g(z) = az + b, for which E[az + b] = aE[z] + b.

42 Variance of a discrete r.v. The variance of a discrete random variable z is defined as Var[z] = σ² = E[(z − E[z])²] = Σ_{z ∈ Z} (z − E[z])² P_z(z). The variance is a measure of the dispersion of the probability function of the random variable around its mean. Note that since (z − μ)² = z² − 2μz + μ², the following identity holds: E[(z − E[z])²] = E[z²] − (E[z])² = E[z²] − μ².
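
As a quick illustration (my own minimal check, using the die example mentioned above), the mean and variance of a discrete r.v. can be computed directly from its probability function:

    # Fair die: values and probability function
    z  <- 1:6
    Pz <- rep(1/6, 6)
    mu <- sum(z * Pz)               # E[z] = 3.5 (not a value in Z!)
    v  <- sum((z - mu)^2 * Pz)      # Var[z] ~ 2.92
    c(v, sum(z^2 * Pz) - mu^2)      # identity: Var[z] = E[z^2] - mu^2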

43 Examples of probability functions [Figure: two discrete r.v. probability functions having the same mean but different variance.]

44 Std. deviation and moments of a discrete r.v. The standard deviation of a discrete random variable z is defined as the positive square root of the variance: Std[z] = √Var[z] = σ. Moment: for any positive integer r, the r-th moment of the probability function is μ_r = E[z^r] = Σ_z z^r P_z(z). The skewness of a discrete random variable z is defined as γ = E[(z − μ)³] / σ³. Distributions with positive skewness have long tails to the right, and distributions with negative skewness have long tails to the left.

45 Joint probability Consider a probabilistic model described by n discrete random variables. A fully specified probabilistic model gives the joint probability function for every combination of the values of the n r.v.s. The model is specified by the values of the probabilities Prob{z 1 = z 1,z 2 = z 2,...,z n = z n } = P(z 1,z 2,...,z n ) for every possible assignment of values z 1,...,z n to the variables.

46 Independent variables Let x and y be two random variables. The two variables x and y are defined to be statistically independent if the joint probability Prob{x = x, y = y} = Prob{x = x} Prob{y = y}. If two variables x and y are independent, then the transformed r.v.s g(x) and h(y), where g and h are two given functions, are also independent. In qualitative terms, this means that we do not expect the outcome of one variable to affect the other. Examples: think of two outcomes of a roulette wheel, or of two coins tossed simultaneously.

47 Continuous random variable Continuous random variables take their values in some continuous range of values. Consider a real random variable z whose range is the set of real numbers. The following quantities can be defined: Definition The (cumulative) distribution function of z is the function F_z(z) = Prob{z ≤ z}. Definition The density function of a real random variable z is the derivative of the distribution function: p_z(z) = dF_z(z)/dz.

48 Continuous random variable Any individual value has probability zero for a continuous random variable. Probabilities of continuous r.v.s are not allocated to specific values but rather to intervals of values. Specifically, Prob{a < z < b} = ∫_a^b p_z(z) dz and ∫_Z p_z(z) dz = 1.

49 Mean, variance, ... of a continuous r.v. Consider a continuous scalar r.v. having range (l,h) and density function p(z). We can define: Expectation (mean): μ = ∫_l^h z p(z) dz. Variance: σ² = ∫_l^h (z − μ)² p(z) dz. Other quantities of interest are the moments: μ_r = E[z^r] = ∫_l^h z^r p(z) dz. The moment of order r = 1 is the mean of z.

50 Uniform distribution A random variable z is said to be uniformly distributed on the interval (a,b) (written z ∼ U(a,b)) if its probability density function is given by p(z) = 1/(b − a) if a < z < b, and p(z) = 0 otherwise. [Figure: the density is constant at height 1/(b − a) between a and b.] TP: compute the variance of U(a,b).

51 Normal distribution: the scalar case A continuous scalar random variable x is said to be normally distributed with parameters μ and σ² (written x ∼ N(μ,σ²)) if its probability density function is given by p_x(x) = (1/(√(2π) σ)) e^{−(x−μ)²/(2σ²)}. The mean of x is μ; the variance of x is σ². The coefficient in front of the exponential ensures that ∫ p(x) dx = 1. The probability that an observation x from a normal r.v. is within 2 standard deviations from the mean is approximately 0.95. If μ = 0 and σ² = 1, the distribution is called standard normal. We will denote its distribution function by F_z(z) = Φ(z). Given a normal r.v. x ∼ N(μ,σ²), the r.v. z = (x − μ)/σ has a standard normal distribution.

52 Standard distribution

53 Important relations For x ∼ N(μ,σ²):

Prob{μ − σ ≤ x ≤ μ + σ} ≈ 0.683
Prob{μ − 1.282σ ≤ x ≤ μ + 1.282σ} ≈ 0.8
Prob{μ − 1.645σ ≤ x ≤ μ + 1.645σ} ≈ 0.9
Prob{μ − 1.96σ ≤ x ≤ μ + 1.96σ} ≈ 0.95
Prob{μ − 2σ ≤ x ≤ μ + 2σ} ≈ 0.954
Prob{μ − 2.57σ ≤ x ≤ μ + 2.57σ} ≈ 0.99
Prob{μ − 3σ ≤ x ≤ μ + 3σ} ≈ 0.997

Test these relations yourself by random sampling and simulation using R!
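
Following the slide's suggestion, a minimal R check by exact computation and by simulation:

    # Exact interval probabilities via the normal CDF, plus a sampling check
    k <- c(1, 1.282, 1.645, 1.96, 2, 2.57, 3)
    pnorm(k) - pnorm(-k)    # 0.683 0.800 0.900 0.950 0.954 0.990 0.997
    set.seed(2)
    x <- rnorm(1e6)         # standard normal sample (mu = 0, sigma = 1)
    mean(abs(x) <= 1.96)    # close to 0.95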

54 Linear combinations The expected value of a linear combination of r.v.s is simply the linear combination of their respective expected values: E[ax + by] = aE[x] + bE[y], i.e., expectation is a linear statistic. Since the variance is not a linear statistic, we have Var[ax + by] = a² Var[x] + b² Var[y] + 2ab (E[xy] − E[x]E[y]) = a² Var[x] + b² Var[y] + 2ab Cov[x,y], where Cov[x,y] = E[(x − E[x])(y − E[y])] = E[xy] − E[x]E[y] is called the covariance. The covariance measures whether the two variables vary simultaneously in the same way around their averages.

55 Covariance example Consider two discrete r.v.s x and y with a joint probability table whose marginals are

          y = 3            y = 10            y = 20            Marginal
x = 10    ·                ·                 ·                 P(x = 10) = 0.25
x = 20    ·                ·                 ·                 P(x = 20) = 0.4
x = 30    ·                ·                 ·                 P(x = 30) = 0.35
Marginal  P(y = 3) = 0.3   P(y = 10) = 0.4   P(y = 20) = 0.3   Sum=1

From the joint table (and the probability function of the product xy) one obtains E[x] = 21, Var[x] = 59, E[y] = 10.9, Var[y] = 43.89 and E[xy] = 211, hence Cov[x,y] = E[xy] − E[x]E[y] = 211 − 228.9 = −17.9 and ρ(x,y) = −0.35.
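
These numbers can be reproduced in R from the marginal distributions and the value E[xy] = 211 reported on the slide (a quick arithmetic check):

    # Moments from the marginals and the given E[xy]
    x <- c(10, 20, 30); px <- c(0.25, 0.4, 0.35)
    y <- c(3, 10, 20);  py <- c(0.3, 0.4, 0.3)
    Ex <- sum(x * px); Vx <- sum(x^2 * px) - Ex^2  # 21, 59
    Ey <- sum(y * py); Vy <- sum(y^2 * py) - Ey^2  # 10.9, 43.89
    Exy <- 211                                     # value given on the slide
    cov.xy <- Exy - Ex * Ey                        # -17.9
    cov.xy / sqrt(Vx * Vy)                         # rho = -0.35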

56 Correlation The correlation coefficient is ρ(x,y) = Cov[x,y] / √(Var[x] Var[y]). It is easily shown that −1 ≤ ρ(x,y) ≤ 1. Two r.v.s are called uncorrelated if E[xy] = E[x]E[y]. If x and y are two independent random variables, then Cov[x,y] = 0, or equivalently E[xy] = E[x]E[y]. If x and y are two independent random variables, then also Cov[g(x),h(y)] = 0, or equivalently E[g(x)h(y)] = E[g(x)]E[h(y)]. Independence ⇒ uncorrelatedness, but not vice versa, for a generic distribution. Independence ⇔ uncorrelatedness if x and y are jointly Gaussian.

57 Correlation and causation The existence of a correlation different from zero does not necessarily mean that there is a causal relationship. Think of these examples: the amount of Coke drunk per day and sport performance; sleeping with shoes on and headache; the number of firemen and the gravity of the disaster; taking an expensive drug and cancer risk. According to Tufte, "empirically observed covariation is a necessary but not sufficient condition for causality".

58 Linear combination of independent vars If the random variables x and y are independent, then Var[x + y] = Var[x] + Var[y] and Var[ax + by] = a² Var[x] + b² Var[y]. In general, if the random variables x_1, x_2, ..., x_k are independent, then Var[Σ_{i=1}^k c_i x_i] = Σ_{i=1}^k c_i² σ_i².

59 TP 1. Let x and y be two discrete independent r.v.s such that P_x(−1) = 0.1, P_x(0) = 0.8, P_x(1) = 0.1 and P_y(1) = 0.1, P_y(2) = 0.8, P_y(3) = 0.1. If z = x + y, show that E[z] = E[x] + E[y]. 2. Let x be a discrete r.v. which assumes values in {−1, 0, 1}, each with probability 1/3, and let y = x². (a) Let z = x + y; show that E[z] = E[x] + E[y]. (b) Demonstrate that x and y are uncorrelated but dependent random variables.

60 The sum of i.i.d. random variables Suppose that z_1, z_2, ..., z_N are i.i.d. (independently and identically distributed) random variables, discrete or continuous, each having a probability distribution with mean μ and variance σ². Let us consider two derived r.v.s, the sum and the average: S_N = z_1 + z_2 + ... + z_N and z̄ = (z_1 + z_2 + ... + z_N)/N. The following relations hold: E[S_N] = Nμ, Var[S_N] = Nσ², E[z̄] = μ, Var[z̄] = σ²/N. See the R script sum_rv.r.
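
A minimal sketch of the kind of experiment in sum_rv.r (assuming, for illustration, uniform summands; the actual script may differ):

    # Empirical check of E[S_N] = N*mu, Var[S_N] = N*sigma^2, Var[zbar] = sigma^2/N
    set.seed(3)
    N <- 10; R <- 1e5                    # N terms per sum, R repetitions
    z <- matrix(runif(N * R), nrow = R)  # i.i.d. U(0,1): mu = 0.5, sigma^2 = 1/12
    S <- rowSums(z)
    c(mean(S), N * 0.5)                  # E[S_N] vs N*mu
    c(var(S), N / 12)                    # Var[S_N] vs N*sigma^2
    c(var(S / N), (1 / 12) / N)          # Var[zbar] vs sigma^2/N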

61 Normal distribution: the multivariate case Let z be a random vector (n × 1). The vector is said to be normally distributed with parameters μ (n × 1) and Σ (n × n) (written z ∼ N(μ,Σ)) if its probability density function is given by

p_z(z) = 1 / ((√(2π))^n √(det Σ)) exp{ −(1/2) (z − μ)^T Σ⁻¹ (z − μ) }

It follows that: the mean E[z] = μ = [μ_1, ..., μ_n]^T is an [n,1]-dimensional vector, where μ_i = E[z_i], i = 1,...,n; the [n,n] matrix

Σ = E[(z − μ)(z − μ)^T] = | σ_1²   σ_12   ...   σ_1n |
                          | σ_12   σ_2²   ...   σ_2n |
                          | ...    ...    ...   ...  |
                          | σ_1n   σ_2n   ...   σ_n² |

is the covariance matrix, where σ_i² = Var[z_i] and σ_ij = Cov[z_i, z_j]. This matrix is square and symmetric and has n(n + 1)/2 parameters.

62 Normal multivariate distribution (II) The quantity Δ² = (z − μ)^T Σ⁻¹ (z − μ), which appears in the exponent of p_z, is the squared Mahalanobis distance from z to μ. It can be shown that the surfaces of constant probability density are hyperellipsoids on which Δ² is constant; their principal axes are given by the eigenvectors u_i, i = 1,...,n of Σ, which satisfy Σ u_i = λ_i u_i, i = 1,...,n, where the λ_i are the corresponding eigenvalues. The eigenvalues λ_i give the variances along the principal directions.

63 Normal multivariate distribution (III) If the covariance matrix Σ is diagonal, then: the contours of constant density are hyperellipsoids with the principal directions aligned with the coordinate axes; the components of z are then statistically independent, since the distribution of z can be written as the product of the distributions of each of the components separately, in the form p_z(z) = Π_{i=1}^n p(z_i); the total number of independent parameters in the distribution is 2n; if σ_i = σ for all i, the contours of constant density are hyperspheres.

64 Bivariate normal distribution Consider a bivariate normal density whose mean is μ = [μ_1, μ_2]^T and whose covariance matrix is

Σ = | σ_1²   σ_12 |
    | σ_21   σ_2² |

The correlation coefficient is ρ = σ_12 / (σ_1 σ_2). It can be shown that the general bivariate normal density has the form

p(z_1, z_2) = 1 / (2π σ_1 σ_2 √(1 − ρ²)) exp{ −1/(2(1 − ρ²)) [ ((z_1 − μ_1)/σ_1)² − 2ρ ((z_1 − μ_1)/σ_1)((z_2 − μ_2)/σ_2) + ((z_2 − μ_2)/σ_2)² ] }

65 Bivariate normal distribution Let Σ = [1.2919, …; …, …]. [Figure: surface plot of the bivariate normal density p(z_1, z_2) over the (z_1, z_2) plane.]

66 Bivariate normal distribution (prj) [Figure: projection of the bivariate normal density onto the (z_1, z_2) plane, showing the principal axes u_1, u_2 with eigenvalues λ_1, λ_2.] See the R script s_gaussxyz.r.
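
A sketch in the spirit of s_gaussxyz.r (the covariance matrix below is made up, since the slide's Σ is incomplete; MASS::mvrnorm is one standard way to sample):

    # Sample a bivariate normal and recover the principal axes of Sigma
    library(MASS)                                  # for mvrnorm
    Sigma <- matrix(c(1.3, 0.7, 0.7, 1.0), 2, 2)   # assumed covariance matrix
    z <- mvrnorm(5000, mu = c(0, 0), Sigma = Sigma)
    plot(z[, 1], z[, 2], pch = ".", xlab = "z1", ylab = "z2")
    e <- eigen(Sigma)                              # Sigma u_i = lambda_i u_i
    arrows(0, 0, e$vectors[1, ] * sqrt(e$values),
           e$vectors[2, ] * sqrt(e$values), col = "red", lwd = 2)
    e$values                 # variances along the principal directions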

67 Marginal and conditional distributions One of the important properties of the multivariate normal density is that all conditional and marginal probabilities are also normal. Using the relation p(z_2 | z_1) = p(z_1, z_2) / p(z_1), we find that p(z_2 | z_1) is a normal distribution N(μ_{2|1}, σ_{2|1}²), where μ_{2|1} = μ_2 + ρ (σ_2/σ_1)(z_1 − μ_1) and σ_{2|1}² = σ_2² (1 − ρ²). Note that μ_{2|1} is a linear function of z_1: if the correlation coefficient ρ is positive, the larger z_1, the larger μ_{2|1}. If there is no correlation between z_1 and z_2, we can ignore the value of z_1 when estimating μ_2.

68 [Figure: scatter of a bivariate normal sample with the marginal densities p(z_1), p(z_2) and the conditional density p(z_2 | z_1 = 0); in the plotted example Var[z_2] = 1.05 while Var[z_2 | z_1] = 0.19.]

69 The central limit theorem Theorem Assume that z_1, z_2, ..., z_N are i.i.d. random variables, discrete or continuous, each having a probability distribution with finite mean μ and finite variance σ². As N → ∞, the standardized random variable √N (z̄ − μ)/σ, which is identical to (S_N − Nμ)/(√N σ), converges in distribution to a r.v. having the standard normal distribution N(0,1). This result holds regardless of the common distribution of the z_i. This theorem justifies the importance of the normal distribution, since many r.v.s of interest are either sums or averages. See the R script central.r.
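
A minimal sketch of the kind of demonstration in central.r (assuming, for illustration, skewed exponential summands; the actual script may differ):

    # CLT demo: standardized means of exponential samples look standard normal
    set.seed(4)
    N <- 50; R <- 1e4                         # sample size and repetitions
    zbar <- replicate(R, mean(rexp(N)))       # exp(1): mu = 1, sigma = 1
    stat <- sqrt(N) * (zbar - 1) / 1          # standardized sample mean
    hist(stat, freq = FALSE, breaks = 50)
    curve(dnorm(x), add = TRUE, col = "red")  # N(0,1) density for comparison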

70 The chi-squared distribution For N a positive integer, a r.v. z has a χ²_N distribution if z = x_1² + x_2² + ... + x_N², where x_1, x_2, ..., x_N are i.i.d. N(0,1) random variables. This probability distribution is a gamma distribution with parameters (N/2, 1/2); E[z] = N and Var[z] = 2N. The distribution is called a chi-squared distribution with N degrees of freedom.

71 The chi-squared distribution (II) [Figure: χ²_N density and cumulative distribution function for N = 10.] R script chisq.r.

72 Student's t-distribution If x ∼ N(0,1) and y ∼ χ²_N are independent, then the Student's t-distribution with N degrees of freedom is the distribution of the r.v. z = x / √(y/N). We denote this by z ∼ T_N.

73 Student's t-distribution [Figure: Student density and cumulative distribution function for N = 10.] R script s_stu.r.

74 Notation In order to clarify the distinction between random variables and their values, we will use boldface notation for a random variable (e.g. z) and normal face notation for the eventually observed value (e.g. z = 11). The notation P_z(z) denotes the probability that the random variable z takes the value z. The suffix indicates that the probability relates to the random variable z. This is necessary since we often discuss probabilities associated with several random variables simultaneously. Example: z could be the age of a student before asking, and z = 22 could be the value after the observation.

75 Notation (II) In general terms, we will denote as the probability distribution of a random variable z any complete description of the probabilistic behavior of z. For example, if z is continuous, the density function p(z) or the distribution function are examples of probability distributions. Given a probability distribution F_z(z), the notation {z_1, z_2, ..., z_N} ∼ F_z means that the dataset D_N = {z_1, z_2, ..., z_N} is an i.i.d. random sample observed from the probability distribution F_z(·).


More information

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019 Lecture 10: Probability distributions DANIEL WELLER TUESDAY, FEBRUARY 19, 2019 Agenda What is probability? (again) Describing probabilities (distributions) Understanding probabilities (expectation) Partial

More information

Preliminary statistics

Preliminary statistics 1 Preliminary statistics The solution of a geophysical inverse problem can be obtained by a combination of information from observed data, the theoretical relation between data and earth parameters (models),

More information

STAT2201. Analysis of Engineering & Scientific Data. Unit 3

STAT2201. Analysis of Engineering & Scientific Data. Unit 3 STAT2201 Analysis of Engineering & Scientific Data Unit 3 Slava Vaisman The University of Queensland School of Mathematics and Physics What we learned in Unit 2 (1) We defined a sample space of a random

More information

8 Laws of large numbers

8 Laws of large numbers 8 Laws of large numbers 8.1 Introduction We first start with the idea of standardizing a random variable. Let X be a random variable with mean µ and variance σ 2. Then Z = (X µ)/σ will be a random variable

More information

ECE531: Principles of Detection and Estimation Course Introduction

ECE531: Principles of Detection and Estimation Course Introduction ECE531: Principles of Detection and Estimation Course Introduction D. Richard Brown III WPI 15-January-2013 WPI D. Richard Brown III 15-January-2013 1 / 39 First Lecture: Major Topics 1. Administrative

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Origins of Probability Theory

Origins of Probability Theory 1 16.584: INTRODUCTION Theory and Tools of Probability required to analyze and design systems subject to uncertain outcomes/unpredictability/randomness. Such systems more generally referred to as Experiments.

More information

Lecture 4: Probability and Discrete Random Variables

Lecture 4: Probability and Discrete Random Variables Error Correcting Codes: Combinatorics, Algorithms and Applications (Fall 2007) Lecture 4: Probability and Discrete Random Variables Wednesday, January 21, 2009 Lecturer: Atri Rudra Scribe: Anonymous 1

More information

Week 12-13: Discrete Probability

Week 12-13: Discrete Probability Week 12-13: Discrete Probability November 21, 2018 1 Probability Space There are many problems about chances or possibilities, called probability in mathematics. When we roll two dice there are possible

More information

Why study probability? Set theory. ECE 6010 Lecture 1 Introduction; Review of Random Variables

Why study probability? Set theory. ECE 6010 Lecture 1 Introduction; Review of Random Variables ECE 6010 Lecture 1 Introduction; Review of Random Variables Readings from G&S: Chapter 1. Section 2.1, Section 2.3, Section 2.4, Section 3.1, Section 3.2, Section 3.5, Section 4.1, Section 4.2, Section

More information

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations EECS 70 Discrete Mathematics and Probability Theory Fall 204 Anant Sahai Note 5 Random Variables: Distributions, Independence, and Expectations In the last note, we saw how useful it is to have a way of

More information

Lecture 1: Review on Probability and Statistics

Lecture 1: Review on Probability and Statistics STAT 516: Stochastic Modeling of Scientific Data Autumn 2018 Instructor: Yen-Chi Chen Lecture 1: Review on Probability and Statistics These notes are partially based on those of Mathias Drton. 1.1 Motivating

More information

Lecture Note 1: Probability Theory and Statistics

Lecture Note 1: Probability Theory and Statistics Univ. of Michigan - NAME 568/EECS 568/ROB 530 Winter 2018 Lecture Note 1: Probability Theory and Statistics Lecturer: Maani Ghaffari Jadidi Date: April 6, 2018 For this and all future notes, if you would

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition A Brief Mathematical Review Hamid R. Rabiee Jafar Muhammadi, Ali Jalali, Alireza Ghasemi Spring 2012 http://ce.sharif.edu/courses/90-91/2/ce725-1/ Agenda Probability theory

More information

Probability. Lecture Notes. Adolfo J. Rumbos

Probability. Lecture Notes. Adolfo J. Rumbos Probability Lecture Notes Adolfo J. Rumbos October 20, 204 2 Contents Introduction 5. An example from statistical inference................ 5 2 Probability Spaces 9 2. Sample Spaces and σ fields.....................

More information

Statistical Methods in Particle Physics

Statistical Methods in Particle Physics Statistical Methods in Particle Physics Lecture 3 October 29, 2012 Silvia Masciocchi, GSI Darmstadt s.masciocchi@gsi.de Winter Semester 2012 / 13 Outline Reminder: Probability density function Cumulative

More information

Human-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015

Human-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015 Probability Refresher Kai Arras, University of Freiburg Winter term 2014/2015 Probability Refresher Introduction to Probability Random variables Joint distribution Marginalization Conditional probability

More information

Review of Probability. CS1538: Introduction to Simulations

Review of Probability. CS1538: Introduction to Simulations Review of Probability CS1538: Introduction to Simulations Probability and Statistics in Simulation Why do we need probability and statistics in simulation? Needed to validate the simulation model Needed

More information

Probability. Paul Schrimpf. January 23, Definitions 2. 2 Properties 3

Probability. Paul Schrimpf. January 23, Definitions 2. 2 Properties 3 Probability Paul Schrimpf January 23, 2018 Contents 1 Definitions 2 2 Properties 3 3 Random variables 4 3.1 Discrete........................................... 4 3.2 Continuous.........................................

More information