CS 540: Machine Learning Lecture 2: Review of Probability & Statistics

Size: px

Start display at page:

Download "CS 540: Machine Learning Lecture 2: Review of Probability & Statistics"

Stewart Todd
5 years ago
Views:

1 CS 540: Machine Learning Lecture 2: Review of Probability & Statistics AD January 2008 AD () January / 35

2 Outline Probability theory (PRML, Section 1.2) Statistics (PRML, Sections ) AD () January / 35

3 Probability Basics Begin with a set Ω-the sample space; e.g. 6 possible rolls of a die. ω 2 Ω is a sample point/possible event. A probability space or probability model is a sample space with an assignment P (ω) for every ω 2 Ω s.t. 0 P (ω) 1 and P (ω) = 1; ω2ω e.g. for a die P (1) = P (2) = P (3) = P (4) = P (5) = P (6) = 1/6. An event A is any subset of Ω P (A) = P (ω) ω2a e.g. P (die roll < 3) = P (1) + P (2) = 1/6 + 1/6 = 1/3. AD () January / 35

4 Random Variables A random variable is loosely speaking a function from sample points to some range; e.g. the reals. P induces a probability distribution for any r.v. X : P (X = x) = P (ω) fω2ω:x (ω)=x g e.g. P (Odd = true) = P (1) + P (3) + P (5) = 1/6 + 1/6 + 1/6 = 1/2. AD () January / 35

5 Why use probability? Probability theory is nothing but common sense reduced to calculation Pierre Laplace, In 1931, de Finetti proved that it is irrational to have beliefs that violate these axioms, in the following sense: If you bet in accordance with your beliefs, but your beliefs violate the axioms, then you can be guaranteed to lose money to an opponent whose beliefs more accurately re ect the true state of the world. (Here, betting and money are proxies for decision making and utilities.) What if you refuse to bet? This is like refusing to allow time to pass: every action (including inaction) is a bet. Various other arguments (Cox, Savage). AD () January / 35

6 Probability for continuous variables We typically do not work with probability distributions but with densities. We have so We have Z P (X 2 A) = Z Ω A p X (x) dx. p X (x) dx = 1 P (X 2 [x, x + δx]) p X (x) δx Warning: A density evaluated at a point can be bigger than 1. AD () January / 35

7 Univariate Gaussian (Normal) density If X N µ, σ 2 then the pdf of X is de ned as p X (x) =! 1 p exp (x µ) 2 2πσ 2 2σ 2 We often use the precision λ = 1/σ 2 instead of the variance. AD () January / 35

8 Statistics of a distribution Recall that the mean and variance of a distribution are de ned as Z E (X ) = xp X (x) dx V (X ) = E (X E (X )) 2 = Z (x E (X )) 2 p X (x) dx For a Gaussian distribution, we have Chebyshev s inequality E (X ) = µ, V (X ) = σ 2. P (jx E (X )j ε) V (X ) ε 2. AD () January / 35

9 Cumulative distribution function The cumulative distribution function is de ned as We have F X (x) = P (X x) = Z x lim X (x) x! = 0, lim X (x) x! = 1. p X (u) du AD () January / 35

10 Law of large numbers Assume you have X i i.i.d. p X () then n Z 1 lim n! n ϕ (X i ) = E [ϕ (X )] = ϕ (x) p X (x) dx The result is still valid for weakly dependent random variables. AD () January / 35

11 Central limit theorem i.i.d. Let X i p X () such that E [X i ] = µ and V (X i ) = σ 2 then! p n 1 D n n X i µ! N 0, σ 2. This result is (too) often used to justify the Gaussian assumption AD () January / 35

12 Delta method Assume we have! p n 1 n n X i µ D! N 0, σ 2 Then we have! p n 1 n g n X i g (µ)! D! N 0, g 0 (µ) 2 σ 2 This follows from the Taylor s expansion g! n 1 n X i g (µ) + g 0 (µ) 1 n! n X i µ + o 1 n AD () January / 35

13 Prior probability Consider the following random variables Sick2 fyes,nog and Weather2 fsunny,rain,cloudy,snowg We can set a prior probability on these random variables; e.g. P (Sick = yes) = 1 P (Sick = no) = 0.1, P (Weather) = (0.72, 0.1, 0.08, 0.1) Obviously we cannot answer questions like P (Sick = yesj Weather = rainy) as long as we don t de ne an appropriate distribution. AD () January / 35

14 Joint probability distributions We need to assign a probability to each sample point For (Weather, Sick), we have to consider 4 2 events; e.g. Weather / Sick sunny rainy cloudy snow yes no From this probability distribution, we can compute probabilities of the form P (Sick = yesj Weather = rainy). AD () January / 35

15 Bayes rule We have It follows that P (x, y) = P (xj y) P (y) = P (yj x) P (x) P (xj y) = = P (x, y) P (y) P (yj x) P (x) x P (yj x) P (x) We have P (Sick = yesj Weather = rainy) = = P (Sick = yes,weather = rainy) P (Weather = rainy) = 0.2 AD () January / 35

16 Example Useful for assessing diagnostic prob from causal prob P (Causej E ect) = P (E ectj Cause) P (Cause) P (E ect) Let M be meningitis, S be sti neck: P (mj s) = P (sj m) P (m) P (s) = = Be careful to subtle exchanging of P (Aj B) for P (Bj A). AD () January / 35

17 Example Prosecutor s Fallacy. A zealous prosecutor has collected an evidence and has an expert testify that the probability of nding this evidence if the accused were innocent is one-in-a-million. The prosecutor concludes that the probability of the accused being innocent is one-in-a-million. This is WRONG. AD () January / 35

18 Assume no other evidence is available and the population is of 10 million people. De ning A = The accused is guilty then P (A) = De ning B = Finding this evidence then P (Bj A) = 1 & P Bj A = Bayes formula yields P (Aj B) = = P (Bj A) P (A) P (Bj A) P (A) + P Bj A P A ( ) 0.1. Real-life Example: Sally Clark was condemned in the UK (The RSS pointed out the mistake). Her conviction was eventually quashed (on other grounds). AD () January / 35

19 You feel sick and your GP thinks you might have contracted a rare disease (0.01% of the population has the disease). A test is available but not perfect. If a tested patient has the disease, 100% of the time the test will be positive. If a tested patient does not have the diseases, 95% of the time the test will be negative (5% false positive). Your test is positive, should you really care? AD () January / 35

20 Let A be the event that the patient has the disease and B be the event that the test returns a positive result P (Aj B) = Such a test would be a complete waste of money for you or the National Health System. A similar question was asked to 60 students and sta at Harvard Medical School: 18% got the right answer, the modal response was 95%! AD () January / 35

21 General Bayesian framework Assume you have some unknown parameter θ and some data D. The unknown parameters are now considered RANDOM of prior density p (θ) The likelihood of the observation D is p (Dj θ). The posterior is given by p ( θj D) = p (Dj θ) p (θ) p (D) where Z p (D) = p (Dj θ) p (θ) dθ p (D) is called the marginal likelihood or evidence. AD () January / 35

22 Bayesian interpretation of probability In the Bayesian approach, probability describes degrees of belief, instead of limiting frequencies. The selection of a prior has an obvious impact on the inference results! However, Bayesian statisticians are honest about it. AD () January / 35

23 Bayesian inference involves passing from a prior p (θ) to a posterior p ( θj D).We might expect that because the posterior incorporates the information from the data, it will be less variable than the prior. We have the following identities E [θ] = E [E [ θj D]], V [θ] = E [V [ θj D]] + V [E [ θj D]]. It means that, on average (over the realizations of the data X ) we expect the conditional expectation E [ θj D] to be equal to E [θ] and the posterior variance to be on average smaller than the prior variance by an amount that depend on the variations in posterior means over the distribution of possible data. AD () January / 35

24 If (θ, X ) are two scalar random variables then we have Proof: V [θ] = E [V [ θj D]] + V [E [ θj D]]. V [θ] = E θ 2 E (θ) 2 = E E θ 2 D (E (E ( θj D))) 2 = E E θ 2 D E (E ( θj D)) 2 +E (E ( θj D)) 2 (E (E ( θj D))) 2 = E (V ( θj X )) + V (E ( θj X )). Such results appear attractive but one should be careful. Here there is an underlying assumption that the observations are indeed distributed according to p (D). AD () January / 35

25 Frequentist interpretation of probability Probabilities are objective properties of the real world, and refer to limiting relative frequencies (e.g. number of times I have observed heads.). Problem. How do you attribute a probability to the following event There will be a major earthquake in Tokyo on the 27th April 2013? Hence in most cases θ is assumed unknown but xed. We then typically compute point estimates of the form bθ = ϕ (D) which are designed to have various desirable quantities when averaged over future data D 0 (assumed to be drawn from the true distribution). AD () January / 35

26 Maximum Likelihood Estimation Suppose we have X i i.i.d. p ( j θ) for i = 1,..., N then The MLE is de ned as L (θ) = p (Dj θ) = N bθ = arg max L (θ) = arg max l (θ) p (x i j θ). where l (θ) = log L (θ) = N log p (x i j θ). AD () January / 35

27 MLE for 1D Gaussian Consider X i i.i.d. N µ, σ 2 then so p Dj µ, σ 2 = N N x i ; µ, σ 2 l (θ) = N N 2 log (2π) N 2 log σ2 1 2σ 2 (x i µ) 2 Setting l(θ) µ = l(θ) = 0 yields σ 2 bµ ML = 1 N N x i, N cσ 2 ML = 1 (x i bµ N ML ) 2. h i We have E [bµ ML ] = µ and E cσ 2 ML = N N 1 σ2. AD () January / 35

28 Consistent estimators De nition. A sequence of estimators bθ N = bθ N (X 1,..., X N ) is consistent for the parameter θ if, for every ε > 0 and every θ 2 Θ lim P θ bθ N θ < ε = 1 (equivalently lim P θ bθ N θ ε = 0). N! N! i.i.d. Example: Consider X i N (µ, 1) and bµ N = 1 N N X i then bµ N N µ, N 1 and P θ bθ N θ < ε = Z ε p N ε p N 1 p 2π exp u 2 2 du! 1. It is possible to avoid this calculations and use instead Chebychev s inequality b 2 E P θ bθ N θ ε = P θ bθ n θ 2 θ θ n θ ε 2 where E θ b θ n 2 2 θ = var θ b θ n + E θ b θ n θ. AD () January / 35 ε 2

29 Example of inconsistent MLE (Fisher) (X i, Y i ) N µi µ i The likelihood function is given by L (θ) = We obtain 1 (2πσ 2 ) n exp l (θ) = cste n log σ 2 " n 1 xi + y i 2σ We have bµ i = x i + y i 2 σ 2 0, 0 σ 2. n 1 h 2σ 2 (x i µ i ) 2 + (y i µ i ) 2i! µ i , c σ 2 = n (x i y i ) 2 4n # n (x i y i ) 2.! σ2 2. AD () January / 35

30 Consistency of the MLE Kullback-Leibler Distance: For any density f, g Z f (x) D (f, g) = f (x) log dx g (x) We have Indeed Z D (f, g) = D (f, g) 0 and D (f, f ) = 0. f (x) log Z g (x) dx f (x) g (x) f (x) f (x) 1 dx = 0 D (f, g) is a very useful distance and appears in many di erent contexts. AD () January / 35

31 Sketch of consistency of the MLE Assume the pdfs f (xj θ) have common support for all θ and p (xj θ) 6= p xj θ 0 for θ 6= θ 0 ; i.e. S θ = fx : p (xj θ) > 0g is independent of θ. Denote M N (θ) = 1 N N log p (X i j θ) p (X i j θ ) As the MLE bθ n maximizes L (θ), it also maximizes M N (θ). i.i.d. Assume X i p ( j θ ). Note that by the law of large numbers M N (θ) converges to Z p (X j θ) p (xj θ) E θ log = p (xj θ ) log p (X j θ ) p (xj θ ) dx = D (p ( j θ ), p ( j θ)) := M (θ). Hence, M N (θ) D (p ( j θ ), p ( j θ)) which is maximized for θ so we expect that its maximizer will converge towards θ. AD () January / 35

32 Asymptotic Normality Assuming bθ N is a consistent estimate of θ, we have under regularity assumptions p N b θ N θ ) N 0, [I (θ 1 )] where We can estimate I (θ ) through [I (θ)] k,l = 2 log p (xj θ) E θ k θ l [I (θ )] k,l = 1 N N 2 log p (X i j θ) θ k θ l b θ N AD () January / 35

33 Sketch of the proof We have 0 = l 0 bθ N l 0 (θ ) + b θ N ) p N b θ N θ = 1 p N l 0 (θ ) 1 N l 00 (θ ) θ l 00 (θ ) h Now l 0 (θ ) = n log p( X i jθ) θ where E log p( Xi jθ) θ θ θ h i log p( Xi jθ) var θ θ = I (θ ) so the CLT tells us that θ where by the law of large numbers 1 N l 00 (θ ) = 1 p N l 0 (θ ) D! N (0, I (θ )) 1 N N 2 log p (X i j θ) θ k θ l θ! I (θ ). θ i = 0 and AD () January / 35

34 Frequentist con dence intervals The MLE being approximately Gaussian, it is possible to de ne (approximate) con dence intervals. However be careful, θ has a 95% con dence interval of [a, b] does NOT mean that P ( θ 2 [a, b]j D) = It means P D 0 P ( jd ) θ 2 a D 0, b D 0 = 0.95 i.e., if we were to repeat the experiment, then 95% of the time, the true parameter would be in that interval. This does not mean that we believe it is in that interval given the actual data D we have observed. AD () January / 35

35 Bayesian statistics as an unifying approach In my opinion, the Bayesian approach is much simpler and more natural than classical statistics. Once you have de ned a Bayesian model, everything follows naturally. De ning a Bayesian model requires specifying a prior distribution, but this assumption is clearly stated. For many years, Bayesian statistics could not be implemented for all but the simple models but now numerous algorithms have been proposed to perform approximate inference. AD () January / 35

Machine Learning 4771

Machine Learning 4771 Instructor: Tony Jebara Topic 11 Maximum Likelihood as Bayesian Inference Maximum A Posteriori Bayesian Gaussian Estimation Why Maximum Likelihood? So far, assumed max (log) likelihood