CS 540: Machine Learning Lecture 2: Review of Probability & Statistics

CS 540: Machine Learning
Lecture 2: Review of Probability & Statistics
AD, January 2008

Outline
Probability theory (PRML, Section 1.2)
Statistics (PRML, Sections 2.1-2.4)

Probability Basics
Begin with a set $\Omega$, the sample space; e.g. the 6 possible rolls of a die. Each $\omega \in \Omega$ is a sample point/possible event.
A probability space or probability model is a sample space together with an assignment $P(\omega)$ for every $\omega \in \Omega$ such that $0 \leq P(\omega) \leq 1$ and $\sum_{\omega \in \Omega} P(\omega) = 1$; e.g. for a die $P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6$.
An event $A$ is any subset of $\Omega$, with $P(A) = \sum_{\omega \in A} P(\omega)$; e.g. $P(\text{die roll} < 3) = P(1) + P(2) = 1/6 + 1/6 = 1/3$.
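A minimal Python sketch of this finite probability model for the die example; the dictionary representation and the `prob` helper are our own illustration, not part of the lecture.

```python
# Minimal sketch (assumed example): a finite probability model for a fair die,
# with events represented as subsets of the sample space.
omega = {1, 2, 3, 4, 5, 6}
P = {w: 1 / 6 for w in omega}          # each P(w) >= 0, and the values sum to 1

def prob(event):
    """P(A) = sum of P(w) over sample points w in the event A."""
    return sum(P[w] for w in event if w in P)

print(prob({w for w in omega if w < 3}))   # P(die roll < 3) = 1/3
print(prob({w for w in omega if w % 2}))   # P(Odd = true) = 1/2 (next slide's example)
```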

Random Variables
A random variable is, loosely speaking, a function from sample points to some range, e.g. the reals.
$P$ induces a probability distribution for any r.v. $X$:
$P(X = x) = \sum_{\{\omega \in \Omega :\, X(\omega) = x\}} P(\omega)$;
e.g. $P(\text{Odd} = \text{true}) = P(1) + P(3) + P(5) = 1/6 + 1/6 + 1/6 = 1/2$.

Why use probability?
"Probability theory is nothing but common sense reduced to calculation." Pierre Laplace, 1812.
In 1931, de Finetti proved that it is irrational to have beliefs that violate these axioms, in the following sense: if you bet in accordance with your beliefs, but your beliefs violate the axioms, then you can be guaranteed to lose money to an opponent whose beliefs more accurately reflect the true state of the world. (Here, betting and money are proxies for decision making and utilities.)
What if you refuse to bet? This is like refusing to allow time to pass: every action (including inaction) is a bet.
Various other arguments exist (Cox, Savage).

Probability for continuous variables
We typically do not work with probability distributions but with densities. We have
$P(X \in A) = \int_A p_X(x)\, dx$, with $\int_\Omega p_X(x)\, dx = 1$,
so that
$P(X \in [x, x + \delta x]) \approx p_X(x)\, \delta x$.
Warning: a density evaluated at a point can be bigger than 1.

Univariate Gaussian (Normal) density
If $X \sim \mathcal{N}(\mu, \sigma^2)$ then the pdf of $X$ is defined as
$p_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$.
We often use the precision $\lambda = 1/\sigma^2$ instead of the variance.
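A small sketch evaluating this density numerically (the helper `gaussian_pdf` is our own, assumed for illustration); the second call also illustrates the warning on the previous slide that a density can exceed 1.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density N(x; mu, sigma^2)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

print(gaussian_pdf(0.0, 0.0, 1.0))    # ~0.3989
print(gaussian_pdf(0.0, 0.0, 0.01))   # ~3.989: a density evaluated at a point can exceed 1
```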

Statistics of a distribution
Recall that the mean and variance of a distribution are defined as
$E(X) = \int x\, p_X(x)\, dx$,
$V(X) = E\left[(X - E(X))^2\right] = \int (x - E(X))^2\, p_X(x)\, dx$.
For a Gaussian distribution, we have $E(X) = \mu$, $V(X) = \sigma^2$.
Chebyshev's inequality: $P(|X - E(X)| \geq \varepsilon) \leq \frac{V(X)}{\varepsilon^2}$.

Cumulative distribution function
The cumulative distribution function is defined as
$F_X(x) = P(X \leq x) = \int_{-\infty}^{x} p_X(u)\, du$.
We have $\lim_{x \to -\infty} F_X(x) = 0$ and $\lim_{x \to +\infty} F_X(x) = 1$.

Law of large numbers
Assume you have $X_i \overset{\text{i.i.d.}}{\sim} p_X(\cdot)$; then
$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \varphi(X_i) = E[\varphi(X)] = \int \varphi(x)\, p_X(x)\, dx$.
The result is still valid for weakly dependent random variables.
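An illustrative Monte Carlo check of this statement (our own example, assuming $X \sim \mathcal{N}(0, 1)$ and $\varphi(x) = x^2$, so the exact expectation is 1):

```python
import random

# Law of large numbers sketch: (1/n) * sum(phi(X_i)) approaches E[phi(X)]
# for X ~ N(0, 1) and phi(x) = x^2, whose exact expectation is 1.
random.seed(0)
for n in (100, 10_000, 1_000_000):
    xs = (random.gauss(0.0, 1.0) for _ in range(n))
    print(n, sum(x * x for x in xs) / n)
```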

Central limit theorem
Let $X_i \overset{\text{i.i.d.}}{\sim} p_X(\cdot)$ with $E[X_i] = \mu$ and $V(X_i) = \sigma^2$; then
$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right) \overset{D}{\to} \mathcal{N}(0, \sigma^2)$.
This result is (too) often used to justify the Gaussian assumption.
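A quick simulation sketch of the theorem (our own example with Uniform(0, 1) draws, for which $\mu = 1/2$ and $\sigma^2 = 1/12$): the standardized sample means should have mean roughly 0 and standard deviation roughly 1.

```python
import random, statistics

# CLT sketch: standardized sample means of Uniform(0, 1) draws look ~ N(0, 1).
random.seed(0)
n, reps = 50, 20_000
mu, sigma = 0.5, (1 / 12) ** 0.5
z = [(sum(random.random() for _ in range(n)) / n - mu) * n ** 0.5 / sigma
     for _ in range(reps)]
print(statistics.mean(z), statistics.stdev(z))   # roughly 0 and 1
```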

Delta method
Assume we have
$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right) \overset{D}{\to} \mathcal{N}(0, \sigma^2)$.
Then we have
$\sqrt{n}\left(g\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) - g(\mu)\right) \overset{D}{\to} \mathcal{N}\left(0, g'(\mu)^2 \sigma^2\right)$.
This follows from the Taylor expansion
$g\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) \approx g(\mu) + g'(\mu)\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right) + o_P\!\left(\frac{1}{\sqrt{n}}\right)$.

Prior probability
Consider the following random variables: $\text{Sick} \in \{\text{yes}, \text{no}\}$ and $\text{Weather} \in \{\text{sunny}, \text{rain}, \text{cloudy}, \text{snow}\}$.
We can set a prior probability on these random variables; e.g.
$P(\text{Sick} = \text{yes}) = 1 - P(\text{Sick} = \text{no}) = 0.1$, $P(\text{Weather}) = (0.72, 0.1, 0.08, 0.1)$.
Obviously we cannot answer questions like $P(\text{Sick} = \text{yes} \mid \text{Weather} = \text{rainy})$ as long as we do not define an appropriate joint distribution.

Joint probability distributions
We need to assign a probability to each sample point. For (Weather, Sick), we have to consider $4 \times 2$ events; e.g.

Sick \ Weather   sunny   rainy   cloudy   snow
yes              0.144   0.02    0.016    0.02
no               0.576   0.08    0.064    0.08

From this joint distribution, we can compute probabilities of the form $P(\text{Sick} = \text{yes} \mid \text{Weather} = \text{rainy})$.

Bayes rule
We have $P(x, y) = P(x \mid y)\, P(y) = P(y \mid x)\, P(x)$.
It follows that
$P(x \mid y) = \frac{P(x, y)}{P(y)} = \frac{P(y \mid x)\, P(x)}{\sum_{x'} P(y \mid x')\, P(x')}$.
We have
$P(\text{Sick} = \text{yes} \mid \text{Weather} = \text{rainy}) = \frac{P(\text{Sick} = \text{yes}, \text{Weather} = \text{rainy})}{P(\text{Weather} = \text{rainy})} = \frac{0.02}{0.02 + 0.08} = 0.2$.
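A short sketch reproducing this calculation from the joint table on the previous slide (the dictionary encoding is our own):

```python
# Bayes rule on the (Weather, Sick) joint table from the previous slide.
joint = {
    ("sunny", "yes"): 0.144, ("rainy", "yes"): 0.02,
    ("cloudy", "yes"): 0.016, ("snow", "yes"): 0.02,
    ("sunny", "no"): 0.576, ("rainy", "no"): 0.08,
    ("cloudy", "no"): 0.064, ("snow", "no"): 0.08,
}
p_rainy = sum(p for (w, s), p in joint.items() if w == "rainy")   # marginal P(Weather = rainy)
p_sick_given_rainy = joint[("rainy", "yes")] / p_rainy            # conditioning = renormalizing
print(p_sick_given_rainy)   # 0.2
```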

Example
Bayes rule is useful for assessing a diagnostic probability from a causal probability:
$P(\text{Cause} \mid \text{Effect}) = \frac{P(\text{Effect} \mid \text{Cause})\, P(\text{Cause})}{P(\text{Effect})}$.
Let $M$ be meningitis and $S$ be stiff neck:
$P(m \mid s) = \frac{P(s \mid m)\, P(m)}{P(s)} = \frac{0.8 \times 0.0001}{0.1} = 0.0008$.
Be careful about the subtle exchange of $P(A \mid B)$ for $P(B \mid A)$.

Example: Prosecutor's Fallacy
A zealous prosecutor has collected evidence and has an expert testify that the probability of finding this evidence if the accused were innocent is one in a million. The prosecutor concludes that the probability of the accused being innocent is one in a million. This is WRONG.

Assume no other evidence is available and the population is of 10 million people.
Defining $A$ = "the accused is guilty", we have $P(A) = 10^{-7}$.
Defining $B$ = "finding this evidence", we have $P(B \mid A) = 1$ and $P(B \mid \bar{A}) = 10^{-6}$.
Bayes' formula yields
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B \mid A)\, P(A) + P(B \mid \bar{A})\, P(\bar{A})} = \frac{10^{-7}}{10^{-7} + 10^{-6}\,(1 - 10^{-7})} \approx 0.1$.
Real-life example: Sally Clark was convicted in the UK (the Royal Statistical Society pointed out the mistake). Her conviction was eventually quashed (on other grounds).

You feel sick and your GP thinks you might have contracted a rare disease (0.01% of the population has the disease). A test is available but not perfect: if a tested patient has the disease, the test is positive 100% of the time; if a tested patient does not have the disease, the test is negative 95% of the time (5% false positives). Your test is positive: should you really care?

Let $A$ be the event that the patient has the disease and $B$ the event that the test returns a positive result:
$P(A \mid B) = \frac{1 \times 0.0001}{1 \times 0.0001 + 0.05 \times 0.9999} \approx 0.002$.
Such a test would be a complete waste of money for you or the National Health Service.
A similar question was asked to 60 students and staff at Harvard Medical School: 18% got the right answer; the modal response was 95%!
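The same calculation as a small Python sketch; the helper `posterior` and its argument names are ours, not from the lecture.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(disease | positive test) via Bayes rule (hypothetical helper)."""
    evidence = (p_pos_given_disease * prior
                + p_pos_given_healthy * (1.0 - prior))
    return p_pos_given_disease * prior / evidence

print(posterior(0.0001, 1.0, 0.05))   # ~0.002: a positive test is weak evidence here
```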

General Bayesian framework
Assume you have some unknown parameter $\theta$ and some data $D$. The unknown parameter is now considered RANDOM, with prior density $p(\theta)$.
The likelihood of the observations $D$ is $p(D \mid \theta)$.
The posterior is given by
$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$, where $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$.
$p(D)$ is called the marginal likelihood or evidence.
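A minimal sketch of this prior-to-posterior computation on a grid, for a toy coin-flipping example of our own (uniform prior on $\theta$, data $D$ = 7 heads in 10 tosses); the integral defining $p(D)$ is approximated by a sum.

```python
# Grid approximation of p(theta | D) for a toy example (our own, not from the lecture):
# theta is a coin's head probability, D = 7 heads in 10 tosses, uniform prior on [0, 1].
dx = 1 / 200
grid = [(i + 0.5) * dx for i in range(200)]        # discretized theta values in (0, 1)
prior = [1.0] * len(grid)                          # uniform prior density
lik = [t ** 7 * (1 - t) ** 3 for t in grid]        # p(D | theta), up to a binomial constant
evidence = sum(l * p * dx for l, p in zip(lik, prior))       # approximates p(D)
post = [l * p / evidence for l, p in zip(lik, prior)]        # posterior density on the grid
print(grid[post.index(max(post))])                 # posterior mode, near 0.7
```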

Bayesian interpretation of probability
In the Bayesian approach, probability describes degrees of belief instead of limiting frequencies.
The selection of a prior has an obvious impact on the inference results! However, Bayesian statisticians are honest about it.

Bayesian inference involves passing from a prior $p(\theta)$ to a posterior $p(\theta \mid D)$. We might expect that, because the posterior incorporates the information from the data, it will be less variable than the prior.
We have the following identities:
$E[\theta] = E[E[\theta \mid D]]$,
$V[\theta] = E[V[\theta \mid D]] + V[E[\theta \mid D]]$.
This means that, on average (over the realizations of the data $D$), we expect the conditional expectation $E[\theta \mid D]$ to be equal to $E[\theta]$, and the posterior variance to be on average smaller than the prior variance, by an amount that depends on the variation of the posterior mean over the distribution of possible data.

If $(\theta, D)$ are two scalar random variables then we have
$V[\theta] = E[V[\theta \mid D]] + V[E[\theta \mid D]]$.
Proof:
$V[\theta] = E[\theta^2] - E[\theta]^2$
$\quad = E[E[\theta^2 \mid D]] - (E[E[\theta \mid D]])^2$
$\quad = E\left[E[\theta^2 \mid D] - E[\theta \mid D]^2\right] + E\left[E[\theta \mid D]^2\right] - (E[E[\theta \mid D]])^2$
$\quad = E[V[\theta \mid D]] + V[E[\theta \mid D]]$.
Such results appear attractive, but one should be careful: there is an underlying assumption that the observations are indeed distributed according to $p(D)$.

Frequentist interpretation of probability
Probabilities are objective properties of the real world and refer to limiting relative frequencies (e.g. the number of times I have observed heads).
Problem: how do you attribute a probability to an event like "there will be a major earthquake in Tokyo on 27 April 2013"?
Hence, in most cases $\theta$ is assumed unknown but fixed. We then typically compute point estimates of the form $\hat{\theta} = \varphi(D)$, which are designed to have various desirable properties when averaged over future data $D'$ (assumed to be drawn from the true distribution).

Maximum Likelihood Estimation
Suppose we have $X_i \overset{\text{i.i.d.}}{\sim} p(\cdot \mid \theta)$ for $i = 1, \ldots, N$; then
$L(\theta) = p(D \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$.
The MLE is defined as
$\hat{\theta} = \arg\max_\theta L(\theta) = \arg\max_\theta \ell(\theta)$,
where $\ell(\theta) = \log L(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta)$.

MLE for 1D Gaussian
Consider $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, \sigma^2)$; then
$p(D \mid \mu, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}(x_i; \mu, \sigma^2)$,
so
$\ell(\theta) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2$.
Setting $\frac{\partial \ell(\theta)}{\partial \mu} = \frac{\partial \ell(\theta)}{\partial \sigma^2} = 0$ yields
$\hat{\mu}_{ML} = \frac{1}{N}\sum_{i=1}^{N} x_i$, $\quad \hat{\sigma}^2_{ML} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \hat{\mu}_{ML})^2$.
We have $E[\hat{\mu}_{ML}] = \mu$ and $E[\hat{\sigma}^2_{ML}] = \frac{N-1}{N}\sigma^2$.
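A simulation sketch (our own example) checking the bias statement above: with $N = 5$, the average of $\hat{\sigma}^2_{ML}$ over many datasets should be close to $\frac{N-1}{N}\sigma^2$.

```python
import random, statistics

# Gaussian MLE bias check: sigma^2_ML uses divisor N, so its expectation is (N-1)/N * sigma^2.
random.seed(0)
mu, sigma2, N = 2.0, 4.0, 5
est = []
for _ in range(50_000):
    x = [random.gauss(mu, sigma2 ** 0.5) for _ in range(N)]
    mu_ml = sum(x) / N
    est.append(sum((xi - mu_ml) ** 2 for xi in x) / N)
print(statistics.mean(est), (N - 1) / N * sigma2)   # both close to 3.2
```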

Consistent estimators
Definition. A sequence of estimators $\hat{\theta}_N = \hat{\theta}_N(X_1, \ldots, X_N)$ is consistent for the parameter $\theta$ if, for every $\varepsilon > 0$ and every $\theta \in \Theta$,
$\lim_{N \to \infty} P_\theta\!\left(|\hat{\theta}_N - \theta| < \varepsilon\right) = 1$ (equivalently $\lim_{N \to \infty} P_\theta\!\left(|\hat{\theta}_N - \theta| \geq \varepsilon\right) = 0$).
Example: consider $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\mu, 1)$ and $\hat{\mu}_N = \frac{1}{N}\sum_{i=1}^{N} X_i$; then $\hat{\mu}_N \sim \mathcal{N}\!\left(\mu, \frac{1}{N}\right)$ and
$P_\theta\!\left(|\hat{\theta}_N - \theta| < \varepsilon\right) = \int_{-\varepsilon\sqrt{N}}^{\varepsilon\sqrt{N}} \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{u^2}{2}\right) du \to 1$.
It is possible to avoid this calculation and use instead Chebyshev's inequality:
$P_\theta\!\left(|\hat{\theta}_N - \theta| \geq \varepsilon\right) = P_\theta\!\left(|\hat{\theta}_N - \theta|^2 \geq \varepsilon^2\right) \leq \frac{E_\theta\!\left[(\hat{\theta}_N - \theta)^2\right]}{\varepsilon^2}$,
where $E_\theta\!\left[(\hat{\theta}_N - \theta)^2\right] = \mathrm{var}_\theta(\hat{\theta}_N) + \left(E_\theta[\hat{\theta}_N] - \theta\right)^2$.
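An illustrative simulation of the definition for the sample-mean example (our own code): the probability that $\hat{\mu}_N$ falls within $\varepsilon$ of $\mu$ approaches 1 as $N$ grows.

```python
import random

# Consistency sketch: for X_i ~ N(mu, 1) and the sample mean,
# P(|mu_hat_N - mu| < eps) rises towards 1 as N grows.
random.seed(0)
mu, eps, reps = 0.0, 0.1, 5_000
for N in (10, 100, 1_000):
    hits = sum(abs(sum(random.gauss(mu, 1.0) for _ in range(N)) / N - mu) < eps
               for _ in range(reps))
    print(N, hits / reps)   # roughly 0.25, 0.68, 1.00
```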

Example of an inconsistent MLE (Fisher)
Let $(X_i, Y_i) \sim \mathcal{N}\!\left(\begin{pmatrix}\mu_i \\ \mu_i\end{pmatrix}, \begin{pmatrix}\sigma^2 & 0 \\ 0 & \sigma^2\end{pmatrix}\right)$ independently for $i = 1, \ldots, n$.
The likelihood function is given by
$L(\theta) = \frac{1}{(2\pi\sigma^2)^n} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[(x_i - \mu_i)^2 + (y_i - \mu_i)^2\right]\right)$,
so
$\ell(\theta) = \text{cste} - n\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left[2\left(\frac{x_i + y_i}{2} - \mu_i\right)^2 + \frac{(x_i - y_i)^2}{2}\right]$.
We obtain
$\hat{\mu}_i = \frac{x_i + y_i}{2}$, $\quad \hat{\sigma}^2 = \frac{\sum_{i=1}^{n}(x_i - y_i)^2}{4n} \to \frac{\sigma^2}{2}$.
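A simulation sketch of this example (the nuisance means drawn uniformly are our own choice): the MLE of $\sigma^2$ settles near $\sigma^2/2$ rather than $\sigma^2$.

```python
import random

# Fisher's (Neyman-Scott) example: with one mean per pair, the MLE of sigma^2
# converges to sigma^2 / 2 instead of sigma^2.
random.seed(0)
sigma2, n = 1.0, 100_000
acc = 0.0
for i in range(n):
    mu_i = random.uniform(-5, 5)                 # arbitrary nuisance means (our choice)
    x = random.gauss(mu_i, sigma2 ** 0.5)
    y = random.gauss(mu_i, sigma2 ** 0.5)
    acc += (x - y) ** 2
print(acc / (4 * n))   # close to sigma2 / 2 = 0.5
```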

Consistency of the MLE
Kullback-Leibler distance: for any densities $f, g$,
$D(f, g) = \int f(x) \log\frac{f(x)}{g(x)}\, dx$.
We have $D(f, g) \geq 0$ and $D(f, f) = 0$. Indeed, using $\log u \leq u - 1$,
$-D(f, g) = \int f(x) \log\frac{g(x)}{f(x)}\, dx \leq \int f(x)\left(\frac{g(x)}{f(x)} - 1\right) dx = 0$.
$D(f, g)$ is a very useful distance and appears in many different contexts.

Sketch of consistency of the MLE
Assume the pdfs $p(x \mid \theta)$ have common support for all $\theta$ and $p(x \mid \theta) \neq p(x \mid \theta')$ for $\theta \neq \theta'$; i.e. $S_\theta = \{x : p(x \mid \theta) > 0\}$ is independent of $\theta$.
Denote
$M_N(\theta) = \frac{1}{N}\sum_{i=1}^{N} \log\frac{p(X_i \mid \theta)}{p(X_i \mid \theta^*)}$.
As the MLE $\hat{\theta}_N$ maximizes $L(\theta)$, it also maximizes $M_N(\theta)$.
Assume $X_i \overset{\text{i.i.d.}}{\sim} p(\cdot \mid \theta^*)$. Note that by the law of large numbers $M_N(\theta)$ converges to
$E_{\theta^*}\!\left[\log\frac{p(X \mid \theta)}{p(X \mid \theta^*)}\right] = \int p(x \mid \theta^*)\log\frac{p(x \mid \theta)}{p(x \mid \theta^*)}\, dx = -D(p(\cdot \mid \theta^*), p(\cdot \mid \theta)) := M(\theta)$.
Hence $M_N(\theta) \to -D(p(\cdot \mid \theta^*), p(\cdot \mid \theta))$, which is maximized at $\theta = \theta^*$, so we expect its maximizer $\hat{\theta}_N$ to converge towards $\theta^*$.

Asymptotic Normality
Assuming $\hat{\theta}_N$ is a consistent estimate of $\theta^*$, we have under regularity assumptions
$\sqrt{N}\left(\hat{\theta}_N - \theta^*\right) \Rightarrow \mathcal{N}\!\left(0, [I(\theta^*)]^{-1}\right)$,
where
$[I(\theta)]_{k,l} = -E_\theta\!\left[\frac{\partial^2 \log p(x \mid \theta)}{\partial\theta_k\, \partial\theta_l}\right]$.
We can estimate $I(\theta^*)$ through
$[\hat{I}(\theta^*)]_{k,l} = -\frac{1}{N}\sum_{i=1}^{N} \left.\frac{\partial^2 \log p(X_i \mid \theta)}{\partial\theta_k\, \partial\theta_l}\right|_{\theta = \hat{\theta}_N}$.

Sketch of the proof
We have
$0 = \ell'(\hat{\theta}_N) \approx \ell'(\theta^*) + (\hat{\theta}_N - \theta^*)\,\ell''(\theta^*)$
$\Rightarrow \sqrt{N}\left(\hat{\theta}_N - \theta^*\right) = \frac{\frac{1}{\sqrt{N}}\,\ell'(\theta^*)}{-\frac{1}{N}\,\ell''(\theta^*)}$.
Now $\ell'(\theta^*) = \sum_{i=1}^{N} \left.\frac{\partial \log p(X_i \mid \theta)}{\partial\theta}\right|_{\theta^*}$, where $E\!\left[\left.\frac{\partial \log p(X_i \mid \theta)}{\partial\theta}\right|_{\theta^*}\right] = 0$ and $\mathrm{var}\!\left(\left.\frac{\partial \log p(X_i \mid \theta)}{\partial\theta}\right|_{\theta^*}\right) = I(\theta^*)$, so the CLT tells us that
$\frac{1}{\sqrt{N}}\,\ell'(\theta^*) \overset{D}{\to} \mathcal{N}(0, I(\theta^*))$,
while by the law of large numbers
$-\frac{1}{N}\,\ell''(\theta^*) = -\frac{1}{N}\sum_{i=1}^{N}\left.\frac{\partial^2 \log p(X_i \mid \theta)}{\partial\theta^2}\right|_{\theta^*} \to I(\theta^*)$.

Frequentist confidence intervals
The MLE being approximately Gaussian, it is possible to define (approximate) confidence intervals.
However, be careful: "$\theta$ has a 95% confidence interval of $[a, b]$" does NOT mean that $P(\theta \in [a, b] \mid D) = 0.95$.
It means $P_{D'}\!\left(\theta \in [a(D'), b(D')]\right) = 0.95$, i.e., if we were to repeat the experiment, then 95% of the time the true parameter would be in the (random) interval. This does not mean that we believe it is in that interval given the actual data $D$ we have observed.
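A coverage simulation sketch (our own toy example, $X_i \sim \mathcal{N}(\mu, 1)$ with the exact interval $\hat{\mu} \pm 1.96/\sqrt{N}$) illustrating what the 95% statement means across repeated experiments:

```python
import random

# Coverage check: the interval mu_hat +/- 1.96/sqrt(N) contains the true mu
# in about 95% of repeated experiments (it says nothing about any single dataset).
random.seed(0)
mu, N, reps = 3.0, 25, 20_000
cover = 0
for _ in range(reps):
    m = sum(random.gauss(mu, 1.0) for _ in range(N)) / N
    half = 1.96 / N ** 0.5
    cover += (m - half <= mu <= m + half)
print(cover / reps)   # about 0.95
```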

Bayesian statistics as a unifying approach
In my opinion, the Bayesian approach is much simpler and more natural than classical statistics. Once you have defined a Bayesian model, everything follows naturally.
Defining a Bayesian model requires specifying a prior distribution, but this assumption is clearly stated.
For many years, Bayesian statistics could not be implemented for all but the simplest models, but numerous algorithms have now been proposed to perform approximate inference.