Statistical Inference: Maximum Likelihood and Bayesian Approaches


Surya Tokdar

From model to inference. A statistical analysis begins by setting up a model {f(x | θ) : θ ∈ Θ} for data X. Next we observe our actual data X = x. The pdfs/pmfs included in the model represent theories; the observation x is the evidence. The goal of inference is to compare these theories in light of the recorded evidence. 1 / 23

Example: Opinion poll. n: # sampled = 500; X: # in favor. Model: X ~ Bin(n, p), p ∈ [0, 1]. Obs: X = 200. [Figure: the Bin(500, p) pmfs (theories) over data values 0 to 500, with the observed x marked.] 2 / 23

The likelihood function. Clearly the observed data will better match the predictions of some theories than others; in other words, some theories predict the particular observation better than other theories. Better prediction means assigning a higher probability to observing X = x. So we can assign scores to theories by this function of θ: L_x(θ) = f(x | θ), θ ∈ Θ. This is called the likelihood score/function. 3 / 23

Some words on the likelihood function. L_x(θ) is a function of θ ∈ Θ. It depends on the observed data x, but for any single data analysis x is a fixed quantity. L_x(θ_1)/L_x(θ_2) = 2 implies the observed data is two times more likely to appear under theory θ_1 than under theory θ_2. For all technical purposes, one can work with L_x(θ) on the log scale. That is, define the log-likelihood function l_x(θ) = log L_x(θ) = log f(x | θ), θ ∈ Θ. Log-scale comparisons are done by the difference l_x(θ_1) - l_x(θ_2). 4 / 23

Opinion poll likelihood. Model: X ~ Bin(n, p), p ∈ [0, 1]. So L_x(p) = (n choose x) p^x (1 - p)^(n - x), p ∈ [0, 1], and the log-likelihood function is l_x(p) = log (n choose x) + x log p + (n - x) log(1 - p), p ∈ [0, 1]. The first term on the r.h.s. does not involve p, so we write l_x(p) = const + x log p + (n - x) log(1 - p), p ∈ [0, 1], and do not care about the exact value of const. Indeed, const disappears in differences l_x(p_1) - l_x(p_2). 5 / 23
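As a quick illustration, here is a minimal sketch in Python (not part of the slides; it assumes scipy is available) that evaluates the opinion-poll log-likelihood at two theories and compares them on the log scale, using the n = 500, X = 200 data from the example.

# A minimal sketch (not from the slides): the opinion-poll log-likelihood
# evaluated at two theories and compared on the log scale.
from scipy.stats import binom

n, x = 500, 200                       # sample size and number in favor

def loglik(p):
    # l_x(p) = log C(n, x) + x log p + (n - x) log(1 - p)
    return binom.logpmf(x, n, p)

# A positive difference favors the first theory over the second.
print(loglik(0.40) - loglik(0.50))    # ~ +10: p = 0.4 is far better supported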

Graphs of likelihood and log-likelihood. [Figure: L_x(p) and l_x(p) plotted against p ∈ [0, 1] for the opinion-poll data.] 6 / 23

Learning from the likelihood function. Two goals: 1. to report a subset of attractive theories; 2. to test a scientific hypothesis θ ∈ Θ_0, where Θ_0 is a subset of Θ. These may not be the only or most important goals, but they capture the essence of inference; we will get into other goals later. Two approaches to using L_x(θ) or l_x(θ) to come up with and interpret such a subset: 1. the maximum likelihood (ML) approach, 2. the Bayesian approach. 7 / 23

The ML approach. Use L_x(θ) to split the parameter space into two subsets: 1. a subset of well supported θ with high L_x(θ); 2. a subset of not-so-well-supported θ with low L_x(θ). Such a split can be effected by 1. fixing a k ∈ [0, 1], 2. setting the first set as A_k(x) = {θ ∈ Θ : L_x(θ) ≥ k max_{θ' ∈ Θ} L_x(θ')}. Report support toward θ ∈ Θ_0 if Θ_0 intersects A_k(x). 8 / 23
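A minimal sketch (not from the slides) of this recipe: compute L_x(p) on a grid of theories and keep those within a factor k of the maximum, here for the n = 500, X = 200 case with k = 0.15.

# A minimal sketch (not from the slides): A_k(x) computed on a grid of theories.
import numpy as np
from scipy.stats import binom

n, x, k = 500, 200, 0.15
p_grid = np.linspace(0.001, 0.999, 999)
L = binom.pmf(x, n, p_grid)            # L_x(p) on the grid
A_k = p_grid[L >= k * L.max()]         # the well supported theories
print(A_k.min(), A_k.max())            # roughly [0.36, 0.44]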

Graphical representation of ML approach. [Figure: L_x(θ) against θ, with the MLE θ̂_MLE(x) at the peak, a horizontal cutoff at 0.1 × max_θ L_x(θ), and the resulting set A marked on the θ axis.] 9 / 23

Choice of the threshold k. With k = 1 we only report theories with the highest score, i.e., the set of maximum points of L_x(θ). Often there is a single point at which the maximum is attained; when this happens, the maximum point is called the maximum likelihood estimate (MLE) and is denoted θ̂_MLE(x). With k = 0 we report the whole set Θ, making no use of the data. 10 / 23
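Where no closed form is handy, the maximum point can be located numerically; this sketch (not from the slides, and it assumes scipy's bounded scalar minimizer) does so for the binomial case, where the answer is known to be x/n.

# Sketch: find the MLE by maximizing the log-likelihood numerically.
from scipy.optimize import minimize_scalar
from scipy.stats import binom

n, x = 500, 200
res = minimize_scalar(lambda p: -binom.logpmf(x, n, p),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)   # ~ 0.4, agreeing with the closed form x / n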

Important questions. How to choose k? Should the same k be chosen for all types of models? In the opinion poll example, should we use the same k when n = 50 as we do when n = 500? All of these boil down to one question: how do we interpret the choice of k quantitatively, and how do we communicate it to a reader? We will find an answer through the paradigm of classical statistics. 11 / 23

Back to opinion poll. Data X = number of students (out of n) in favor of a policy. Statistical model: {Bin(n, p) : p ∈ [0, 1]}. Three cases: n = 25, X = 10; n = 100, X = 40; n = 500, X = 200. [Figure: three panels against p, showing L_x(p), L_x(p)/max_p L_x(p), and l_x(p) - max_p l_x(p) for the three cases.] Same p̂_MLE(x) = x/n = 0.4 in all cases, but different A_k(x) with k = 0.15: [0.23, 0.59]; [0.31, 0.49]; [0.36, 0.44]. 12 / 23

More in favor than not? Any support toward θ ≥ 0.5? A_0.15(x) intersects Θ_0 = [0.5, 1] only for the case n = 25, X = 10. [Figure: l_x(p) - max_p l_x(p) against p, with Θ_0 = [0.5, 1] marked.] 13 / 23
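The same grid computation can carry out this check; the sketch below (not from the slides) recomputes A_0.15(x) for the three cases and asks whether it reaches into Θ_0 = [0.5, 1].

# Sketch: does the well supported set A_0.15(x) intersect Theta_0 = [0.5, 1]?
import numpy as np
from scipy.stats import binom

p = np.linspace(0.001, 0.999, 999)
for n, x in [(25, 10), (100, 40), (500, 200)]:
    L = binom.pmf(x, n, p)
    A = p[L >= 0.15 * L.max()]
    print(n, x, A.max() >= 0.5)    # True only in the n = 25 case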

The Bayesian approach. Convert L_x(θ) into a plausibility score on Θ. Uncertainty about any unknown quantity can be summarized by a pdf/pmf, and the parameter θ is one such quantity. We must have a pdf/pmf to describe θ before we observe data, and one to describe it after we make the observation. 14 / 23

Prior and posterior. Augment the model with a prior pdf π(θ) on Θ. π(θ) is the pre-data (a priori) quantification of one's uncertainty about θ, with relative plausibility scores given by π(θ_1)/π(θ_2). The post-data (a posteriori) relative plausibility scores are π(θ_1 | x)/π(θ_2 | x) = [π(θ_1)/π(θ_2)] × [L_x(θ_1)/L_x(θ_2)], and they correspond to the posterior pdf π(θ | x) = L_x(θ)π(θ) / ∫_Θ L_x(θ')π(θ')dθ', θ ∈ Θ. 15 / 23

From prior to posterior. The posterior formula is not ad hoc: it is driven by probability theory! A model {f(x | θ) : θ ∈ Θ}, coupled with the prior π(θ), gives a joint quantification of (X, θ): (X | θ) ~ f(x | θ) and θ ~ π(θ), i.e., (X, θ) ~ g(x, θ) = f(x | θ)π(θ), where g(x, θ) is a pdf over S × Θ. By Bayes' theorem, the conditional pdf of θ given X = x is π(θ | x) = g(x, θ) / ∫_Θ g(x, θ')dθ' = f(x | θ)π(θ) / ∫_Θ f(x | θ')π(θ')dθ', which is just as before because L_x(θ) = f(x | θ). 16 / 23
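For intuition, here is a minimal sketch (not from the slides) of this recipe on a grid: multiply the likelihood by the prior pointwise and normalize by a discretized version of the integral in the denominator, using the opinion-poll model with a Unif(0, 1) prior.

# Sketch: posterior on a grid, for the opinion poll with a Unif(0, 1) prior.
import numpy as np
from scipy.stats import binom

n, x = 500, 200
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                # Unif(0, 1) prior density
post = binom.pmf(x, n, theta) * prior      # L_x(theta) * pi(theta)
post /= post.sum() * dtheta                # approximate the normalizing integral
print((post * dtheta).sum())               # ~ 1, so post is a proper pdf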

Reporting a Bayesian analysis. π(θ | x) captures the entire post-data quantification of the uncertainty about θ. A report is essentially visual/numerical summaries of this pdf. A plot of π(θ | x), if available, is most useful. Numerical summaries include quantiles, mean, standard deviation, mode, high density regions, etc. The 0.025-th and 0.975-th quantiles give a 95% posterior range of θ. To evaluate evidence toward θ ∈ Θ_0, simply calculate Pr(θ ∈ Θ_0 | x) = ∫_{Θ_0} π(θ | x)dθ. 17 / 23

Back to opinion poll. For the opinion poll example, take π(θ) to be the Unif(0, 1) pdf. [Figure: π(p | x) against p for the three cases n = 25, 100, 500.] 95% posterior range: [0.24, 0.59]; [0.31, 0.50]; [0.36, 0.44]. Pr(θ ≥ 0.5 | x): 0.163, 0.023, 0.00000369. 18 / 23
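These summaries can be checked directly, since with a Unif(0, 1) prior and a Bin(n, p) likelihood the posterior is the Beta(x + 1, n - x + 1) pdf; a short sketch (not from the slides) follows.

# Sketch: posterior summaries via the Beta(x + 1, n - x + 1) posterior.
from scipy.stats import beta

for n, x in [(25, 10), (100, 40), (500, 200)]:
    post = beta(x + 1, n - x + 1)
    lo, hi = post.ppf([0.025, 0.975])     # 95% posterior range
    print(n, x, round(lo, 2), round(hi, 2), post.sf(0.5))   # Pr(p >= 0.5 | x)
# The printed values should be close to those quoted on the slide.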

A two parameter problem. Lactic acid concentrations X_1, ..., X_n measured from cheese samples. IID model: X_i ~ N(µ, σ²). Model parameters: µ ∈ (-∞, ∞), σ² > 0. We care only about µ: get a range for µ; is µ ≥ 1? 19 / 23

Why is this a problem? The likelihood function L_x(µ, σ²) compares (µ, σ²) pairs. How do we do this for µ alone? 20 / 23

ML approach: profile likelihood. ML reports a well supported set A_k of (µ, σ²) values. Look at all distinct values of µ that appear in A_k (paired with some σ²), and report this set. This is the same as doing the following: define the profile likelihood L_x(µ) = max_{σ²} L_x(µ, σ²); fix a threshold k ∈ [0, 1]; report A_k(x) = {µ : L_x(µ) ≥ k max_{µ'} L_x(µ')}. 21 / 23
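For the normal model the inner maximization has a closed form, σ̂²(µ) = (1/n) Σ (x_i - µ)², which the sketch below uses; the data are a hypothetical stand-in, since the cheese measurements are not reproduced on the slides.

# Sketch: profile log-likelihood of mu for the N(mu, sigma^2) model,
# with sigma^2 maximized out in closed form. Data are a hypothetical stand-in.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.4, 0.3, size=30)

def profile_loglik(mu):
    s2_hat = np.mean((x - mu) ** 2)          # maximizing sigma^2 for this mu
    return -0.5 * len(x) * (np.log(2 * np.pi * s2_hat) + 1)

mu_grid = np.linspace(x.mean() - 1, x.mean() + 1, 401)
lp = np.array([profile_loglik(m) for m in mu_grid])
A = mu_grid[lp >= lp.max() + np.log(0.15)]   # L >= 0.15 * max, on the log scale
print(A.min(), A.max())                      # profile-likelihood range for mu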

Bayes approach: marginal posterior pdf. A prior pdf π(µ, σ²) leads to the posterior pdf π(µ, σ² | x). But this describes a joint distribution of (µ, σ²) given X = x. Interested only in µ? Integrate out σ²: π*(µ | x) = ∫ π(µ, σ² | x)dσ², and summarize µ based on the marginal pdf π*(µ | x). 22 / 23

Integrated likelihood. The marginal prior pdf is π*(µ) = ∫ π(µ, σ²)dσ². The conditional prior pdf of σ² given µ is π(σ² | µ) = π(µ, σ²)/π*(µ). The marginal posterior pdf satisfies π*(µ | x) = L_x(µ)π*(µ) / ∫ L_x(µ')π*(µ')dµ', where L_x(µ) = ∫ L_x(µ, σ²) π(σ² | µ)dσ² is the integrated likelihood. Bayes: consider the average support for µ over all σ². ML: consider the maximum support for µ at the best σ². 23 / 23
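As a closing sketch (not from the slides), the marginal posterior of µ can be approximated by summing a grid posterior over σ² values; a flat prior on (µ, σ²) over the grid and the same hypothetical stand-in data as above are assumed here purely for illustration.

# Sketch: marginal posterior of mu by integrating sigma^2 out of a grid posterior.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.4, 0.3, size=30)            # hypothetical stand-in data
n, xbar, ss = len(x), x.mean(), np.sum((x - x.mean()) ** 2)

mu = np.linspace(xbar - 1, xbar + 1, 401)
s2 = np.linspace(1e-3, 1.0, 400)
MU, S2 = np.meshgrid(mu, s2, indexing="ij")

# log L_x(mu, sigma^2) written via the sufficient statistics (xbar, ss)
loglik = -0.5 * n * np.log(2 * np.pi * S2) - 0.5 * (n * (MU - xbar) ** 2 + ss) / S2
post = np.exp(loglik - loglik.max())         # proportional to pi(mu, sigma^2 | x)
marg = post.sum(axis=1)                      # integrate sigma^2 out (Riemann sum)
marg /= marg.sum() * (mu[1] - mu[0])         # normalize to a pdf in mu
print(mu[np.argmax(marg)])                   # marginal posterior mode, ~ xbar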