Data Analysis and Uncertainty Part 2: Estimation


Data Analysis and Uncertainty Part 2: Estimation
Instructor: Sargur N. Srihari
University at Buffalo, The State University of New York
srihari@cedar.buffalo.edu

Topics in Estimation
1. Estimation
2. Desirable Properties of Estimators
3. Maximum Likelihood Estimation (examples: Binomial, Normal)
4. Bayesian Estimation (examples: Binomial, Normal; Jeffreys Prior)

Estimation
In inference we want to make statements about the entire population from which the sample is drawn. The two most important methods for estimating the parameters of a model are:
1. Maximum Likelihood Estimation
2. Bayesian Estimation

Desirable Properties of Estimators
Let $\hat{\theta}$ be an estimate of the parameter $\theta$. Two measures of the quality of the estimator $\hat{\theta}$:
1. Bias (expected value of the estimate): the difference between the expected and the true value, $\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$. It measures systematic departure from the true value.
2. Variance of the estimate: $\mathrm{Var}(\hat{\theta}) = E[(\hat{\theta} - E[\hat{\theta}])^2]$, where the expectation is over all possible data sets of size n. This is the data-driven component of the error in the estimation procedure.
E.g., always saying $\hat{\theta} = 1$ has zero variance but high bias. The mean squared error $E[(\hat{\theta} - \theta)^2]$ can be partitioned as the sum of bias$^2$ and variance.

Bias-Variance in a Point Estimate
True height of the Chinese emperor: 200 cm, about 6'6". Poll a random American and ask "How tall is the emperor?" We want to determine how wrong they are, on average. Each scenario below has an expected answer of 180 (a bias error of -20), but increasing variance in the estimate. The slide illustrates three cases: bias with no variance, bias with some variance, and bias with more variance around the true value of 200.
Scenario 1: Everyone believes it is 180 (variance = 0). The answer is always 180, so the error is always -20. The average squared error is 400: 400 = 400 + 0.
Scenario 2: Normally distributed beliefs with mean 180 and standard deviation 10 (variance 100). Poll two people: one says 190, the other 170. The errors are -10 and -30, so the average error is -20. The squared errors are 100 and 900, so the average squared error is 500: 500 = 400 + 100.
Scenario 3: Normally distributed beliefs with mean 180 and standard deviation 20 (variance 400). Poll two people: one says 200, the other 160. The errors are 0 and -40, so the average error is -20. The squared errors are 0 and 1600, so the average squared error is 800: 800 = 400 + 400.
Squared error = square of the bias + variance. As the variance increases, the error increases.

Mean Squared Error as a Criterion
The mean squared error has a natural decomposition as the sum of the squared bias and the variance:
$$E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2] = (E[\hat{\theta}] - \theta)^2 + E[(\hat{\theta} - E[\hat{\theta}])^2] = (\mathrm{Bias}(\hat{\theta}))^2 + \mathrm{Var}(\hat{\theta})$$
The mean squared error (over data sets) is a useful criterion since it incorporates both bias and variance.
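
As a quick sanity check of this decomposition, the following short simulation (a sketch, not from the slides; it assumes NumPy is available and uses a deliberately biased estimator chosen for illustration) repeatedly draws data sets of size n, applies the estimator, and compares the Monte Carlo MSE with squared bias plus variance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_theta, n, n_datasets = 5.0, 20, 100_000

# Draw many data sets of size n and apply a deliberately biased estimator:
# shrink the sample mean toward zero by a factor of 0.9.
data = rng.normal(true_theta, 1.0, size=(n_datasets, n))
estimates = 0.9 * data.mean(axis=1)

mse = np.mean((estimates - true_theta) ** 2)
bias = estimates.mean() - true_theta
variance = estimates.var()

print(f"MSE          = {mse:.4f}")
print(f"bias^2 + var = {bias**2 + variance:.4f}")  # matches the MSE
```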

Maximum Likelihood Estimation
The most widely used method for parameter estimation. The likelihood function is the probability that the data D would have arisen for a given value of $\theta$:
$$L(\theta \mid D) = L(\theta \mid x(1), \ldots, x(n)) = p(x(1), \ldots, x(n) \mid \theta) = \prod_{i=1}^{n} p(x(i) \mid \theta)$$
It is a scalar function of $\theta$. The value of $\theta$ for which the data has the highest probability is the maximum likelihood estimate (MLE).

Example of MLE for Binomial
Customers either purchase or do not purchase milk, and we want an estimate of the proportion purchasing: a binomial with unknown parameter $\theta$. The binomial is a generalization of the Bernoulli.
Bernoulli: the probability of a binary variable x = 1 or 0. Denoting $p(x{=}1) = \theta$, the probability mass function is $\mathrm{Bern}(x \mid \theta) = \theta^{x}(1-\theta)^{1-x}$, with mean $\theta$ and variance $\theta(1-\theta)$.
Binomial: the probability of r successes in n trials, $\mathrm{Bin}(r \mid n, \theta) = \binom{n}{r}\theta^{r}(1-\theta)^{n-r}$, with mean $n\theta$ and variance $n\theta(1-\theta)$.

Likelihood Function: Binomial/Bernoulli
Samples x(1), ..., x(1000), of which r purchase milk. Assuming conditional independence, the likelihood function is
$$L(\theta \mid x(1), \ldots, x(1000)) = \prod_i \theta^{x(i)}(1-\theta)^{1-x(i)} = \theta^{r}(1-\theta)^{1000-r}$$
(The binomial pmf also includes the factor $\binom{n}{r}$, which counts every possible way of getting r successes; it does not depend on $\theta$.) The log-likelihood function is
$$l(\theta) = \log L(\theta) = r\log\theta + (1000 - r)\log(1-\theta)$$
Differentiating and setting equal to zero gives $\hat{\theta}_{ML} = r/1000$.
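
A minimal check of this result (not from the slides; it assumes NumPy is available): simulate 1000 Bernoulli purchases, compute the closed-form MLE r/n, and confirm that it maximizes the log-likelihood on a grid.

```python
import numpy as np

rng = np.random.default_rng(1)
n, true_theta = 1000, 0.3
x = rng.binomial(1, true_theta, size=n)   # 1 = purchased milk, 0 = did not
r = x.sum()

theta_mle = r / n                          # closed-form MLE derived above

# Check against a grid search over the log-likelihood.
grid = np.linspace(1e-4, 1 - 1e-4, 10_000)
loglik = r * np.log(grid) + (n - r) * np.log(1 - grid)
theta_grid = grid[np.argmax(loglik)]

print(f"r = {r}, closed-form MLE = {theta_mle:.4f}, grid maximizer = {theta_grid:.4f}")
```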

Binomial: Likelihood Functions
Likelihood functions for three data sets: r = 7, n = 10; r = 70, n = 100; r = 700, n = 1000. The data follow a binomial distribution: r milk purchases out of n customers, where $\theta$ is the probability that milk is purchased by a random customer. The uncertainty becomes smaller as n increases.
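
The sketch below (assuming NumPy; not part of the original slides) evaluates the three normalized likelihood curves on a grid and reports an approximate width for each, illustrating how the likelihood concentrates around 0.7 as n grows.

```python
import numpy as np

theta = np.linspace(1e-4, 1 - 1e-4, 5000)
dtheta = theta[1] - theta[0]
for r, n in [(7, 10), (70, 100), (700, 1000)]:
    # Work in log space for numerical stability, then normalize the curve
    # so it integrates to 1 over [0, 1].
    loglik = r * np.log(theta) + (n - r) * np.log(1 - theta)
    lik = np.exp(loglik - loglik.max())
    lik /= lik.sum() * dtheta
    above_half = theta[lik > lik.max() / 2]
    print(f"r={r:4d}, n={n:5d}: peak at {theta[lik.argmax()]:.3f}, "
          f"half-max width {above_half[-1] - above_half[0]:.3f}")
```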

Likelihood under a Normal Distribution
Unit variance, unknown mean. The likelihood function is
$$L(\theta \mid x(1), \ldots, x(n)) = \prod_{i=1}^{n} (2\pi)^{-1/2} \exp\!\left(-\tfrac{1}{2}(x(i)-\theta)^2\right) = (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2}\sum_{i=1}^{n}(x(i)-\theta)^2\right)$$
The log-likelihood function is
$$l(\theta \mid x(1), \ldots, x(n)) = -\tfrac{n}{2}\log 2\pi - \tfrac{1}{2}\sum_{i=1}^{n}(x(i)-\theta)^2$$
To find the MLE, set the derivative $d/d\theta$ to zero, giving $\hat{\theta}_{ML} = \sum_i x(i)/n$.
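
A brief numerical confirmation (a sketch assuming NumPy; the sample size follows the next slide's example): the grid maximizer of the log-likelihood coincides with the sample mean.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=20)          # unit-variance data with unknown mean

grid = np.linspace(-2, 2, 20_001)
# Log-likelihood up to the constant -(n/2) log(2*pi), which does not affect the maximizer.
loglik = -0.5 * ((x[:, None] - grid[None, :]) ** 2).sum(axis=0)

print(f"sample mean    = {x.mean():.4f}")
print(f"grid maximizer = {grid[loglik.argmax()]:.4f}")
```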

Normal: Histogram, Likelihood, Log-Likelihood
Estimate the unknown mean $\theta$. The slide shows a histogram of 20 data points drawn from a zero-mean, unit-variance normal, together with the resulting likelihood function and log-likelihood function.

Normal: More Data Points
The same plots for a histogram of 200 data points drawn from a zero-mean, unit-variance normal, with the corresponding likelihood function and log-likelihood function.

Sufficient Statistics
A useful general concept in statistical estimation. A quantity s(D) is a sufficient statistic for $\theta$ if the likelihood $l(\theta)$ depends on the data only through s(D).
Examples: for the binomial parameter $\theta$, the number of successes r is sufficient; for the mean of a normal distribution, the sum of the observations $\sum_i x(i)$ is sufficient for the likelihood function of the mean (which is a function of $\theta$).
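
To make this concrete, the sketch below (assuming NumPy; not from the slides) shows that two different Bernoulli data sets with the same number of successes r produce identical likelihood functions for $\theta$.

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)

def bernoulli_loglik(x, theta):
    """Log-likelihood of i.i.d. Bernoulli data x at each value of theta."""
    return np.sum(x[:, None] * np.log(theta) + (1 - x[:, None]) * np.log(1 - theta), axis=0)

# Two different data sets of size 10, both with r = 4 successes.
x1 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
x2 = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 0])

print(np.allclose(bernoulli_loglik(x1, theta), bernoulli_loglik(x2, theta)))  # True
```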

Interval Estimate
A point estimate does not convey the uncertainty associated with it; interval estimates provide a confidence interval.
Example: we have 100 observations from N($\mu$, $\sigma^2$) with unknown $\mu$ and known $\sigma^2$, and we want a 95% confidence interval for the estimate of $\mu$. The distribution of the sample mean $\bar{x}$ is N($\mu$, $\sigma^2/100$), and 95% of a normal distribution lies within 1.96 standard deviations of its mean, so
$$P(\mu - 1.96\,\sigma/10 < \bar{x} < \mu + 1.96\,\sigma/10) = 0.95$$
Rewritten in terms of $\mu$:
$$P(\bar{x} - 1.96\,\sigma/10 < \mu < \bar{x} + 1.96\,\sigma/10) = 0.95$$
so $l(x) = \bar{x} - 1.96\,\sigma/10$ and $u(x) = \bar{x} + 1.96\,\sigma/10$ give a 95% confidence interval.
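
A small sketch of this calculation (assuming NumPy; the true mean of 0.25 is an arbitrary illustrative value, with n = 100 and known $\sigma$ as in the example):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma, n = 1.0, 100
x = rng.normal(0.25, sigma, size=n)        # true mean 0.25, known sigma

xbar = x.mean()
half_width = 1.96 * sigma / np.sqrt(n)     # 1.96 * sigma / 10 for n = 100
print(f"95% confidence interval for mu: ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```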

Bayesian vs. Frequentist Approach
Frequentist approach: the parameters are fixed but unknown; the data are a random sample; the intrinsic variability lies in the data D = {x(1), ..., x(n)}.
Bayesian approach: the data are known; the parameters $\theta$ are random variables; $\theta$ has a distribution of values, and p($\theta$) reflects the degree of belief about where the true parameter $\theta$ may lie.

Bayesian Estimation
The distribution of probabilities for $\theta$ is the prior p($\theta$). Analysis of the data leads to a modified distribution, called the posterior p($\theta \mid D$). The modification is done by Bayes' rule:
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int_{\psi} p(D \mid \psi)\, p(\psi)\, d\psi}$$
This leads to a distribution rather than a single value. A single value is obtainable as the mean or the mode of the posterior (the latter is known as the maximum a posteriori (MAP) estimate). The MAP and ML estimates of $\theta$ coincide when the prior is flat, preferring no single value. Thus MLE can be viewed as a special case of the MAP procedure, which in turn is a restricted form of Bayesian estimation.
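
The sketch below (assuming NumPy; the Beta(2, 2) prior and the data counts are illustrative choices, not from the slides) computes a posterior on a grid as prior times likelihood, normalizes it numerically, and compares the MAP estimate with the MLE.

```python
import numpy as np

# Data: r successes in n Bernoulli trials.
n, r = 20, 6
theta = np.linspace(0.001, 0.999, 999)
dtheta = theta[1] - theta[0]

prior = theta ** (2 - 1) * (1 - theta) ** (2 - 1)        # unnormalized Beta(2, 2) prior
likelihood = theta ** r * (1 - theta) ** (n - r)

posterior = prior * likelihood
posterior /= posterior.sum() * dtheta                    # normalize on the grid

print(f"MLE            = {r / n:.3f}")
print(f"MAP (grid)     = {theta[posterior.argmax()]:.3f}")
print(f"posterior mean = {np.sum(theta * posterior) * dtheta:.3f}")
```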

Summary of the Bayesian Approach
For a given data set D and a particular model (model = distributions for the prior and the likelihood):
$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)$$
In words: the posterior distribution given D (the distribution conditioned on having observed the data) is proportional to the product of the prior p($\theta$) and the likelihood p(D | $\theta$). If we have only a weak belief about the parameter before collecting data, we choose a wide prior (e.g., a normal with large variance). The larger the data set, the more dominant the likelihood becomes.

Bayesian Binomial
Single binary variable X; we wish to estimate $\theta = p(x{=}1)$. The prior for a parameter in [0, 1] is the Beta distribution,
$$p(\theta) \propto \theta^{\alpha-1}(1-\theta)^{\beta-1}, \qquad \mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha-1}(1-\theta)^{\beta-1}$$
where $\alpha > 0$ and $\beta > 0$ are the two parameters of this model, with
$$E[\theta] = \frac{\alpha}{\alpha+\beta}, \qquad \mathrm{mode}[\theta] = \frac{\alpha-1}{\alpha+\beta-2}, \qquad \mathrm{var}[\theta] = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$$
The likelihood (the same as for MLE) is $L(\theta \mid D) = \theta^{r}(1-\theta)^{n-r}$. Combining likelihood and prior:
$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) = \theta^{r}(1-\theta)^{n-r}\, \theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{r+\alpha-1}(1-\theta)^{n-r+\beta-1}$$
We get another Beta distribution, with parameters $r+\alpha$ and $n-r+\beta$ and mean
$$E[\theta \mid D] = \frac{r+\alpha}{n+\alpha+\beta}$$
If $\alpha = \beta = 0$ we get the standard MLE of r/n.
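
A minimal sketch of this conjugate update (assuming NumPy and SciPy are available; the Beta(1, 1), i.e., uniform, prior and the data sizes are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alpha, beta = 1.0, 1.0                      # Beta(1, 1) = uniform prior on theta
n, true_theta = 50, 0.3
r = rng.binomial(n, true_theta)             # number of milk purchases out of n

alpha_post = alpha + r                      # posterior is Beta(r + alpha, n - r + beta)
beta_post = beta + n - r

posterior = stats.beta(alpha_post, beta_post)
print(f"r = {r}")
print(f"posterior mean = {posterior.mean():.3f}   (closed form: {(r + alpha) / (n + alpha + beta):.3f})")
print(f"MLE            = {r / n:.3f}")
```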

Advantages of the Bayesian Approach
We retain full knowledge of all problem uncertainty, e.g., by calculating the full posterior distribution on $\theta$. For example, prediction of a new point x(n+1) not in the training set D is done by averaging over all possible $\theta$:
$$p(x(n+1) \mid D) = \int p(x(n+1), \theta \mid D)\, d\theta = \int p(x(n+1) \mid \theta)\, p(\theta \mid D)\, d\theta$$
using the fact that x(n+1) is conditionally independent of the training data D given $\theta$. We can also average over all possible models. This requires considerably more computation than maximum likelihood. The distribution is also updated naturally and sequentially as data arrive:
$$p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(D_1 \mid \theta)\, p(\theta)$$
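
Continuing the Beta-Binomial example, the probability that the next customer purchases milk is obtained by averaging p(x(n+1)=1 | $\theta$) = $\theta$ over the posterior, which is simply the posterior mean. The sketch below (assuming NumPy/SciPy; the posterior parameters continue the illustrative example above) checks this by numerical integration.

```python
import numpy as np
from scipy import stats

alpha_post, beta_post = 16.0, 36.0          # e.g. r = 15, n = 50 with a Beta(1, 1) prior
posterior = stats.beta(alpha_post, beta_post)

# p(x(n+1) = 1 | D) = integral of theta * p(theta | D) d(theta) = posterior mean.
theta = np.linspace(0.0005, 0.9995, 2000)
predictive = np.sum(theta * posterior.pdf(theta)) * (theta[1] - theta[0])

print(f"numerical integral = {predictive:.4f}")
print(f"posterior mean     = {posterior.mean():.4f}")   # (r + alpha) / (n + alpha + beta)
```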

Predictive Distribution
In the equation that modifies the prior into the posterior,
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int_{\psi} p(D \mid \psi)\, p(\psi)\, d\psi}$$
the denominator p(D) is called the predictive distribution of D. It represents predictions about the value of D, incorporating both the uncertainty about $\theta$ (via p($\theta$)) and the uncertainty about D when $\theta$ is known. It is useful for model checking: if the observed data have only a small probability under the predictive distribution, the model is unlikely to be correct.
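
As a worked instance for the binomial model with a Beta($\alpha$, $\beta$) prior (not shown on the slide, but standard), this denominator has a closed form in terms of the Beta function $B(\cdot,\cdot)$:
$$p(D) = \int_0^1 \binom{n}{r}\,\theta^{r}(1-\theta)^{n-r}\, \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}\, d\theta = \binom{n}{r}\,\frac{B(r+\alpha,\; n-r+\beta)}{B(\alpha,\beta)}$$
which is the beta-binomial distribution; an observed r that has very small probability under it suggests that the assumed prior and likelihood are a poor model.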

Bayesian: Normal Distribution
Suppose x comes from a normal distribution with unknown mean $\theta$ and known variance $\alpha$: $x \sim N(\theta, \alpha)$. The prior distribution for $\theta$ is $\theta \sim N(\theta_0, \alpha_0)$. Then
$$p(\theta \mid x) \propto p(x \mid \theta)\, p(\theta) = \frac{1}{\sqrt{2\pi\alpha}} \exp\!\left(-\frac{(x-\theta)^2}{2\alpha}\right) \cdot \frac{1}{\sqrt{2\pi\alpha_0}} \exp\!\left(-\frac{(\theta-\theta_0)^2}{2\alpha_0}\right)$$
After some algebra,
$$p(\theta \mid x) = \frac{1}{\sqrt{2\pi\alpha_1}} \exp\!\left(-\frac{(\theta-\theta_1)^2}{2\alpha_1}\right)$$
where $\alpha_1 = (\alpha_0^{-1} + \alpha^{-1})^{-1}$ (the posterior precision is the sum of the prior and data precisions) and $\theta_1 = \alpha_1(\theta_0/\alpha_0 + x/\alpha)$ is a precision-weighted sum of the prior mean and the datum. Thus a normal prior is altered to yield a normal posterior.
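
A minimal sketch of this update for a single observation (the prior mean, prior variance, data variance, and observed x are illustrative values, not from the slides):

```python
theta0, alpha0 = 0.0, 4.0     # prior mean and prior variance
alpha = 1.0                   # known data variance
x = 2.5                       # single observed datum

alpha1 = 1.0 / (1.0 / alpha0 + 1.0 / alpha)        # posterior variance (precisions add)
theta1 = alpha1 * (theta0 / alpha0 + x / alpha)    # precision-weighted posterior mean

print(f"posterior mean {theta1:.3f}, posterior variance {alpha1:.3f}")
# posterior mean 2.000, posterior variance 0.800
```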

Improper Priors
If we have no idea of the normal mean, we could give it a uniform prior distribution, but this would not be a density function. We could, however, adopt an improper prior which is uniform over the region where the parameter may occur.

Jeffreys Prior
Results obtained with one prior may differ from those obtained with another, so it is useful to have a reference prior. The Fisher information is the negative expectation of the second derivative of the log-likelihood,
$$I(\theta \mid x) = -E\!\left[\frac{\partial^2 \log L(\theta \mid x)}{\partial \theta^2}\right]$$
and measures the curvature or flatness of the likelihood function. The Jeffreys prior is
$$p(\theta) \propto \sqrt{I(\theta \mid x)}$$
It is consistent no matter how the parameter is transformed.
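
As a worked example (not on the slide, but standard), for a single Bernoulli observation the log-likelihood is $\log L(\theta \mid x) = x\log\theta + (1-x)\log(1-\theta)$, so
$$I(\theta \mid x) = -E\!\left[-\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}\right] = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}, \qquad p(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$$
which is a Beta(1/2, 1/2) distribution.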

Conjugate Priors
A Beta prior combined with a binomial likelihood gives a Beta posterior; a normal prior combined with a normal likelihood gives a normal posterior. The advantage of using conjugate families is that the updating process is replaced by a simple updating of the parameters.

Credibility Interval
A point estimate or an interval estimate can be obtained from the posterior distribution. When there is a single parameter, the interval containing a given probability (say 90%) is a credibility interval. Its interpretation is more straightforward than that of frequentist confidence intervals.
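
A short sketch (assuming SciPy; the Beta(16, 36) posterior continues the earlier illustrative example) of extracting a central 90% credibility interval directly from the posterior quantiles:

```python
from scipy import stats

posterior = stats.beta(16, 36)                       # posterior from the Beta-Binomial example
lo, hi = posterior.ppf(0.05), posterior.ppf(0.95)    # central 90% credibility interval
print(f"90% credibility interval for theta: ({lo:.3f}, {hi:.3f})")
```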

Stochastic Estimation
Bayesian methods can involve complicated joint distributions. Drawing random samples from the estimated distributions enables the properties of the distributions of the parameters to be estimated. Methods of this kind are called Markov chain Monte Carlo (MCMC).
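
As a hedged illustration (not from the slides), a few lines of Metropolis sampling for the binomial posterior with a uniform prior; the proposal scale and chain length are arbitrary choices, and in this conjugate case the result can be checked against the exact Beta(r+1, n-r+1) posterior.

```python
import numpy as np

rng = np.random.default_rng(5)
n, r = 50, 15                                 # observed data: r successes in n trials

def log_post(theta):
    """Unnormalized log posterior: binomial likelihood times a uniform prior."""
    if theta <= 0.0 or theta >= 1.0:
        return -np.inf
    return r * np.log(theta) + (n - r) * np.log(1.0 - theta)

samples, theta = [], 0.5
for _ in range(20_000):
    proposal = theta + rng.normal(0.0, 0.1)   # random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                      # accept; otherwise keep current theta
    samples.append(theta)

samples = np.array(samples[2_000:])           # discard burn-in
print(f"MCMC posterior mean {samples.mean():.3f}  (exact: {(r + 1) / (n + 2):.3f})")
```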