A Bayesian Analysis of Some Nonparametric Problems


Thomas S. Ferguson
April 8, 2003

Introduction

The Bayesian approach has remained rather unsuccessful in treating nonparametric problems. This is primarily due to the difficulty of finding a workable prior distribution on the parameter space, which in nonparametric problems is taken to be a set of probability distributions on a given sample space.

Two Desirable Properties of a Prior
1. The support of the prior should be large.
2. The posterior distribution given a sample of observations should be manageable analytically.
These properties are antagonistic: one may be obtained only at the expense of the other.

The Main Idea of the Paper
The paper presents a class of prior distributions, called Dirichlet process priors, broad in the sense of (1), for which (2) is realized, and for which the treatment of many nonparametric problems may be carried out, yielding results that are comparable to the classical theory.

The Dirichlet Distribution

Known to Bayesians as the conjugate prior for the parameter of a multinomial distribution.

Let $G(\alpha, \beta)$ denote the gamma distribution with shape parameter $\alpha \ge 0$ and scale parameter $\beta > 0$; for $\alpha = 0$ it is degenerate at 0. Let $Z_j$, $j = 1, \ldots, k$, be independent with $Z_j \sim G(\alpha_j, 1)$, where $\alpha_j \ge 0$ for all $j$ and $\alpha_j > 0$ for some $j$. The Dirichlet distribution with parameters $(\alpha_1, \ldots, \alpha_k)$, denoted by $D(\alpha_1, \ldots, \alpha_k)$, is defined as the distribution of $(Y_1, \ldots, Y_k)$, where

$$Y_j = \frac{Z_j}{\sum_{i=1}^{k} Z_i}, \quad j = 1, \ldots, k. \tag{1}$$

Note: Since $Y_1 + \cdots + Y_k = 1$, the distribution is always singular. If any $\alpha_j = 0$, the corresponding $Y_j$ is degenerate at 0. If $\alpha_j > 0$ for all $j$, the distribution of $(Y_1, \ldots, Y_{k-1})$ is continuous with density

$$f(y_1, \ldots, y_{k-1}) = \frac{\Gamma(\alpha_1 + \cdots + \alpha_k)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_k)} \Big( \prod_{j=1}^{k-1} y_j^{\alpha_j - 1} \Big) \Big( 1 - \sum_{j=1}^{k-1} y_j \Big)^{\alpha_k - 1} I_S(y_1, \ldots, y_{k-1}), \tag{2}$$

where $S = \{(y_1, \ldots, y_{k-1}) : y_j \ge 0, \ \sum_{j=1}^{k-1} y_j \le 1\}$. For $k = 2$, (2) reduces to the Beta distribution.
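To make the construction in (1) concrete, here is a minimal NumPy sketch (not from the paper; the parameter values are illustrative): draw independent gammas and normalize. Each row of `Y` sums to one, reflecting the singularity noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])      # (alpha_1, ..., alpha_k), all > 0

# Z_j ~ G(alpha_j, 1) independently; normalize as in eq. (1).
Z = rng.gamma(shape=alpha, scale=1.0, size=(100_000, alpha.size))
Y = Z / Z.sum(axis=1, keepdims=True)   # rows are draws of (Y_1, ..., Y_k)

print(Y.sum(axis=1)[:3])               # each row sums to 1 (singularity)
print(Y.mean(axis=0))                  # ~ alpha / alpha.sum() = [0.2, 0.3, 0.5]
```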

Properties of the Dirichlet Distribution

(i) If $(Y_1, \ldots, Y_k) \sim D(\alpha_1, \ldots, \alpha_k)$ and $r_1, \ldots, r_l$ are integers with $0 < r_1 < \cdots < r_l = k$, then

$$\Big( \sum_{i=1}^{r_1} Y_i, \sum_{i=r_1+1}^{r_2} Y_i, \ldots, \sum_{i=r_{l-1}+1}^{r_l} Y_i \Big) \sim D\Big( \sum_{i=1}^{r_1} \alpha_i, \sum_{i=r_1+1}^{r_2} \alpha_i, \ldots, \sum_{i=r_{l-1}+1}^{r_l} \alpha_i \Big).$$

(ii) The marginal distribution of $Y_j$ is Beta: $Y_j \sim \mathrm{Be}(\alpha_j, \alpha - \alpha_j)$, where $\alpha = \sum_{i=1}^{k} \alpha_i$.

(iii) $E(Y_i) = \alpha_i / \alpha$, $E(Y_i^2) = \alpha_i(\alpha_i + 1) / (\alpha(\alpha + 1))$, and $E(Y_i Y_j) = \alpha_i \alpha_j / (\alpha(\alpha + 1))$ for $i \ne j$.

The Dirichlet Process

Let $\mathcal{X}$ be a set, $\mathcal{A}$ a $\sigma$-field of subsets of $\mathcal{X}$, and $\alpha$ a finite non-null measure on $(\mathcal{X}, \mathcal{A})$. Define a random probability $P$ by defining the joint distribution of $(P(B_1), \ldots, P(B_k))$ for every $k$ and all measurable partitions $(B_1, \ldots, B_k)$ of $\mathcal{X}$. We say $P$ is a Dirichlet process on $(\mathcal{X}, \mathcal{A})$ with parameter $\alpha$ if for every $k = 1, 2, \ldots$ and every measurable partition $(B_1, \ldots, B_k)$ of $\mathcal{X}$, the distribution of $(P(B_1), \ldots, P(B_k))$ is Dirichlet, $D(\alpha(B_1), \ldots, \alpha(B_k))$.
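Properties (i) and (ii) are easy to check by simulation. Continuing the sketch above (same `Y` and `alpha`; again illustrative, not from the paper), $Y_1 + Y_2$ should be $\mathrm{Be}(\alpha_1 + \alpha_2, \alpha_3) = \mathrm{Be}(5, 5)$:

```python
# Property (i) with r_1 = 2, r_2 = 3: (Y_1 + Y_2, Y_3) ~ D(5, 5), so the
# aggregated coordinate W = Y_1 + Y_2 is Be(5, 5) by property (ii).
W = Y[:, 0] + Y[:, 1]
a, b = alpha[0] + alpha[1], alpha[2]
print(W.mean(), a / (a + b))                          # ~0.5 vs 0.5
print(W.var(), a * b / ((a + b) ** 2 * (a + b + 1)))  # ~0.0227 vs 0.0227
```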

Proposition. Let P be a Dirichlet process on (X, A) with parameter α and let A A. If α(a) = 0, then P (A) = 0 with probability one. If α(a) > 0 then P (A) > 0 with probability one. Furthermore, E(P (A)) = α(a) α(x ). Proposition 2. Let P be a Dirichlet process on (X, A) with parameter α, and let Q be a fixed probability measure on (X, A) with Q α. Then, for any positive integer m and measurable sets A,..., A m, and ɛ > 0, P{ P (A i ) Q(A i ) < ɛ for i =,..., m} > 0. Proposition 3. Let P be a Dirichlet process on (X, A) with parameter α and let X be a sample of size from P. Then for A A, P(X A) = α(a)/α(x ). Theorem : Let P be a Dirichlet process on (X, A) with parameter α, and let X,..., X n be a sample of size n from P. Then conditional distribution of P given X,..., X n, is a Dirichlet process with parameter α+ n. Theorem 2 : Let P be the Dirichlet process and let Z be a measurable real valued function defined on (X, A). If Z dα <, then Z dp with probability one, and E( ZdP ) = ZdE(P ) = α(x ) Zdα. 4

Application

$P \sim D(\alpha)$ says $P$ is a Dirichlet process on $(\mathcal{X}, \mathcal{A})$ with parameter $\alpha$. Let $R$ be the real line and $\mathcal{B}$ the $\sigma$-field of Borel sets, and take $(\mathcal{X}, \mathcal{A}) = (R, \mathcal{B})$.

Nonparametric Statistical Decision Problem

The parameter space is the set of all probability measures $P$ on $(\mathcal{X}, \mathcal{A})$. The statistician is to choose an action $a$ in some space, thereby incurring a loss $L(P, a)$. A sample $X_1, \ldots, X_n$ from $P$ is available to the statistician, upon which he may base his choice of action.

What does a Bayesian do? He seeks a Bayes rule with respect to the prior distribution $P \sim D(\alpha)$. The posterior distribution of $P$ given the observations is $D(\alpha + \sum_{i=1}^{n} \delta_{X_i})$. Find a Bayes rule for the no-sample problem ($n = 0$); a Bayes rule for the general problem is then found by replacing $\alpha$ by $\alpha + \sum_{i=1}^{n} \delta_{X_i}$.

(a) Estimation of a distribution function

Let $(\mathcal{X}, \mathcal{A}) = (R, \mathcal{B})$, and let the action space be the space of all distribution functions on $R$. Loss function:

$$L(P, \hat{F}) = \int (F(t) - \hat{F}(t))^2 \, dW(t),$$

where $W$ is a given finite measure on $(R, \mathcal{B})$ (a weight function) and $F(t) = P((-\infty, t])$. If $P \sim D(\alpha)$, then $F(t) \sim \mathrm{Be}(\alpha((-\infty, t]), \alpha((t, \infty)))$ for each $t$.

The Bayes risk for the no-sample problem,

$$E(L(P, \hat{F})) = \int E(F(t) - \hat{F}(t))^2 \, dW(t),$$

is minimized by choosing $\hat{F}(t) = E(F(t))$. Thus the Bayes rule for the no-sample problem is $\hat{F}(t) = E(F(t)) = F_0(t)$, where $F_0(t) = \alpha((-\infty, t]) / \alpha(R)$.

For a sample of size $n$ the Bayes rule is

$$\hat{F}_n(t \mid X_1, \ldots, X_n) = \frac{\alpha((-\infty, t]) + \sum_{i=1}^{n} \delta_{X_i}((-\infty, t])}{\alpha(R) + n} = p_n F_0(t) + (1 - p_n) F_n(t \mid X_1, \ldots, X_n),$$

where $p_n = \alpha(R) / (\alpha(R) + n)$ and $F_n(t \mid X_1, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}((-\infty, t])$ is the empirical distribution function.

The Bayes rule is a mixture of our prior guess at $F$ and the empirical distribution; $\alpha(R)$ can be interpreted as a measure of faith in the prior guess at $F$.
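A minimal sketch of this estimate, assuming (illustratively) a standard normal prior guess $F_0$; the data and $\alpha(R)$ are made up:

```python
import math
import numpy as np

def F0(t):
    """Prior guess at F: standard normal CDF (an assumed choice)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def F_hat(t, X, alpha_R):
    """Bayes estimate of F(t) under D(alpha) with total mass alpha(R)."""
    n = len(X)
    p_n = alpha_R / (alpha_R + n)
    F_n = np.mean(np.asarray(X) <= t)          # empirical d.f. at t
    return p_n * F0(t) + (1.0 - p_n) * F_n

X = [0.3, -1.2, 0.7, 2.1, -0.4]
print(F_hat(0.0, X, alpha_R=2.0))   # pulled toward F0(0) = 0.5 by the prior
```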

(b) Estimation of the mean

Let $(\mathcal{X}, \mathcal{A}) = (R, \mathcal{B})$. Loss function: $L(P, \hat{\mu}) = (\mu - \hat{\mu})^2$, where $\mu = \int x \, dP(x)$.

Assume $P \sim D(\alpha)$, where $\alpha$ has a finite first moment. The mean of the corresponding probability measure $\alpha(\cdot) / \alpha(R)$ is $\mu_0 = \int x \, d\alpha(x) / \alpha(R)$. By Theorem 2, the random variable $\mu$ exists, and the Bayes rule for the no-sample problem is the mean of $\mu$, namely $\hat{\mu} = \mu_0$.

For a sample of size $n$, the Bayes rule is

$$\hat{\mu}_n(X_1, \ldots, X_n) = \frac{1}{\alpha(R) + n} \int x \, d\Big( \alpha(x) + \sum_{i=1}^{n} \delta_{X_i}(x) \Big) = p_n \mu_0 + (1 - p_n) \bar{X}_n,$$

with $p_n$ as before and $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$.

The Bayes estimate lies between the prior guess at $\mu$ and the sample mean. As $\alpha(R) \to 0$, $\hat{\mu}_n$ converges to $\bar{X}_n$. As $n \to \infty$, $p_n \to 0$, i.e., the Bayes estimate is strongly consistent within the class of distributions with finite first moments.
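The shrinkage form of $\hat{\mu}_n$ is one line of code; a sketch with the same made-up data as in (a) and an assumed prior mean $\mu_0 = 0$:

```python
import numpy as np

def mu_hat(X, mu0, alpha_R):
    """Bayes estimate p_n * mu0 + (1 - p_n) * mean(X)."""
    X = np.asarray(X)
    p_n = alpha_R / (alpha_R + X.size)
    return p_n * mu0 + (1.0 - p_n) * X.mean()

X = [0.3, -1.2, 0.7, 2.1, -0.4]
print(mu_hat(X, mu0=0.0, alpha_R=2.0))   # between mu0 = 0 and X_bar = 0.3
```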

(c) Estimation of the median

Let $(\mathcal{X}, \mathcal{A}) = (R, \mathcal{B})$. We need to estimate the median, $m = \mathrm{med}\, P$, of an unknown probability measure $P$ on $(R, \mathcal{B})$. For $P \sim D(\alpha)$, $m$ is a random variable. Loss function: $L(P, \hat{m}) = |m - \hat{m}|$ (absolute error loss). Any median of the distribution of $m$ is a Bayes estimate of $m$.

For the Dirichlet process prior, any median of the distribution of $m$ is a median of the expectation of $P$:

$$\mathrm{med}(\text{dist. of } \mathrm{med}\, P) = \mathrm{med}\, E(P).$$

A number $t$ is a median of the distribution of $m$ iff $\mathcal{P}\{m < t\} \le \frac{1}{2} \le \mathcal{P}\{m \le t\}$. Now $\mathcal{P}\{m \le t\} = \mathcal{P}\{F(t) \ge \frac{1}{2}\}$, and $F(t)$ has a $\mathrm{Be}(\alpha((-\infty, t]), \alpha((t, \infty)))$ distribution, whose median is a nondecreasing function of $t$ with value one-half iff the two parameters are equal. Hence $t$ satisfies the condition above iff

$$\frac{\alpha((-\infty, t))}{\alpha(R)} \le \frac{1}{2} \le \frac{\alpha((-\infty, t])}{\alpha(R)};$$

such $t$ are the medians of $E(P)$. Thus any number $t$ satisfying the above is a Bayes estimate of $m$ for the prior $D(\alpha)$ and absolute error loss. For $F_0$ defined as in (a), $\hat{m}$ = median of $F_0$.

For a sample of size $n$, the Bayes estimate is $\hat{m}_n(X_1, \ldots, X_n)$ = median of $\hat{F}_n$, where $\hat{F}_n$ is the Bayes estimate of $F$ obtained in (a).
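Since $\hat{m}_n$ is just a median of $\hat{F}_n$, it can be located numerically, e.g. by a grid search for the smallest $t$ with $\hat{F}_n(t) \ge \frac{1}{2}$. A sketch reusing `F_hat` and the made-up data from the code in (a):

```python
import numpy as np

X = [0.3, -1.2, 0.7, 2.1, -0.4]
grid = np.linspace(-3.0, 3.0, 6001)
values = np.array([F_hat(t, X, alpha_R=2.0) for t in grid])
m_hat = grid[np.argmax(values >= 0.5)]   # first grid point crossing 1/2
print(m_hat)
```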

(d) Estimation of F dg for a two-sample problem F and G are 2 distribution functions on the real line. Let X,..., X m be a sample from F and Y,..., Y n a sample from G. Estimation of = P (X Y ), = F dg Loss function: Squared error. Priors : F is the d.f. of P D(α ) G is the d.f. of P 2 D(α 2 ) P and P 2 are independent. The Bayes rule for the no-sample problem is, 0 = E( ) = F 0 dg 0 F 0 = E(F ) ; G 0 = E(G). Given 2 samples, Bayes rule is ˆ (X,..., X m, Y,..., Y n ) = ˆFm dĝn ˆF m and Ĝn are the respective Bayes estimate of F and G as in application (a). It can be written as, p,m = U = i α (R) ˆ (X,..., X m, Y,..., Y n ) = p,m p,n 0 + p,m ( p 2,n ) n + ( p,m )p 2,n n m ( G 0 (X i )) + ( p,m )( p 2,n ) mn U α 2(R) α 2 (R)+n n F 0 (Y j ), p α (R)+m 2,m = j I (,Y j )(X i ) is the Mann-Whitney statistic. 9
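A minimal sketch of the four-term mixture, assuming (illustratively) standard normal prior guesses $F_0 = G_0 = \Phi$, for which $\hat{\Delta}_0 = \int F_0 \, dG_0 = \frac{1}{2}$; the samples and prior masses are made up:

```python
import math
import numpy as np

def Phi(t):
    """Standard normal CDF (assumed prior guess for both F and G)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def delta_hat(X, Y, a1, a2, F0=Phi, G0=Phi, delta0=0.5):
    """Bayes estimate of Delta; a1 = alpha_1(R), a2 = alpha_2(R)."""
    X, Y = np.asarray(X), np.asarray(Y)
    m, n = X.size, Y.size
    p1, p2 = a1 / (a1 + m), a2 / (a2 + n)
    U = (X[:, None] <= Y[None, :]).sum()     # Mann-Whitney statistic
    return (p1 * p2 * delta0
            + p1 * (1 - p2) * np.mean([F0(y) for y in Y])
            + (1 - p1) * p2 * np.mean([1 - G0(x) for x in X])
            + (1 - p1) * (1 - p2) * U / (m * n))

X = [0.3, -1.2, 0.7]
Y = [0.5, 1.4, -0.1, 2.2]
print(delta_hat(X, Y, a1=1.0, a2=1.0))
```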