Lecture 2: Basic Concepts of Statistical Decision Theory


EE378A Statistical Signal Processing    Lecture 2 - 03/31/2016
Lecturer: Jiantao Jiao, Tsachy Weissman    Scribes: John Miller and Aran Nayebi

In this lecture, we will introduce some of the basic concepts of statistical decision theory, which will play crucial roles throughout the course.

1 Motivation

The process of inductive inference consists of

1. Observing a phenomenon,
2. Constructing a model of that phenomenon,
3. Making predictions using this model.

The main idea behind learning algorithms is to generalize past observations to make future predictions. Of course, given a set of observations, one can always find a function that exactly fits the observed data, or a probability measure that generates the observed data with probability one. However, without placing any restrictions on how the future is related to the past, the No Free Lunch theorem essentially says that it is impossible to generalize to new data. In other words, data alone cannot replace knowledge. Hence, a central theme of this course is

    Generalization = Data + Knowledge.

We can view statistical decision theory and statistical learning theory as different ways of incorporating knowledge into a problem in order to ensure generalization.

2 Decision Theory

2.1 Basic Setup

The basic setup in statistical decision theory is as follows: we have an outcome space $\mathcal{X}$ and a class of probability measures $\{P_\theta : \theta \in \Theta\}$, and we observe $X \sim P_\theta$, $X \in \mathcal{X}$. An example is the Gaussian measure, defined in this framework by

    $\frac{dP_\theta}{dx} = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad \theta = (\mu, \sigma^2).$

Our goal as statisticians is to estimate $g(\theta)$, where $g$ is an arbitrary function that is known to us and $\theta$ is unknown. Observe that this goal is more general than simply estimating $\theta$ itself; the generality is motivated by the fact that the parametrization $\theta$ can be arbitrary.
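As a concrete instance of this setup, the following sketch plays both roles of the game: Nature fixes $\theta = (\mu, \sigma^2)$, the statistician observes $X \sim P_\theta$ and estimates $g(\theta) = \mu$ with the sample mean. The specific numbers and the choice of rule are illustrative only, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nature picks theta = (mu, sigma^2); these values are illustrative and
# are treated as unknown to the statistician.
mu, sigma2 = 1.5, 4.0
n = 1000

# The statistician observes X_1, ..., X_n ~ P_theta ...
X = rng.normal(mu, np.sqrt(sigma2), size=n)

# ... and applies a decision rule; here g(theta) = mu and the rule is
# the sample mean.
def delta(x):
    return x.mean()

print(delta(X))  # close to mu for large n
```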
A useful perspective on the goal of estimating $g(\theta)$ is to view it as a game between Nature and the statistician, where Nature chooses $\theta$ and the statistician chooses a decision rule $\delta(X)$. Note that a general decision rule may be randomized, i.e., for any realization $X = x$, $\delta(x)$ produces an action $a \in \mathcal{A}$ following the probability distribution $D(da \mid x)$. Intuitively, after observing $x$, the decision rule chooses the action $a \in \mathcal{A}$ at random from the distribution $D(da \mid x)$. As a special case, for a deterministic decision rule $\delta(x)$, $D(da \mid x)$ is a one-point probability mass at $\delta(x)$.

Some readings:
1. Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi (2004). Introduction to Statistical Learning Theory.
2. Lawrence D. Brown (2000). An Essay on Statistical Decision Theory.
3. Abraham Wald (1949). Statistical Decision Functions.

Moreover, we have a loss function $L(\theta, a) : \Theta \times \mathcal{A} \to \mathbb{R}$ to quantify the loss incurred by taking action $a \in \mathcal{A}$ when the parameter is $\theta$. Usually we assume that $g(\theta)$ belongs to the action space $\mathcal{A}$, and that

    $L(\theta, a) \ge 0, \quad \forall a \in \mathcal{A},\ \theta \in \Theta,$   (1)
    $L(\theta, g(\theta)) = 0, \quad \forall \theta \in \Theta.$   (2)

Immediately, we see that the loss $L(\theta, \delta(X))$ is inherently a random quantity, with two possible sources of randomness: the observation $X$ is random, and for a fixed $X$ the action produced by a randomized decision rule $\delta(X) \sim D(da \mid X)$ may also be random. We therefore work with a risk function $R(\theta, \delta)$, defined in terms of the loss function, which is deterministic.

Definition 1 (Risk). For a general randomized decision rule $\delta(X)$,

    $R(\theta, \delta) \triangleq \mathbb{E}_\theta[L(\theta, \delta(X))] = \int\!\!\int L(\theta, a)\, D(da \mid x)\, dP_\theta(x).$   (3)

When the decision rule $\delta(X)$ is deterministic, this reduces to $R(\theta, \delta) = \int L(\theta, \delta(x))\, dP_\theta(x)$.

Intuitively, the risk $R(\theta, \delta)$ is the average loss incurred by using the decision procedure $\delta$ over many draws of data from $P_\theta$.

2.2 Comparing Procedures

Decision theory offers a principled way of analyzing the performance of different statistical procedures by evaluating and comparing their respective risks $R(\theta, \delta)$. For example, suppose we observe data $X \sim P_\theta$. Should we use maximum likelihood, the method of moments, or some other procedure to estimate $g(\theta)$? Decision theory allows us to rule out certain "inadmissible" procedures.

Definition 2 (Admissibility). A decision rule $\delta(X)$ is called inadmissible if there exists some decision rule $\delta'(X)$ such that
1. $R(\theta, \delta') \le R(\theta, \delta)$ for all $\theta \in \Theta$, and
2. there exists $\theta_0 \in \Theta$ such that $R(\theta_0, \delta') < R(\theta_0, \delta)$.
Such a rule $\delta'$ is said to dominate $\delta$. If there does not exist any rule $\delta'(X)$ that dominates $\delta(X)$, then $\delta(X)$ is called admissible.
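The risk in Definition 1 can be approximated by simulation. The sketch below (an illustration, not part of the original notes) estimates the squared-error risk of two deterministic rules in the Gaussian location model $X_1, \ldots, X_n \sim N(\theta, 1)$; in this model the sample mean has smaller risk than the sample median at every $\theta$, so the risks can be compared pointwise.

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_risk(delta, theta, n=20, reps=20000):
    """Monte Carlo estimate of R(theta, delta) = E_theta[L(theta, delta(X))]
    under squared loss L(theta, a) = (a - theta)^2, X_i i.i.d. N(theta, 1)."""
    X = rng.normal(theta, 1.0, size=(reps, n))
    actions = delta(X)                       # one action per simulated dataset
    return np.mean((actions - theta) ** 2)

for theta in (-1.0, 0.0, 2.0):
    r_mean = mc_risk(lambda X: X.mean(axis=1), theta)
    r_median = mc_risk(lambda X: np.median(X, axis=1), theta)
    print(theta, r_mean, r_median)   # the mean has smaller risk at every theta
```

For $n = 20$ the risk of the sample mean is exactly $1/n = 0.05$, independent of $\theta$, while the median's risk is larger by a factor approaching $\pi/2$.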
Having ruled out inadmissible decision rules, one might seek to find the best rule, namely the rule that uniformly minimizes the risk. However, in most cases of interest, there is no such uniformly best decision rule. For instance, if Nature chooses $\theta_0$, then the trivial constant procedure $\delta(x) = g(\theta_0)$ achieves the minimum possible risk $R(\theta_0, \delta) = 0$; since no other rule can beat it at $\theta_0$, in most cases this $\delta$ is admissible. However, such a decision rule is intuitively unreasonable and, moreover, can have arbitrarily high risk for $\theta \ne \theta_0$. There are two main approaches to restricting the order of the game to avoid this triviality: we can restrict the statistician, or we can assume more knowledge about Nature.

2.3 Restrict the Statistician

Let $\mathcal{D}$ be the set of all possible decision rules. One way to restrict the statistician is to force her to choose a decision rule from some subset $\mathcal{D}_0 \subset \mathcal{D}$. How should we choose $\mathcal{D}_0$? Some possibilities are:

1. Unbiasedness: take $\mathcal{D}_0$ to be the set of unbiased decision rules, namely the rules $\delta(X)$ such that
       $\mathbb{E}_\theta[\delta(X)] = g(\theta), \quad \forall \theta \in \Theta.$
2. Equivariance (which we will not discuss as much in class): as an example, suppose $p_\theta(x) = f(x - \theta)$ with $\theta \in \mathbb{R}$, $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, and observations $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} P_\theta$. Then we choose our decision function $\delta$ to satisfy
       $\delta(X_1 + c, X_2 + c, \ldots, X_n + c) = \delta(X_1, \ldots, X_n) + c$
   for any constant $c$. One reason to choose such a decision function $\delta$ is that, for equivariant estimators, the variance of the estimator does not depend on $\theta$.

2.4 Assume More Knowledge about Nature

The downside of restricting the statistician is that we may be ignoring good decision rules $\delta$. An alternative approach is to assume more knowledge about Nature. Here we run into the classic Bayesian vs. frequentist debate, whose two sides can be viewed as two different ways of assuming more knowledge about Nature.

1. Bayes: We assume that $\theta$ is drawn from a probability density, namely $\theta \sim \lambda(\theta)$. For instance, in a disease-detection problem, the prior $\lambda(\theta)$ could be the prevalence of the disease in the population. Thus, in the Bayesian setting, the new risk function becomes the average risk

       $r(\delta) = \int R(\theta, \delta)\, \lambda(\theta)\, d\theta.$   (4)

   As a result, to minimize the risk, the optimization problem becomes

       $\min_\delta\, r(\delta),$   (5)

   which is a well-posed optimization problem over $\delta$ alone (and can be solved in many cases, though the solution can be computationally intractable). It is known that if $\theta$ is finite dimensional, the observations are $X_1, \ldots, X_n \sim P_\theta$, and $\lambda(\theta)$ is supported everywhere, then as $n \to \infty$ the prior $\lambda(\theta)$ has no effect on inference asymptotically. However, there are a couple of catches here. The first is that we are assuming $n \to \infty$; for finite $n$, the prior $\lambda(\theta)$ can influence the decision rule rather significantly. The second is that we are assuming $\theta$ is finite dimensional, which does not hold in general semiparametric and nonparametric models, e.g., when we are trying to estimate a function.

2. Frequentist: The frequentist approach is to not impose a prior, which leads naturally to guarding against the worst case via a minimax approach. Namely, the risk in the minimax setting becomes

       $r(\delta) = \max_\theta R(\theta, \delta),$   (6)

   and the optimization problem is, as before,

       $\min_\delta\, r(\delta).$   (7)

3 Data Reduction

Not all data is relevant to a particular decision problem. Indeed, the irrelevant part of the data can be discarded and replaced with some statistic $T(X)$ of the data without hurting performance. We make this precise below.
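Before turning to data reduction, the Bayes and minimax approaches of Section 2.4 can be compared on a standard Bernoulli example (not worked out in these notes, added here as an illustration): for $T \sim \text{Binomial}(n, \theta)$ under squared loss, the rule $(T + \sqrt{n}/2)/(n + \sqrt{n})$ is known to have constant risk and to be minimax, being the Bayes rule against the least favorable $\text{Beta}(\sqrt{n}/2, \sqrt{n}/2)$ prior. The sketch below compares exact risk curves for the MLE, a $\text{Beta}(2,2)$ posterior mean, and this minimax rule.

```python
import numpy as np

n = 10
s = np.sqrt(n)
thetas = np.linspace(0.0, 1.0, 201)

# Exact squared-error risks for three rules based on T ~ Binomial(n, theta):
#   MLE          delta(T) = T / n
#   Bayes rule   delta(T) = (T + a) / (n + a + b)   (posterior mean, Beta(a,b) prior)
#   minimax rule delta(T) = (T + s/2) / (n + s)
a, b = 2.0, 2.0
risk_mle = thetas * (1 - thetas) / n
risk_bayes = (n * thetas * (1 - thetas) + (a - (a + b) * thetas) ** 2) / (n + a + b) ** 2
risk_minimax = (n * thetas * (1 - thetas) + n * (0.5 - thetas) ** 2) / (n + s) ** 2

# Worst-case risks: the minimax rule wins in the max-risk sense.
print(risk_mle.max(), risk_bayes.max(), risk_minimax.max())
```

Note the tradeoff: the minimax rule has the smallest worst-case risk (and in fact constant risk), but near $\theta = 0$ and $\theta = 1$ its risk exceeds that of the MLE.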

3.1 Sufficient Statistics

Definition 3 (Markov chain). Random variables $X, Y, Z$ are said to form a Markov chain $X - Y - Z$ if $X$ and $Z$ are conditionally independent given $Y$. In particular, the joint distribution can be written as
    $p(X, Y, Z) = p(X)\, p(Y \mid X)\, p(Z \mid Y) = p(Z)\, p(Y \mid Z)\, p(X \mid Y).$

Definition 4 (Sufficiency). A statistic $T(X)$ is sufficient for the model $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ if and only if the following Markov chains hold for any distribution on $\theta$:
1. $\theta - T(X) - X$
2. $\theta - X - T(X)$

One useful interpretation of the first condition is that if we know $T(X)$, we can generate $X$ without knowing $\theta$ (because $X$ and $\theta$ are independent conditioned on $T(X)$). The second condition usually holds trivially, since in most cases $T(X)$ is a deterministic function of $X$. However, a subtle but important point about the second condition is that the statistic $T(X)$ could be a random function of $X$; in this case, the second condition implies that given $X$, $T(X)$ does not depend on the randomness of $\theta$.

3.2 Neyman-Fisher Factorization Criterion

Checking the definition of sufficiency directly can often be difficult. In cases where the model distributions admit densities, we can give a characterization that is easier to work with. We first give the technical conditions necessary for the existence of density functions. Let $(\Omega, \mathcal{F})$ be a measurable space. Recall that $A \in \mathcal{F}$ is null for a measure $\mu$ on $(\Omega, \mathcal{F})$ if $\mu(B) = 0$ for every $B \in \mathcal{F}$ with $B \subset A$.

Definition 5 (Absolute continuity). Suppose $\mu$ and $\nu$ are measures on $(\Omega, \mathcal{F})$. We say $\nu$ is absolutely continuous with respect to $\mu$ if every null set of $\mu$ is also a null set of $\nu$. We write $\nu \ll \mu$.

Definition 6 (Dominated statistical model). The statistical model $(\mathcal{X}, \{P_\theta : \theta \in \Theta\})$ is dominated by a measure $\mu$ if $P_\theta \ll \mu$ for all $\theta \in \Theta$.

Theorem 1 (Neyman-Fisher Factorization Criterion). Suppose $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is dominated by a $\sigma$-finite measure $\mu$; in particular, each $P_\theta \in \mathcal{P}$ has a density $p_\theta = \frac{dP_\theta}{d\mu}$. Then $T(X)$ is sufficient if and only if there exist functions $g_\theta$ and $h$ such that
    $p_\theta(x) = g_\theta(T(x))\, h(x).$

Example 2.
Let $X_1, X_2, \ldots, X_n$ be i.i.d. Poisson($\lambda$) and $\theta = \lambda$. Then the joint distribution is

    $p_\theta(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n \frac{\lambda^{X_i} e^{-\lambda}}{X_i!} = e^{-n\lambda}\, \lambda^{\sum_{i=1}^n X_i} \cdot \frac{1}{X_1! \cdots X_n!} = g_\theta(T(X_1, \ldots, X_n))\, h(X_1, \ldots, X_n),$

so $T(X) = \sum_{i=1}^n X_i$ is sufficient by the Neyman-Fisher Factorization Criterion.

Note that this definition of sufficiency is slightly different from the conventional frequentist definition, and is along the lines of the Bayesian definition. The two definitions are equivalent under mild conditions; cf. [1].
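The factorization in Example 2 can be checked numerically. The sketch below (illustrative, with arbitrary sample values) evaluates the joint Poisson pmf and the product $g_\theta(T(x))\, h(x)$ and confirms they agree.

```python
import math

def poisson_joint(xs, lam):
    """Joint pmf  prod_i  lam^{x_i} e^{-lam} / x_i!  of i.i.d. Poisson samples."""
    p = 1.0
    for x in xs:
        p *= lam ** x * math.exp(-lam) / math.factorial(x)
    return p

def g(t, lam, n):
    """g_theta(T): depends on the data only through T = sum_i x_i."""
    return math.exp(-n * lam) * lam ** t

def h(xs):
    """h(x) = 1 / (x_1! ... x_n!): does not depend on theta."""
    prod_fact = 1.0
    for x in xs:
        prod_fact *= math.factorial(x)
    return 1.0 / prod_fact

xs, lam = [3, 0, 2, 5], 1.7
assert math.isclose(poisson_joint(xs, lam), g(sum(xs), lam, len(xs)) * h(xs))
print("factorization holds for this sample")
```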

3.3 Rao-Blackwell

In the decision theory framework, sufficient statistics provide a reduction of the data without loss of information. In particular, any risk that can be achieved using a decision rule based on $X$ can also be achieved by a decision rule based on $T(X)$, as the following theorem makes precise.

Theorem 3. Suppose $X \sim P_\theta \in \mathcal{P}$ and $T$ is sufficient for $\mathcal{P}$. For every decision rule $\delta(X)$ achieving risk $R(\theta, \delta(X))$, there exists a decision rule $\delta'(T(X))$ that achieves risk $R(\theta, \delta'(T(X))) \le R(\theta, \delta(X))$ for all $\theta \in \Theta$.

Sketch of proof. Given $T(X)$, we can sample a new dataset $X'$ from the conditional distribution $p(x \mid T(X))$. By sufficiency (the first condition in Definition 4), the conditional distribution $p(x \mid T(X))$ does not depend on $\theta$, and hence we can sample $X'$ without knowing $\theta$. We then define the randomized procedure $\delta'(T(X)) = \delta(X') \stackrel{d}{=} \delta(X)$. Since $\delta(X)$ and $\delta'(T(X))$ have the same distribution, they also have the same risk function.

It is rarely necessary to regenerate a dataset from sufficient statistics. Rather, in the case of convex losses, it is possible to obtain a non-randomized decision rule that matches or improves the performance of the original rule using the sufficient statistic alone.

Definition 7 (Strict convexity). A function $f : C \to \mathbb{R}$ is strictly convex if $C$ is a convex set and for all $x, y \in C$ with $x \ne y$ and all $t \in (0, 1)$,
    $f(tx + (1-t)y) < t f(x) + (1-t) f(y).$

Example 4. For any $\theta$, the loss $L(\theta, a) = (a - \theta)^2$ is strictly convex with respect to $a$ on $\mathbb{R}$.

Theorem 5 (Rao-Blackwell). Assume $L(\theta, a)$ is strictly convex in $a$. Then
    $R(\theta, \mathbb{E}[\delta(X) \mid T(X)]) < R(\theta, \delta(X)),$
unless $\mathbb{E}[\delta(X) \mid T(X)] = \delta(X)$ almost everywhere.

Rao-Blackwell is stronger than the previous theorem because, when the loss function is strictly convex, it is possible to strictly improve upon the original decision rule using only a reduced version of the data, $T(X)$, and a deterministic rule.

References

[1] Blackwell, D., and R. V. Ramamoorthi.
A Bayes but not classically sufficient statistic. The Annals of Statistics 10(3) (1982): 1025-1026.
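As a postscript, here is a small simulation illustrating the Rao-Blackwell theorem (Theorem 5). The Poisson model and the estimand $g(\lambda) = P(X_1 = 0) = e^{-\lambda}$ are illustrative choices, not from the lecture: the crude unbiased rule $\mathbf{1}\{X_1 = 0\}$ is improved by conditioning on the sufficient statistic $T = \sum_i X_i$.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative setting: X_1,...,X_n i.i.d. Poisson(lam), estimand
# g(lam) = P(X_1 = 0) = exp(-lam), squared-error loss.
lam, n, reps = 2.0, 10, 50_000
X = rng.poisson(lam, size=(reps, n))
T = X.sum(axis=1)                        # sufficient statistic sum_i X_i

target = np.exp(-lam)
delta = (X[:, 0] == 0).astype(float)     # crude unbiased rule 1{X_1 = 0}
# Rao-Blackwellization: E[1{X_1 = 0} | T = t] = (1 - 1/n)^t, since
# (X_1,...,X_n) given T = t is Multinomial(t, (1/n,...,1/n)).
delta_rb = (1.0 - 1.0 / n) ** T

mse = np.mean((delta - target) ** 2)     # roughly exp(-lam)(1 - exp(-lam)) ~ 0.117
mse_rb = np.mean((delta_rb - target) ** 2)
print(mse, mse_rb)                       # conditioning on T sharply reduces risk
```

Both rules are unbiased, but the conditioned rule's squared-error risk is more than an order of magnitude smaller here, exactly as Theorem 5 predicts for a strictly convex loss.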