Empirical Risk Minimization is an incomplete inductive principle

Thomas P. Minka

February 20, 2001

Abstract

Empirical Risk Minimization (ERM) only utilizes the loss function defined for the task and is completely agnostic about sampling distributions. Thus it covers only half of the story. Furthermore, ERM is equivalent to Bayesian decision theory with a particular choice of prior.

1 ERM is incomplete

Suppose we are given a data set D = {x_1, ..., x_N} and we want to predict future data in a way that incurs minimum expected loss. That is, we want to pick a number θ to minimize

R = ∫_x p(x|D) L(θ, x) dx    (1)

where L(θ, x) is the loss incurred by predicting θ when the real value is x, and p(x|D) represents what we know about future x after having observed D. This is the decision-theoretic method of estimation: (1) find the predictive density p(x|D) and (2) choose the estimate which minimizes loss on that density. The first step employs a sampling model for the data and is irrespective of the loss. The second step employs a loss function for the task and is irrespective of the sampling model.

Empirical Risk Minimization instead chooses θ to minimize the average loss we would have incurred on the data set D:

ER = (1/N) Σ_i L(θ, x_i)    (2)

which can be seen as a Monte Carlo estimate of (1). The problem is that this method completely discards any information we may have about the sampling distribution of x. The following examples demonstrate.
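Before turning to the examples, here is a minimal Python sketch (not part of the original note) that makes the two objectives concrete by minimizing each with a grid search; the helper names erm_estimate and bayes_estimate, and the use of Monte Carlo draws from p(x|D) in place of the integral in (1), are illustrative choices rather than anything prescribed by the text.

import numpy as np

def erm_estimate(data, loss, grid):
    # Empirical risk (2): average loss over the observed data points.
    risks = [np.mean(loss(theta, data)) for theta in grid]
    return grid[int(np.argmin(risks))]

def bayes_estimate(predictive_draws, loss, grid):
    # Monte Carlo version of the Bayes risk (1): average loss over
    # draws from the predictive density p(x|D).
    risks = [np.mean(loss(theta, predictive_draws)) for theta in grid]
    return grid[int(np.argmin(risks))]

The only difference between the two is what stands in for p(x|D): the observed data itself (ERM) or a predictive density built from a sampling model (decision theory). For the one-dimensional estimates below, a grid such as np.linspace(data.min(), data.max(), 2001) suffices.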

1.1 Gaussian data with squared loss

Let the loss function be quadratic and let the data be independent Gaussian with unit variance and unknown mean:

L(θ, x) = (θ − x)²    (3)

p(x|m) ∼ N(m, 1)    (4)

= (1/√(2π)) exp(−(1/2)(x − m)²)    (5)

For the Bayesian approach, we also need a prior on m in order to get a predictive density p(x|D). Let the prior be essentially uniform, e.g. a Gaussian with very large variance. Then the predictive density is

p(x|D) = ∫_m p(x|m) p(m|D) dm    (6)

≈ ∫_m p(x|m) N(m; x̄, 1/N) dm    (7)

= N(x̄, 1 + 1/N)    (8)

x̄ = (1/N) Σ_i x_i    (9)

Now, after we have computed the predictive density, we apply the loss function and minimize

R = ∫_x p(x|D) (θ − x)² dx    (10)

For any p(x|D), the minimum loss is achieved by the estimate

θ̂ = ∫_x x p(x|D) dx = E[x|D]    (11)

which in this case is x̄, the sample mean.

What about Empirical Risk Minimization? The empirical risk is

ER = (1/N) Σ_i (θ − x_i)²    (12)

whose minimum is at θ̂ = x̄. So in this case the two methods coincide.
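As a quick numerical check on (6)-(8) (a sketch of my own, not from the note), one can sample the predictive density in two stages, m from the posterior N(x̄, 1/N) and then x from N(m, 1), and confirm that the draws have mean x̄ and variance 1 + 1/N:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(5.0, 1.0, size=20)
xbar, N = data.mean(), len(data)

# Two-stage sampling of p(x|D): m ~ N(xbar, 1/N), then x ~ N(m, 1).
m_draws = rng.normal(xbar, np.sqrt(1.0 / N), size=500000)
x_draws = rng.normal(m_draws, 1.0)

print("predictive mean    :", x_draws.mean(), " expected:", xbar)
print("predictive variance:", x_draws.var(), " expected:", 1 + 1.0 / N)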

1.2 Gaussian data with absolute loss

Continue the previous setup but now let the loss function be

L(θ, x) = |θ − x|    (13)

What does decision theory tell us to do? The predictive density p(x|D) is the same as before, since it doesn't depend on L. We just need to minimize

R = ∫_x p(x|D) |θ − x| dx    (14)

For any p(x|D), the minimum loss is achieved at the median of the predictive distribution, i.e. the point θ̂ where

p(x < θ̂ | D) = p(x > θ̂ | D)    (15)

Since the predictive density is Gaussian, the median is the same as the mean, and θ̂ = x̄ as before.

What about Empirical Risk Minimization? The empirical risk is

ER = (1/N) Σ_i |θ − x_i|    (16)

whose minimum is at the sample median of the data. That is, θ̂ is chosen so that the number of samples less than θ̂ equals the number of samples greater than θ̂. For odd N the median is unique, while for even N there is an interval of equally valid minima. The main point here is that the estimate is not the same as that prescribed by decision theory.

Which estimate is better? Decision theory says the sample mean must be optimal, and it is hard to doubt this considering that the empirical risk is only an approximation to the Bayes risk. Nevertheless, we can confirm it with a simulation. We pick a value of m, draw N data points from the distribution N(m, 1), compute the two estimators, and measure the error on new data. There is no need to actually sample new data, since we know the expected loss will be

E[|x − θ̂|] = 2 N(θ̂; m, 1) + 2 (m − θ̂) φ(m − θ̂) − (m − θ̂)    (17)

φ(y) = ∫_{−∞}^{y} N(x; 0, 1) dx    (18)

We repeat the whole procedure many times (holding m fixed) and plot a histogram of the losses. Figure 1 shows a clear win for the sample mean over the sample median. The average test-data loss over 5000 trials was 0.8176 for the sample mean and 0.8267 for the sample median. Changing the amount of training data changes the absolute loss values but doesn't change the overall shape of the curves.
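The simulation is straightforward to reproduce. The following Python sketch mirrors the stated setup (m = 5, N = 20, 5000 trials, scoring with the closed-form loss (17)); it is a reconstruction rather than the author's code, and the random seed is arbitrary.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, N, trials = 5.0, 20, 5000

def expected_abs_loss(theta_hat):
    # Eq. (17): E|x - theta_hat| for x ~ N(m, 1), with d = m - theta_hat:
    # 2 N(theta_hat; m, 1) + 2 d Phi(d) - d.
    d = m - theta_hat
    return 2 * norm.pdf(d) + 2 * d * norm.cdf(d) - d

loss_mean, loss_median = [], []
for _ in range(trials):
    data = rng.normal(m, 1.0, size=N)
    loss_mean.append(expected_abs_loss(data.mean()))
    loss_median.append(expected_abs_loss(np.median(data)))

print("average test loss, sample mean  :", np.mean(loss_mean))
print("average test loss, sample median:", np.mean(loss_median))

The two averages should come out close to the 0.8176 and 0.8267 reported above, with the sample mean winning.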

[Figure 1 here: cumulative histograms of test-data loss; x-axis: test data loss (0.75 to 1.1), y-axis: fraction of trials; curves for sample mean and sample median.]

Figure 1: Cumulative histogram of the test-data loss for Bayesian decision theory (sample mean) and Empirical Risk Minimization (sample median) in Example 2. For each loss value, the curves show the percentage of trials (out of 5000) which had a test-data loss lower than that value. Higher curves are better. The true mean was m = 5 and there were N = 20 training points.

1.3 Laplacian data with squared loss

Instead of changing the loss function, now we change the sampling distribution. Let the loss function be quadratic and let the data be independent Laplacian with unit scale and unknown mean:

L(θ, x) = (θ − x)²    (19)

p(x|m) = (1/2) exp(−|x − m|)    (20)

Since the loss function is the same as in Example 1, Empirical Risk Minimization will again pick the sample mean. What about decision theory? With a uniform prior on m, the predictive distribution is

p(x|D) = ∫_m p(x|m) p(m|D) dm    (21)

p(m|D) = exp(−Σ_i |x_i − m|) / ∫ exp(−Σ_i |x_i − m|) dm    (22)

Since the loss function is quadratic, the minimum loss is achieved by the estimate θ̂ = E[x|D] = E[m|D], as before. This integral can be computed analytically, as shown in Appendix A. With more than 10 data points, it is also reasonable to approximate E[m|D] by the m which maximizes p(m|D), i.e. the maximum-likelihood estimate. In this case, maximizing the likelihood is equivalent to minimizing Σ_i |x_i − m|, which we already saw is minimized by the sample median.

So now the tables are turned: Bayesian decision theory picks the sample median while ERM picks the sample mean. Which is better? Figure 2 shows the result of a simulation. The test-data loss is computed analytically as

E[(x − θ̂)²] = (m − θ̂)² + 2    (23)

Even though the sample median is only an approximation to the Bayes optimum, it still does better than ERM. The average test-data loss over 5000 trials was 2.063 for exact Bayes, 2.066 for the sample median, and 2.099 for the sample mean.
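The following Python sketch reproduces this comparison under the same assumptions (m = 5, N = 20, 5000 trials); it is a reconstruction rather than the author's code, and it approximates the exact Bayes estimate E[m|D] by brute-force quadrature instead of the analytic formulas of Appendix A.

import numpy as np

rng = np.random.default_rng(0)
m, N, trials = 5.0, 20, 5000

def posterior_mean_numeric(data, half_width=10.0, points=2001):
    # E[m|D] for the unit-scale Laplacian likelihood with a flat prior,
    # by quadrature on a grid (the analytic version is in Appendix A).
    grid = np.linspace(data.mean() - half_width, data.mean() + half_width, points)
    logw = -np.abs(data[:, None] - grid[None, :]).sum(axis=0)
    w = np.exp(logw - logw.max())            # stabilize before exponentiating
    return np.sum(grid * w) / np.sum(w)

def expected_sq_loss(theta_hat):
    # Eq. (23): E[(x - theta_hat)^2] = (m - theta_hat)^2 + 2 for unit-scale Laplacian x.
    return (m - theta_hat) ** 2 + 2.0

losses = {"exact Bayes": [], "sample median": [], "sample mean": []}
for _ in range(trials):
    data = rng.laplace(loc=m, scale=1.0, size=N)
    losses["exact Bayes"].append(expected_sq_loss(posterior_mean_numeric(data)))
    losses["sample median"].append(expected_sq_loss(np.median(data)))
    losses["sample mean"].append(expected_sq_loss(data.mean()))

for name, values in losses.items():
    print(name, ":", np.mean(values))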

[Figure 2 here: cumulative histograms of test-data loss; x-axis: test data loss (2 to 2.8), y-axis: fraction of trials; curves for exact Bayes, sample median, and sample mean.]

Figure 2: Cumulative histogram of the test-data loss for Bayesian decision theory (implemented exactly vs. approximately by the sample median) and Empirical Risk Minimization (sample mean) in Example 3. The true mean was m = 5 and there were N = 20 training points.

2 ERM is a special case of Bayesian decision theory

Is there a choice of sampling distribution which makes the Bayes risk (1) identical to the empirical risk (2)? Yes: all we need is for the predictive density p(x|D) to be a sum of impulses at the training points:

p(x|D) = (1/N) Σ_i δ(x − x_i)    (24)

In other words, we have a maximally vague model, which assumes nothing beyond the training data. The only things we expect are things we have seen before. Such a model can be represented mathematically by a Dirichlet process [1] with parameter zero (which Ferguson calls the "noninformative Dirichlet prior").

3 Maximum-likelihood

This paper has focused on the deficiencies of ERM. Maximum-likelihood is also deficient, but in different ways. Maximum-likelihood does employ a sampling distribution, but it does not model uncertainty in the free parameters and it does not use a loss function. Maximum-likelihood does pretty well on the examples in this paper, matching the optimal result in the first two examples and being approximately optimal in the third. But because of these deficiencies, one can construct counterexamples for maximum-likelihood as well.

References

[1] T. S. Ferguson. A Bayesian analysis of some nonparametric problems. Annals of Statistics, 1(2):209-230, 1973.

A Posterior mean of the Laplacian likelihood

We want to compute

E[m|D] = ∫ m exp(−Σ_i |x_i − m|) dm / ∫ exp(−Σ_i |x_i − m|) dm    (25)

Renumber the data points so that they are sorted: x_1 ≤ x_2 ≤ ... ≤ x_N. By splitting the range of integration into the separate intervals (−∞, x_1), (x_1, x_2), ..., (x_N, ∞), we get a series of simpler integrals. In general we get

∫ f(m) exp(−Σ_i |x_i − m|) dm
  = exp(−Σ_{i=1}^{N} x_i) ∫_{−∞}^{x_1} f(m) exp(N m) dm    (26)
  + exp(x_1 − Σ_{i=2}^{N} x_i) ∫_{x_1}^{x_2} f(m) exp((N − 2) m) dm    (27)
  + exp(x_1 + x_2 − Σ_{i=3}^{N} x_i) ∫_{x_2}^{x_3} f(m) exp((N − 4) m) dm    (28)
  + ... + exp(Σ_{i=1}^{N} x_i) ∫_{x_N}^{∞} f(m) exp(−N m) dm    (29)

For f(m) = 1 the inner integrals are

∫_{−∞}^{x_1} exp(N m) dm = exp(N x_1) / N    (30)

∫_{x_i}^{x_{i+1}} exp((N − 2i) m) dm = [exp((N − 2i) x_{i+1}) − exp((N − 2i) x_i)] / (N − 2i)  if N − 2i ≠ 0,
                                      = x_{i+1} − x_i  if N − 2i = 0    (31)

∫_{x_N}^{∞} exp(−N m) dm = exp(−N x_N) / N    (32)

For f(m) = m the inner integrals are

∫_{−∞}^{x_1} m exp(N m) dm = exp(N x_1) (N x_1 − 1) / N²    (33)

∫_{x_i}^{x_{i+1}} m exp((N − 2i) m) dm = [exp((N − 2i) x_{i+1}) ((N − 2i) x_{i+1} − 1) − exp((N − 2i) x_i) ((N − 2i) x_i − 1)] / (N − 2i)²  if N − 2i ≠ 0,
                                        = (x_{i+1}² − x_i²) / 2  if N − 2i = 0    (34)

∫_{x_N}^{∞} m exp(−N m) dm = exp(−N x_N) (N x_N + 1) / N²    (35)
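For completeness, here is a Python implementation of the piecewise integrals (25)-(35), with a brute-force quadrature check; it is an illustrative sketch under the stated setup (unit-scale Laplacian likelihood, flat prior), and the centering-by-the-median step is a numerical-stability choice of mine rather than something prescribed by the appendix.

import numpy as np

def laplace_posterior_mean(data):
    # E[m|D] under a flat prior on m and a unit-scale Laplacian likelihood,
    # via the piecewise-exponential integrals (25)-(35).
    x = np.sort(np.asarray(data, dtype=float))
    shift = np.median(x)              # center the data for numerical stability
    x = x - shift
    N = len(x)
    Z = 0.0                           # denominator of (25)
    M = 0.0                           # numerator of (25)
    for k in range(N + 1):            # interval k lies between x_k and x_{k+1}
        a = N - 2 * k                 # slope of the exponent on this interval
        C = np.exp(np.sum(x[:k]) - np.sum(x[k:]))   # constant factor, cf. (26)-(29)
        if k == 0:                    # (-inf, x_1), eqs. (30) and (33)
            z = np.exp(a * x[0]) / a
            m = np.exp(a * x[0]) * (a * x[0] - 1) / a**2
        elif k == N:                  # (x_N, +inf), eqs. (32) and (35)
            z = -np.exp(a * x[-1]) / a
            m = -np.exp(a * x[-1]) * (a * x[-1] - 1) / a**2
        elif a != 0:                  # interior interval, eqs. (31) and (34)
            L, U = x[k - 1], x[k]
            z = (np.exp(a * U) - np.exp(a * L)) / a
            m = (np.exp(a * U) * (a * U - 1) - np.exp(a * L) * (a * L - 1)) / a**2
        else:                         # a = 0: plain integrals of 1 and m
            L, U = x[k - 1], x[k]
            z = U - L
            m = (U**2 - L**2) / 2
        Z += C * z
        M += C * m
    return M / Z + shift

# Sanity check against brute-force numerical integration.
rng = np.random.default_rng(0)
data = rng.laplace(loc=5.0, scale=1.0, size=20)
grid = np.linspace(data.min() - 10, data.max() + 10, 100001)
w = np.exp(-np.abs(data[:, None] - grid[None, :]).sum(axis=0))
print(laplace_posterior_mean(data), np.sum(grid * w) / np.sum(w))

The analytic and numerical values should agree to several decimal places; the analytic version is what an exact-Bayes implementation of Example 3 would use in place of quadrature.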