Physics 509: Bootstrap and Robust Parameter Estimation

Scott Oser, Lecture #20

Nonparametric parameter estimation
Question: what error estimate should you assign to the slope and intercept from this fit? You are not given the error bars, and you are not told the distribution of the errors. In fact, all you are told is that the residuals are independent and identically distributed.

Nonparametric parameter estimation
This sounds like an impossible problem. For either an ML estimator or a Bayesian solution to the problem we need to be able to write down the likelihood function:
$L = \prod_{i=1}^{N} f\big(y_i - y(x_i)\big)$
Here f( ) is the distribution of the residuals between the data and the model. If, for example, f( ) is a Gaussian, then the ML estimator becomes the least-squares estimator. If you don't know f( ), then you're seemingly screwed.
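The step from a Gaussian f( ) to least squares can be written out explicitly. The following is a short sketch, assuming every residual has the same known width $\sigma$ (a symbol introduced here only for illustration):

```latex
f(r) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{r^2}{2\sigma^2}\right)
\quad\Rightarrow\quad
-\ln L = \frac{1}{2\sigma^2} \sum_{i=1}^{N} \big[y_i - y(x_i)\big]^2 + N \ln\left(\sigma\sqrt{2\pi}\right)
```

The second term does not depend on the model parameters, so maximizing L is the same as minimizing the sum of squared residuals.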

Bootstrap
The bootstrap method is an attempt to calculate the distribution of the errors from the data itself, and to use this to calculate the errors on the fit. After all, the data contain a lot of information about the errors.

Description of the bootstrap
It's a very simple technique. You start with a set of N independent and identically distributed observations, and calculate some estimator $\hat{\theta}(x_1 \ldots x_N)$ of the data. To get the error on $\hat{\theta}$, do the following:
1) Make a new data set by selecting N random observations from the data, with replacement. Some data points will get selected more than once, others not at all. Call this new data set X'.
2) Calculate $\hat{\theta}(X')$.
3) Repeat the procedure many times (at least 100).
The width of the distribution of the $\hat{\theta}$ values calculated from the resampled data sets gives you your error on $\hat{\theta}$. Effectively, you use your own data to make Monte Carlo data sets.
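The three steps can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names `bootstrap_error` and `interquartile_width` are made up for this sketch, and the Gaussian sample stands in for whatever data you actually have.

```python
import random
import statistics


def bootstrap_error(data, estimator, n_resamples=1000, seed=0):
    """Bootstrap estimate of the error on estimator(data)."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Step 1: draw N observations from the data, with replacement.
        resample = rng.choices(data, k=n)
        # Step 2: evaluate the estimator on the resampled data set.
        estimates.append(estimator(resample))
    # Step 3: the width of the distribution of estimates is the error.
    return statistics.stdev(estimates)


def interquartile_width(values):
    """Distance between the 75th and 25th percentile points."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Stand-in data (an assumption): 1000 draws from a unit Gaussian.
demo_rng = random.Random(42)
sample = [demo_rng.gauss(0.0, 1.0) for _ in range(1000)]

print("error on median:", bootstrap_error(sample, statistics.median, 500))
print("error on width: ", bootstrap_error(sample, interquartile_width, 500))
```

Any function of the data can be passed as `estimator`, which is the main attraction of the method: nothing in the resampling loop needs to know how the estimator works.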

Justification for the bootstrap
This sounds like cheating, and it has to be conceded that the procedure doesn't always work. But it has some intuitive justification. To calculate the real statistical error on your estimator you'd need to know the true distribution F(x) that the data is drawn from. Given that you don't know the form of F(x), you could try to estimate it with its nonparametric maximum likelihood estimator. This of course is just the observed distribution of x. You basically assume that your observed distribution of x is a fair measure of the true distribution F(x) and can be used to generate Monte Carlo data sets. Obviously this is going to work better when N is large, since the better your estimate of F(x), the more accurate your results are going to be.

Advantages of the Bootstrap
The bootstrap has a number of advantages:
1) Like Monte Carlo, you can use it to estimate errors on parameters that depend in a complicated way on the data.
2) You can use it even when you don't know the true underlying error distribution. It's nonparametric in this sense.
3) You don't have to generate zillions of Monte Carlo data sets to use it: it simply uses the data itself.

Bootstrap example #1
Consider the following setup: 1000 data points are drawn from the distribution to the right. We sort them and take the distance between the 75th and 25th percentile points as a measure of the width of the distribution. We want to determine the uncertainty on this width parameter.

Bootstrap example #1: error on width
If we knew the true distribution, we could Monte Carlo it. Suppose we don't, and are only handed the set of 1000 data points. The histograms on the right show:
top: width parameter distribution from 10000 independent Monte Carlo data sets
bottom: width parameter distribution from bootstrap resampling of the original data

Bootstrap example #2
What are the errors on the slope and intercept for this data? First fit the data for your best-fit line (as shown), using whatever estimator you like for the line (e.g. ML, least squares, etc.). Now calculate the distribution of the residuals.

Bootstrap example #2: a line fit
Generate new bootstrap data sets according to:
$y'(x_i) = m x_i + b + \mathrm{residual}_j$
where we randomly pick one of the N residual values to add to the best-fit value. We get m = 1.047 ± 0.024 and b = 4.57 ± 0.42.
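A minimal sketch of this residual-resampling recipe follows. It assumes an ordinary least-squares fit as the line estimator, and it uses made-up data (30 points, slope 1, intercept 5, Gaussian noise of width 2), since the lecture's data set is not reproduced here; the helper names `fit_line` and `residual_bootstrap` are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, y_mean - m * x_mean


def residual_bootstrap(xs, ys, n_resamples=1000, seed=0):
    """Return ((m, m_err), (b, b_err)) from resampling the fit residuals."""
    rng = random.Random(seed)
    m_hat, b_hat = fit_line(xs, ys)
    residuals = [y - (m_hat * x + b_hat) for x, y in zip(xs, ys)]
    slopes, intercepts = [], []
    for _ in range(n_resamples):
        # y'(x_i) = m x_i + b + residual_j, with j picked at random
        # (with replacement) from the N observed residuals.
        fake_ys = [m_hat * x + b_hat + rng.choice(residuals) for x in xs]
        m, b = fit_line(xs, fake_ys)
        slopes.append(m)
        intercepts.append(b)
    return ((m_hat, statistics.stdev(slopes)),
            (b_hat, statistics.stdev(intercepts)))


# Illustrative data only (an assumption, not the lecture's data set).
demo_rng = random.Random(509)
xs = [float(i) for i in range(30)]
ys = [1.0 * x + 5.0 + demo_rng.gauss(0.0, 2.0) for x in xs]

(m, m_err), (b, b_err) = residual_bootstrap(xs, ys)
print(f"m = {m:.3f} +/- {m_err:.3f}")
print(f"b = {b:.2f} +/- {b_err:.2f}")
```

Resampling the residuals rather than the (x, y) pairs keeps the x values fixed, which matches the formula above.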

Evaluation of bootstrap example #2: a line fit
The bootstrap data sets gave us:
m = 1.047 ± 0.024, b = 4.57 ± 0.42
But Monte Carlo data sets sampled from the true distribution give:
m = 1.00 ± 0.038, b = 5.02 ± 0.68
The error estimates in this case don't agree well. The problem is that we got kind of unlucky in the real data: with only 30 data points, the observed residual distribution happens to be narrower than the true residual distribution. This is an example of a case where the bootstrap didn't work well.

When not to use the bootstrap
Regrettably, the conditions in which the bootstrap gives bad results are not fully understood yet. Some circumstances to be wary of:
1) Small sample sizes (N < 50)
2) Distributions with infinite moments
3) Using bootstraps to estimate extreme values (e.g. random numbers are drawn from the interval [a,b], and you want to estimate a and b)
4) Any case where you have reason to suspect that the data don't accurately sample the underlying distribution (e.g. you expect the residuals to be symmetric based on physics, but for the sample you have they aren't very symmetric)

Robust parameter estimation
Almost every statistical test makes some implicit assumptions. Some make more than others. Examples of assumptions that are often made and sometimes questionable:
- independent errors
- Gaussian errors
- some other specific model of the errors
When these assumptions are violated, the test may break. Data outliers are a good example: if you're trying to compare the average incomes of college graduates with people who don't have college degrees, the accidental presence of drop-out Bill Gates in your random sample will really mess you up!
A robust test is a test that isn't too sensitive to violations of the model assumptions.

Breakdown point
The breakdown point of an estimator is the proportion of incorrect observations with arbitrarily large errors that the estimator can accept before giving an arbitrarily large result. The mean of a distribution has a breakdown point of zero: even one wildly wrong data point will pull the mean! The median of a distribution is very robust, however: up to half of the data points can be corrupted before the median changes (although this seemingly assumes in turn that the corrupted data are equally likely to lie above and below the median).
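A small numerical illustration of the difference, using made-up numbers: 101 Gaussian points around 10, with a growing number of them replaced by a wildly wrong value. The helper `corrupt` is hypothetical. Note that the corruption here is deliberately one-sided, so the median does shift, but it stays within the range of the good data until more than half of the points are bad.

```python
import random
import statistics

rng = random.Random(0)
clean = [rng.gauss(10.0, 1.0) for _ in range(101)]


def corrupt(data, n_bad, bad_value=1.0e6):
    """Replace the first n_bad entries with a wildly wrong value."""
    return [bad_value] * n_bad + data[n_bad:]


print("n_bad        mean      median")
for n_bad in (0, 1, 25, 50, 51):
    d = corrupt(clean, n_bad)
    print(f"{n_bad:5d} {statistics.fmean(d):11.1f} {statistics.median(d):11.1f}")
```

A single bad point is enough to drag the mean far away from 10, while the median only jumps to the bad value once 51 of the 101 points are corrupted.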

Non-parametric tests as robust estimators
You've already seen a number of robust estimators and tests:
- the median as a measure of the location of a distribution
- rank statistics as a measure of a distribution's width: for example, the data value at the 75th percentile minus the value at the 25th percentile
- the Kolmogorov-Smirnov goodness-of-fit test
- Spearman's correlation coefficient
The non-parametric tests we studied in Lecture 19 are in general going to be more robust, although less powerful, than more specific tests.

M-estimator
In a maximum likelihood estimator, we start with some PDF for the error residuals:
$f(y_{\mathrm{data}} - y_{\mathrm{model}})$
We then seek to minimize the negative log likelihood:
$-\sum_i \ln f\big(y_i - y_{\mathrm{model}}(x_i)\big)$
This is all well justified in terms of probability theory, so we can use this likelihood in Bayes' theorem, etc. Most commonly f() is a Gaussian distribution.
An M-estimator is a generalization of the ML estimator. Rather than using a Gaussian distribution, or another distribution as dictated by your error model, use a different distribution designed to be more robust.

How M-estimators work
Define some fit function you want to minimize:
$\sum_{i=1}^{N} \rho\left(\frac{y_i - y(x_i)}{\sigma_i}\right)$
After taking the derivative with respect to the fit parameters $a_k$ and setting it to zero we get:
$0 = \sum_{i=1}^{N} \frac{1}{\sigma_i}\, \psi\left(\frac{y_i - y(x_i)}{\sigma_i}\right) \frac{\partial y(x_i)}{\partial a_k}$ for k = 1, 2, ..., M
where $\psi(x) \equiv d\rho/dx$. This function is a weighting function that dictates how strongly deviant points are weighted. Let's see some examples...

Examples of some M-estimators
Gaussian errors: the deviation enters as the quantity squared, so bigger deviants are weighted more.
Absolute value: equivalent to minimizing $\sum_i \left|\frac{y_i - y(x_i)}{\sigma_i}\right|$

Examples of some M-estimators
Cauchy distribution: appropriate if the errors follow a Lorentzian. The weighting function goes to zero for large deviations.
Tukey's biweight: $\psi(x) = x\,(1 - x^2/36)^2$ for $|x| < 6$. All deviations larger than $6\sigma$ are ignored completely!
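The four weighting functions (the derivative of the fit function, called psi below) are easy to compare numerically. In this sketch x is the deviation in units of the error. The Gaussian, absolute-value and biweight forms follow directly from the descriptions above; the Cauchy form assumes the common textbook convention rho(x) = ln(1 + x^2/2), which is not spelled out in the lecture. The function names are made up for this example.

```python
import math


def psi_gaussian(x):
    """rho = x^2 / 2: the weight grows without limit."""
    return x


def psi_absolute(x):
    """rho = |x|: every deviant point gets the same weight, by sign only."""
    return math.copysign(1.0, x) if x != 0 else 0.0


def psi_cauchy(x):
    """rho = ln(1 + x^2 / 2) (assumed convention): weight falls off as 2/x."""
    return x / (1.0 + 0.5 * x * x)


def psi_tukey(x, c=6.0):
    """Tukey's biweight: deviations of c or more are ignored completely."""
    if abs(x) >= c:
        return 0.0
    return x * (1.0 - x * x / (c * c)) ** 2


print("    x  gaussian  absolute    cauchy     tukey")
for x in (0.5, 1.0, 3.0, 6.0, 20.0):
    print(f"{x:5.1f} {psi_gaussian(x):9.3f} {psi_absolute(x):9.3f} "
          f"{psi_cauchy(x):9.3f} {psi_tukey(x):9.3f}")
```

The table makes the qualitative difference visible: only the Gaussian weight keeps growing with the size of the deviation.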

An example: linear fit with an M-estimator
Minimizing the sum:
$\sum_i \left|\frac{y_i - y(x_i)}{\sigma_i}\right|$
The black line is from a $\chi^2$ fit, while the red line is from using the absolute value version. The red line goes much closer to the majority of data points. The fit may be more robust, but you don't get an error estimate easily. If you know the true underlying distribution, use MC; otherwise try the bootstrap to get the variance on the best fit.
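For a straight line with equal errors on every point, the absolute-value fit can be done by brute force, because a line minimizing the sum of absolute residuals always exists that passes through at least two of the data points. The sketch below uses that property instead of a general minimizer; it is not necessarily how the fit in the lecture was done. The function names and the ten-point data set with one outlier are invented for illustration.

```python
from itertools import combinations


def least_squares_line(xs, ys):
    """Ordinary least-squares slope and intercept, for comparison."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_mean - m * x_mean


def lad_line_fit(xs, ys):
    """Least-absolute-deviation line, found by trying every pair of points."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # a vertical line cannot be written as y = m x + b
        m = (y2 - y1) / (x2 - x1)
        b = y1 - m * x1
        cost = sum(abs(y - (m * x + b)) for x, y in zip(xs, ys))
        if best is None or cost < best[0]:
            best = (cost, m, b)
    if best is None:
        raise ValueError("need at least two points with different x")
    return best[1], best[2]


# Made-up data: points exactly on y = x + 5, with one wild outlier.
xs = [float(i) for i in range(10)]
ys = [x + 5.0 for x in xs]
ys[-1] += 30.0

print("least squares:      m = %.3f, b = %.3f" % least_squares_line(xs, ys))
print("absolute deviation: m = %.3f, b = %.3f" % lad_line_fit(xs, ys))
```

The pair search costs of order N^3 operations, which is fine for a few dozen points but not for large data sets.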

What M-estimators don't do...
You'd only use an M-estimator if you don't know the true probability distribution of the residuals, or to cover yourself in case you have the wrong error model. You cannot easily interpret the results in terms of probability theory: Bayes' theorem is out, for example, as are most frequentist tests. About the only thing you can do to determine the errors on your parameters is to try the bootstrap method (remembering that this assumes that the measured data have accurately sampled the true underlying probability distribution). (Of course if you know the underlying probability distribution you can still use an M-estimator, and use probability theory to determine the PDFs for your estimator, but it would be a little strange to do so.)

Masking outliers
Outliers can be very tricky. A really big outlier can mask the presence of other outlying points. For example, suppose we decide in advance we'll throw out any point that is more than $3\sigma$ from the mean. That rejects 4 points from the top plot. But if we look at what remains, the point at +5.0 is also pretty far out, although it passed our first cut. Do we iterate?
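The masking effect is easy to reproduce. The sketch below applies a 3-sigma cut repeatedly to made-up data: 100 unit-Gaussian points, four large outliers, and one moderate outlier at +5.0. The numbers and the helper names `sigma_clip_once` and `sigma_clip` are assumptions chosen to mimic the situation described, not the lecture's data. The code shows what iterating does; it does not settle whether iterating is the right thing to do.

```python
import random
import statistics


def sigma_clip_once(values, n_sigma=3.0):
    """Keep only points within n_sigma sample standard deviations of the mean."""
    mean = statistics.fmean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= n_sigma * sigma]


def sigma_clip(values, n_sigma=3.0, max_iter=10):
    """Repeat the cut until nothing more is rejected; return every pass."""
    history = [list(values)]
    for _ in range(max_iter):
        kept = sigma_clip_once(history[-1], n_sigma)
        if len(kept) == len(history[-1]):
            break
        history.append(kept)
    return history


rng = random.Random(3)
data = [rng.gauss(0.0, 1.0) for _ in range(100)]
data += [20.0, -25.0, 30.0, 22.0]  # big outliers that inflate sigma
data += [5.0]                      # moderate outlier, masked at first

for i, kept in enumerate(sigma_clip(data)):
    print(f"pass {i}: {len(kept)} points, sigma = {statistics.stdev(kept):.2f}, "
          f"+5.0 kept: {5.0 in kept}")
```

On the first pass the large outliers inflate the standard deviation so much that +5.0 looks unremarkable; once they are gone, the same point fails the cut.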

Robust Bayesian estimators
There really isn't any such thing as a non-parametric Bayesian calculation. Bayesian analyses need a clearly defined hypothesis space. But Bayesian analyses can be made more robust by intelligently parametrizing the error distribution itself with some free parameters. Consider fitting a straight line to this data.

Naïve Bayesian result
Fit using Gaussian errors with nominal errors of 0.5. 68% 1D credible regions:
m = 0.9894 ± 0.0084, b = 5.731 ± 0.097
The true data were drawn from m = 1.0, b = 5.0. Because the error estimates are just not realistic, we get an absurd result.

Conservative Bayesian result
Instead of fixing the errors at 0.5, make $\sigma$ a free parameter. Give it a Jeffreys prior between 0.1 and 20.0.
$P(m, b, \sigma \mid D, I) \propto \frac{1}{\sigma} \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2} \frac{(y_i - m x_i - b)^2}{\sigma^2}\right]$
I used a Markov Chain Monte Carlo to calculate the joint PDF of these three parameters and to generate 1D PDFs for each by marginalizing over the other two parameters.
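A calculation of this kind can be sketched with a plain random-walk Metropolis sampler. This is not the code used for the lecture, just a minimal standard-library version under stated assumptions: flat priors on m and b, a 1/sigma prior on sigma between 0.1 and 20, hand-picked proposal step sizes, and made-up data (21 points, m = 1, b = 5, Gaussian noise of width 2). The function names are hypothetical.

```python
import math
import random
import statistics


def log_posterior(m, b, sigma, xs, ys, sigma_min=0.1, sigma_max=20.0):
    """Log of (1/sigma prior) x (Gaussian likelihood), up to a constant."""
    if not sigma_min <= sigma <= sigma_max:
        return -math.inf
    chi2 = sum((y - m * x - b) ** 2 for x, y in zip(xs, ys)) / sigma ** 2
    return -(len(xs) + 1) * math.log(sigma) - 0.5 * chi2


def metropolis(xs, ys, start, steps, n_samples=20000, burn_in=2000, seed=0):
    """Random-walk Metropolis chain over (m, b, sigma)."""
    rng = random.Random(seed)
    current = list(start)
    current_lp = log_posterior(*current, xs, ys)
    chain = []
    for i in range(n_samples + burn_in):
        proposal = [c + rng.gauss(0.0, s) for c, s in zip(current, steps)]
        lp = log_posterior(*proposal, xs, ys)
        # Accept with probability min(1, posterior ratio).
        if lp >= current_lp or rng.random() < math.exp(lp - current_lp):
            current, current_lp = proposal, lp
        if i >= burn_in:
            chain.append(tuple(current))
    return chain


# Made-up data; x is centred on zero so that m and b are nearly uncorrelated.
demo_rng = random.Random(509)
xs = [float(x) for x in range(-10, 11)]
ys = [1.0 * x + 5.0 + demo_rng.gauss(0.0, 2.0) for x in xs]

chain = metropolis(xs, ys, start=(1.0, 5.0, 2.0), steps=(0.05, 0.3, 0.3))
for name, column in zip(("m", "b", "sigma"), zip(*chain)):
    # Marginalizing over the other parameters = just looking at one column.
    print(f"{name} = {statistics.fmean(column):.3f} +/- {statistics.stdev(column):.3f}")
```

Working with the log of the posterior avoids numerical underflow in the product over data points, and the unknown normalization constant cancels in the acceptance ratio.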

Conservative Bayesian result
m = 0.9904 ± 0.0799, b = 5.697 ± 0.957, $\sigma$ = 4.41 ± 0.32
Results are consistent with the true values, but with much bigger errors. Remember: by the Maximum Entropy principle, a Gaussian error assumption contains the least information of any error assignment assumption. Can we do better?

Two-component Bayesian fit
What if we model the errors as if some fraction are Gaussian and the others are from a Cauchy distribution (to give wide tails)?
$g(\epsilon) = f\,\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right) + (1-f)\,\frac{1}{\pi\gamma}\,\frac{1}{1 + \epsilon^2/\gamma^2}$
Now there are 5 free parameters: m, b, $\sigma$, f, and $\gamma$. Again, Markov Chain Monte Carlo is the most efficient way to calculate this. I use uniform priors on f, which ranges from 0 to 1, and on the width $\gamma$ of the Cauchy distribution (between 0.1 and 10).
Results: m = 1.009 ± 0.015, b = 4.933 ± 0.168
Consistent with the true values, and with much smaller uncertainties than the Gaussian fit. This comes at the price of some increased model dependence.
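The two-component error model is simple to code up as a density and a log likelihood, which is all a sampler needs in addition to the priors. A minimal sketch, with hypothetical function names; `gamma` denotes the Cauchy width and `f` the Gaussian fraction.

```python
import math


def mixture_pdf(eps, f, sigma, gamma):
    """Gaussian (fraction f, width sigma) plus Cauchy (fraction 1 - f, width gamma)."""
    gauss = math.exp(-0.5 * (eps / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
    cauchy = 1.0 / (math.pi * gamma * (1.0 + (eps / gamma) ** 2))
    return f * gauss + (1.0 - f) * cauchy


def log_likelihood(m, b, sigma, f, gamma, xs, ys):
    """Log likelihood of a straight line with mixture-distributed residuals."""
    return sum(math.log(mixture_pdf(y - m * x - b, f, sigma, gamma))
               for x, y in zip(xs, ys))


# How much probability each model gives to a large residual (illustrative values).
for eps in (0.0, 2.0, 5.0, 10.0):
    pure = mixture_pdf(eps, 1.0, 1.0, 1.0)   # Gaussian only
    mixed = mixture_pdf(eps, 0.8, 1.0, 1.0)  # 20% Cauchy admixture
    print(f"eps = {eps:5.1f}: Gaussian {pure:.3e}, mixture {mixed:.3e}")
```

Even a modest Cauchy fraction keeps the density at large residuals many orders of magnitude above the pure Gaussian value, which is why a few outliers no longer dominate the fit.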

Two-component Bayesian fit: graphs
In doing this problem I noticed a weird bimodal behaviour in the error parameters. In retrospect the answer is obvious: the data is equally well fit by a narrow Gaussian error and a wide Cauchy error as by a narrow Cauchy error and a wide Gaussian error. This doesn't affect the final answer, but perhaps I should have broken this degeneracy in my prior.

Notes of caution
Obviously, be VERY CAUTIOUS about throwing out any data points or deweighting outliers, especially if you don't understand what causes them!
If you look at the final answer before excluding outliers, you are certainly not doing a blind analysis, and there's an excellent chance you're badly biasing your result! (But you may be able to do blind outlier rejection if you plan for it in advance.)
Almost all standard results and tests in probability theory fail for censored data. You're forced to use MC or bootstrap.
Is that outlier actually a new discovery you're throwing out? For example, the Mössbauer effect showed up as noise in Mössbauer's PhD thesis. He wasn't looking for it, and had he rejected it with an outlier cut, would he still have won the Nobel Prize for his PhD?