[Figure: bootstrap interval bounds versus the number of bootstrap samples B.]

Figure 4.7 Dependence of percentile and BCa bootstrap intervals (68% CL) on the number of bootstrap samples. The top pair of curves are the estimated upper limit of the interval, and the bottom pair the lower limit. In each pair, the upper curve is from the BCa method, and the lower is from the percentile method. All bootstrap replicas are independent, so the local scatter of the points indicates the statistical variation from the resampling process. The sample is of size 20 from N(0, 1). The sample variance is 0.568 for this sample.

Table 4.1 Comparing coverage of estimated 68% confidence intervals from the bootstrap percentile and BCa methods.

                             Percentile    BCa
  Target tail probability        0.1587
  Low tail probability         0.0826      0.2093
  High tail probability        0.3035      0.1889
  Target coverage                 0.6827
  Coverage                     0.6139      0.6018

4.5 Cross-Validation

In later chapters we develop a variety of classification algorithms. We can view the typical situation as one where we have a set of understood data from which we wish to learn, and thence make predictions about data with some unknown characteristic. It is important to know how good our prediction is. Another situation occurs when we wish to make a prediction in the familiar regression setting. Again, we wish to know how good our prediction is. We may attempt to provide a measure for this with the expected prediction error (EPE), defined below. The EPE requires knowing the sampling distribution, which is in general not available. Thus, we must find a means to estimate the EPE. A technique for doing this is the resampling method known as cross-validation.

To define the expected prediction error, consider the regression problem (it could also be a classification problem, as we shall discuss in Section 9.4), where we wish to predict a value for random variable Y, depending on the value for random variable X. In general, X is a vector of random variables, but we will treat it as one-dimensional for the moment. The joint probability distribution for X and Y is F(x, y), with pdf f(x, y). We wish to find the best prediction for Y, given any X = x. That is, we look for a function r(x) providing our prediction for Y. What do we mean by "best"? Well, that is up to us to decide; there are many possibilities. However, the most common choice is an estimator that minimizes the expected squared deviation, and this is how we define the expected (squared) prediction error:

    EPE(r) = E\{[Y - r(X)]^2\} = \int [y - r(x)]^2 f(x, y) \, dx \, dy .    (4.38)

Note that the expectation is over both X and Y; it is the expected error (squared) over the joint distribution. The function r that minimizes the EPE is the regression function

    r(x) = E(Y | X = x) = \int y \, f(y | x) \, dy ,    (4.39)

that is, the conditional expectation for Y, given X = x. (To see this, note that for each fixed x, the quantity E\{[Y - c]^2 | X = x\} is minimized over c by the choice c = E(Y | X = x).) Fortunately, this is a nicely intuitive result. We remark again that x and y may be multivariate.

How might we estimate the EPE? For a simple case, consider a bivariate dataset D = \{(X_n, Y_n), n = 1, ..., N\}. Suppose we are interested in finding the best straight-line fit. In this case, our regression function is \ell(x) = ax + b. We estimate the parameters a and b by finding \hat{a} and \hat{b} that minimize (assuming equal weights for simplicity)

    \sum_{n=1}^{N} ( Y_n - \hat{a} X_n - \hat{b} )^2 .    (4.40)

The value of a new sampling is predicted given X_{N+1}:

    \hat{Y}_{N+1} = \hat{a} X_{N+1} + \hat{b} .    (4.41)

We wish to estimate the EPE for Y_{N+1}. A simple approach is to divide our dataset \{(X_i, Y_i), i = 1, ..., N\} into two pieces, perhaps two halves. Then one piece (the training set) could be used to determine the regression function, and the other piece (the testing set) could be used to estimate the EPE. However, this seems a bit wasteful, since we are only using half of the available data to obtain our regression function, and we could do a better job with all of the data. The next thing that occurs to us is to reverse the roles of the two pieces and somehow average the results, and this is a pretty good idea. But let us take this to an extreme, known as leave-one-out cross-validation.
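To make the train/test split concrete, here is a minimal MATLAB sketch of the holdout estimate. It is our illustration, not the book's code; the variable names and the use of polyfit/polyval for the straight-line fit are assumptions:

    % Holdout estimate of the EPE for a straight-line fit: train on a
    % random half of the data, estimate the EPE on the other half.
    % Assumes column vectors x and y of equal length N (illustrative names).
    N = length(x);
    perm = randperm(N);                      % random split of the indices
    train = perm(1:floor(N/2));
    test  = perm(floor(N/2)+1:end);
    p = polyfit(x(train), y(train), 1);      % fit y = a*x + b on the training half
    yhat = polyval(p, x(test));              % predict the held-out points
    epeHat = mean((y(test) - yhat).^2);      % estimated expected prediction error

Reversing the roles of the two halves and averaging the two values of epeHat gives the averaged variant just mentioned.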

The algorithm for leave-one-out cross-validation is as follows:

1. Form N subsets of the dataset D, each one leaving out a different datum, say (X_k, Y_k). We will use the subscript -k to denote quantities obtained omitting datum (X_k, Y_k). Likewise, we let D_{-k} be the dataset leaving out (X_k, Y_k).
2. Do the regression on dataset D_{-k}, obtaining regression function r_{-k}.
3. Using this regression, predict the value for the missing point:

       \hat{Y}_k = r_{-k}(X_k) .    (4.42)

4. Repeat this process for k = 1, ..., N. Estimate the EPE according to

       \frac{1}{N} \sum_{k=1}^{N} ( \hat{Y}_k - Y_k )^2 .    (4.43)

Let us try an example application. Suppose we wish to investigate polynomial regression models for a dataset (perhaps an angular distribution or a background distribution). For example, we have a dataset and wish to consider whether to use the straight-line relation

    Y = aX + b ,    (4.44)

or the quadratic relation

    Y = aX^2 + bX + c .    (4.45)

We know that the fitted errors for the quadratic model will always be smaller than for the linear model. A common approach is to compute the sum of squared fitted errors and apply some criterion to the difference in this quantity between the two models, often resorting to an approximation with a χ² distribution. That is, we use the Snedecor F distribution to compare the χ² values from fits for the two models. However, we may not wish to rely on the accuracy of this approximation. In this case, cross-validation may be applied. The predictive error is not necessarily smaller with the additional adjustable parameters, so we may use our estimated prediction errors as a means to decide between models.

Suppose, for example, that our data is actually sampled from the linear model, as in the filled circles in Figure 4.8a. We do cross-validation estimates of the prediction error for both the linear and quadratic fit models, and take the difference (linear EPE minus quadratic EPE). The distribution of this difference, for 100 such experiments, is shown in Figure 4.8b. Choosing the linear model when the difference is less than zero gets it right in 84 out of 100 cases. The MATLAB function crossval is used to perform the estimates, with calls of the form:

    crossval('mse', x, y, 'Predfun', @linereg, 'leaveout', 1);
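For readers without access to crossval, the leave-one-out estimate is easy to code directly. The following sketch is our own illustration, not the book's implementation; it computes the estimate (4.43) for both models using the standard polyfit/polyval functions:

    % Leave-one-out cross-validation estimate of the EPE, Eq. (4.43),
    % for polynomial models of degree 1 (linear) and degree 2 (quadratic).
    % Assumes column vectors x and y of equal length N (illustrative names).
    N = length(x);
    epeHat = zeros(1, 2);                        % one estimate per model
    for deg = 1:2
        sqErr = zeros(N, 1);
        for k = 1:N
            keep = [1:k-1, k+1:N];               % dataset D_{-k}, omitting point k
            p = polyfit(x(keep), y(keep), deg);  % regression function r_{-k}
            sqErr(k) = (y(k) - polyval(p, x(k)))^2;   % (Yhat_k - Y_k)^2
        end
        epeHat(deg) = mean(sqErr);               % Eq. (4.43)
    end
    d = epeHat(1) - epeHat(2);                   % linear EPE minus quadratic EPE

The sign of d then plays the role of the model-selection criterion described above.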

Figure 4.8 Leave-one-out cross-validation example. (a) Data samples, each of size 100, generated according to a linear model Y = X (filled circles) or a quadratic model Y = X + 0.03X^2 (plus symbols). (b) Distribution of linear-model EPE minus quadratic-model EPE for data generated according to the linear model. (c) Distribution of linear-model EPE minus quadratic-model EPE for data generated according to the quadratic model.

Alternatively, suppose that our data is sampled from a quadratic model (with a rather small quadratic term), as in the plus symbols in Figure 4.8a. We do cross-validation estimates of the prediction error for both the linear and quadratic fit models, and again take the difference. The distribution of this difference, for 100 such experiments, is shown in Figure 4.8c. Choosing the quadratic model when the difference is greater than zero gets it right in 79 out of 100 cases.

Leave-one-out cross-validation is particularly suitable if the size of our dataset is not large. However, as N becomes large, the required computer resources may be prohibitive. We may back off from the extreme represented by leave-one-out cross-validation and obtain K-fold cross-validation. In this case, we divide the dataset into K disjoint subsets of essentially equal size m ≈ N/K. Leave-one-out cross-validation corresponds to K = N. In K-fold cross-validation, the model is trained on the dataset consisting of everything except the held-out sample of size m. Then the prediction error is obtained by applying the model to the held-out validation sample. This procedure is applied in turn for each of the K held-out samples, and the results for the squared prediction errors are averaged.
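The K-fold estimate can be coded along the same lines as the leave-one-out sketch above. This is again our own illustration; the random fold assignment and the choice K = 5 are assumptions:

    % K-fold cross-validation estimate of the EPE for a straight-line fit.
    % Assumes column vectors x and y of equal length N (illustrative names).
    K = 5;
    N = length(x);
    fold = mod(randperm(N), K) + 1;          % random fold label (1..K) per point
    sqErr = zeros(N, 1);
    for j = 1:K
        test  = (fold == j);                 % held-out sample of size m ~ N/K
        train = ~test;
        p = polyfit(x(train), y(train), 1);  % train on everything but fold j
        sqErr(test) = (y(test) - polyval(p, x(test))).^2;
    end
    epeHat = mean(sqErr);                    % averaged squared prediction error

Setting K = N recovers the leave-one-out procedure.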

We have a choice for K; how can we decide? To get some insight, we note that there are three relevant issues: computer time, bias, and variance in our estimated prediction error. The choice of K may require compromises among these. In particular, for large datasets, we may be driven to small K by limited computing resources. To understand the bias consideration, we introduce the learning curve, illustrated in Figure 4.9. This curve shows how the expected prediction error decreases as the training set size is increased. 1)

1) It is conventional in the classification problem to show one minus the error rate as the learning curve.

[Figure: two panels showing estimated expected prediction error versus dataset size.]

Figure 4.9 Computing the learning curve for cross-validation. (a) The dependence of the estimated prediction error for leave-one-out cross-validation on sample size, for several data samples. The data is generated according to the linear model as in Figure 4.8. (b) The learning curve, estimated by averaging together the results from 100 data samples. That is, 100 curves of the form illustrated in the left plot are averaged together.

We may use the learning curve to understand how different choices of folding can lead to different biases in estimating the EPE. Except for very small dataset sizes, if we use N-fold cross-validation, we use essentially the whole dataset for each evaluation of the EPE, and our estimator is close to unbiased (though, as we see in Figure 4.9a, it may have a large variance). However, consider what happens if we go to smaller K values. To make the point, consider a dataset of size 20 and K = 2. In this case, our estimated EPE will be based on a training sample of size 10. Looking at the learning curve, we see that this yields an overestimate of the EPE for N = 20. On the other hand, the larger K is, the more computation is required, so the lowest acceptable K is preferred. A much more subtle issue is the variance of the estimator (Breiman and Spector, 1992). As a rule of thumb, it is proposed that K values between 5 and 10 are typically approximately optimal. Performance on our regression example above degrades somewhat from leave-one-out to K = 5 or 10.

We will see in the next chapter another application of cross-validation, to the problem of optimizing smoothing in density estimation.

4.6 Resampling Weighted Observations

It is not uncommon to deal with datasets in which the samplings are accompanied by a nonnegative weight, w. For example, we may have samples corresponding to different luminosities that we wish to incorporate into an analysis. Hence, we describe our dataset as a set of pairs, \{(x_n, w_n); n = 1, ..., N\}, where w_n is the weight associated with the sampling x_n (x_n could also be a vector). The question arises: how can we apply resampling methods to such a dataset?