Bootstrap Methods II: Kernel Density Estimates

Bootstrap Methods II: Kernel Density Estimates
Mgr. Rudolf B. Blažek, Ph.D., prof. RNDr. Roman Kotecký, DrSc.
Department of Computer Systems, Department of Theoretical Informatics
Faculty of Information Technologies, Czech Technical University in Prague
© Rudolf Blažek & Roman Kotecký, 2011
Statistics for Informatics, MI-SPI, ZS 2011/12, Lecture 24
The European Social Fund, Prague & EU: We Invest in Your Future

Classical Confidence Intervals: Confidence Interval for the Mean μ (slide 3)
Approximate distribution from the CLT, exact for Gaussian $X_i$:
$$Z = \frac{\bar X_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$$
[Figure: standard normal density; central area $1-\alpha$ between $-z_{\alpha/2}$ and $z_{\alpha/2}$, tail area $\alpha/2$ on each side.]

Classical Confidence Intervals: Confidence Interval for the Mean μ (slide 4)
Exact distribution for Gaussian $X_i$:
$$T = \frac{\bar X_n - \mu}{s/\sqrt{n}} \sim t(n-1)$$
[Figure: Student t density; central area $1-\alpha$ between $-t_{\alpha/2,n-1}$ and $t_{\alpha/2,n-1}$, tail area $\alpha/2$ on each side.]

Classical Confidence Intervals: Confidence Interval for the Mean μ (slide 5)
$$P\big(|\bar X_n - \mu| < z_{\alpha/2}\,\sigma/\sqrt{n}\big) \approx 1 - \alpha, \qquad \bar X_n \approx N(\mu, \sigma^2/n)$$
[Figure: density of $\bar X_n$ centered at $\mu$; central area $1-\alpha$ between $\mu - z_{\alpha/2}\,\sigma/\sqrt{n}$ and $\mu + z_{\alpha/2}\,\sigma/\sqrt{n}$.]

Classical Confidence Intervals: Confidence Interval for the Mean μ (slide 6)
We have obtained
$$P\big(|\bar X_n - \mu| < z_{\alpha/2}\,\sigma/\sqrt{n}\big) \approx 1 - \alpha.$$
Therefore we can construct a confidence interval for μ:
$$P\big(\mu \in \bar X_n \pm z_{\alpha/2}\,\sigma/\sqrt{n}\big) \approx 1 - \alpha.$$
If σ is unknown, we estimate it by s and use the Student t-distribution with n−1 degrees of freedom:
$$P\big(\mu \in \bar X_n \pm t_{\alpha/2,n-1}\,s/\sqrt{n}\big) \approx 1 - \alpha.$$
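
Both intervals are straightforward to compute; a minimal sketch in Python (assuming numpy and scipy; the function name and the simulated data are my own, not from the slides):

```python
import numpy as np
from scipy import stats

def classical_ci(x, alpha=0.05, sigma=None):
    """Two-sided (1 - alpha) CI for the mean: z-interval if sigma is
    known, otherwise the Student t-interval with the sample s."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    if sigma is not None:                      # known sigma: z_{alpha/2}
        half = stats.norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)
    else:                                      # unknown sigma: t_{alpha/2, n-1}
        s = x.std(ddof=1)
        half = stats.t.ppf(1 - alpha / 2, df=n - 1) * s / np.sqrt(n)
    return xbar - half, xbar + half

x = np.random.default_rng(1).normal(loc=3.5, scale=1.7, size=50)
print(classical_ci(x))             # t-interval
print(classical_ci(x, sigma=1.7))  # z-interval
```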

Classical Confidence Intervals: Student-t CI for the Mean of a Die (slides 7–12)
[Figures: six repetitions of the same simulation. Each slide shows a histogram of x (rolls of a fair die, μ = 3.5) and a histogram of xbar, the average of 50 random values, with the resulting 95% Student-t confidence interval marked. About 1 in 20 such intervals misses μ = 3.5.]
The six simulated intervals: (2.939, 3.941), (3.055, 4.025), (3.024, 4.096), (2.750, 3.770), (3.426, 4.374), (3.806, 4.754). The last interval indeed misses μ = 3.5.
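
The coverage experiment on these slides can be reproduced directly; a sketch (die, n = 50, and the 95% level are from the slides; the seed and the number of repetitions are my choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
reps, misses = 1000, 0
for _ in range(reps):
    x = rng.integers(1, 7, size=50)            # 50 rolls of a fair die
    xbar, s = x.mean(), x.std(ddof=1)
    half = stats.t.ppf(0.975, df=49) * s / np.sqrt(50)
    if not (xbar - half < 3.5 < xbar + half):  # did the CI miss mu = 3.5?
        misses += 1
print(misses / reps)  # close to alpha = 0.05, i.e. about 1 in 20
```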

Bootstrap Methods (Resampling Techniques) (slide 13)
Statistics for Informatics, MI-SPI, ZS 2011/12, Lecture 23

Literature (slide 14)
Textbook: Jun Shao & Dongsheng Tu, The Jackknife and Bootstrap, Springer Series in Statistics, 1st ed., July 21, 1995. ISBN-10: 0387945156, ISBN-13: 978-0387945156.

Introduction: Classical Approach (slide 15)
Random sample → sample mean & standard deviation → confidence interval based on the Gaussian approximation.
Information loss: n values are reduced to 2 values.
[Figure: a sample on the real line, with the sample mean and the interval (mean − k1 · s.d., mean + k2 · s.d.) marked.]

Introduction: Central Limit Theorem (slide 16)
The Gaussian approximation needs a finite 2nd moment and a large n.
[Figure: the same sample on the real line.]

Introduction: Bootstrap Resampling (slide 17)
Resampling: Monte Carlo sampling from the histogram. It estimates the distribution with no information loss.
[Figure: the sample with its histogram.]
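
Sampling from the empirical distribution (the histogram) amounts to drawing the observed values with replacement; a minimal sketch (the exponential sample is just a hypothetical stand-in):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=100)                       # hypothetical original sample
x_star = rng.choice(x, size=len(x), replace=True)   # one bootstrap resample
# e.g. the bootstrap distribution of the sample mean:
means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(2000)]
```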

Bootstrap Applications (slide 18)
- Permutation bootstrap: leads to permutation tests (a sketch follows below); used to train change-point detection for network intrusions.
- Bootstrap in random processes: resampling of inter-arrival times improves test accuracy.
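
As a sketch of the first application, a minimal permutation test for a difference in means (the function name and data layout are mine; group labels are permuted rather than resampled):

```python
import numpy as np

def perm_test_mean_diff(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for H0: mean(a) == mean(b)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled = np.concatenate([a, b])
    observed = a.mean() - b.mean()
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                 # permute the group labels
        diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
        hits += abs(diff) >= abs(observed)  # at least as extreme as observed
    return hits / n_perm
```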

Bootstrap-t Confidence Intervals: The Bootstrap Method — Algorithm (slide 19)
Let $X_1, X_2, X_3, \ldots, X_n$ be i.i.d. (independent & identically distributed) random variables with a distribution function F. Assume that we want to estimate a parameter θ of F.
$\hat\theta$ ... a point estimator of the population parameter θ
$\hat\sigma_n^2$ ... an estimator of the variance of $\hat\theta$
The bootstrap-t method (Efron, 1982) is based on a studentized pivot
$$R_n = \frac{\hat\theta - \theta}{\hat\sigma_n}.$$
If the distribution of $R_n$ is unknown, we will use resampling.

Bootstrap-t Confidence Intervals: The Bootstrap Method — Example (slide 20)
For example, θ could be the mean μ of the distribution. The point estimator and its variance would then be
$$\bar X_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \qquad \operatorname{Var}\bar X_n = \sigma^2/n \quad \text{with } \operatorname{Var}X_i = \sigma^2.$$
The studentized pivotal quantity is
$$R_n = \frac{\bar X_n - \mu}{s/\sqrt{n}},$$
and the estimate of $\operatorname{Var}\bar X_n$ is $\hat\sigma_n^2 = s^2/n$ (where $s^2$ is the sample variance).

Bootstrap-t Confidence Intervals: The Bootstrap Method — Example (slide 21)
The classical confidence interval is based on the CLT:
$$\bar X_n \approx N(\mu, \sigma^2/n), \qquad Z = \frac{\bar X_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
With s as the estimator of σ,
$$R_n = \frac{\bar X_n - \mu}{s/\sqrt{n}} \sim \text{Student-}t(n-1) \quad \text{(at least approximately)}.$$

Bootstrap-t Confidence Intervals: Confidence Interval for the Mean μ (slide 22)
The classical confidence interval for μ is either of
$$P\big(\mu \in \bar X_n \pm z_{\alpha/2}\,\sigma/\sqrt{n}\big) \approx 1-\alpha, \qquad P\big(\mu \in \bar X_n \pm t_{\alpha/2,n-1}\,s/\sqrt{n}\big) \approx 1-\alpha.$$
The CI can be rewritten as
$$\big(\bar X_n - k_1\,\mathrm{SE}(\bar X_n),\; \bar X_n + k_2\,\mathrm{SE}(\bar X_n)\big),$$
where $\mathrm{SE}(\bar X_n) = \sqrt{\operatorname{Var}\bar X_n} = \sigma/\sqrt{n}$ is the standard error of $\bar X_n$.

Bootstrap-t Confidence Intervals: Confidence Interval for a Parameter θ (slide 23)
The CI for the mean μ,
$$\big(\bar X_n - k_1\,\mathrm{SE}(\bar X_n),\; \bar X_n + k_2\,\mathrm{SE}(\bar X_n)\big),$$
is based on
$$P\big(\bar X_n - k_1\,\mathrm{SE}(\bar X_n) \le \mu \le \bar X_n + k_2\,\mathrm{SE}(\bar X_n)\big)
= P\big(-k_1\,\mathrm{SE}(\bar X_n) \le \mu - \bar X_n \le k_2\,\mathrm{SE}(\bar X_n)\big)
= P\Big(-k_2 \le \frac{\bar X_n - \mu}{\mathrm{SE}(\bar X_n)} \le k_1\Big) \approx 1 - \alpha.$$

Bootstrap-t Confidence Intervals: Confidence Interval for a Parameter θ (slide 24)
The CI for the mean μ is based on
$$P\Big(-k_2 \le \frac{\bar X_n - \mu}{\mathrm{SE}(\bar X_n)} \le k_1\Big) \approx 1 - \alpha,$$
with
$$\frac{\bar X_n - \mu}{\sigma/\sqrt{n}} \sim N(0,1), \qquad R_n = \frac{\bar X_n - \mu}{s/\sqrt{n}} \sim \text{Student-}t(n-1) \quad \text{(at least approximately)}.$$
Here $\bar X_n$ is the point estimator of μ, and $\mathrm{SE}(\bar X_n) = \sqrt{\operatorname{Var}\bar X_n} = \sigma/\sqrt{n}$ can be estimated by $s/\sqrt{n}$.

Bootstrap-t Confidence Intervals: Confidence Interval for the Mean μ (slide 25)
The distribution is known using the CLT:
$$Z = \frac{\bar X_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$
[Figure: standard normal density; central area $1-\alpha$ between $-k_2 = -z_{\alpha/2}$ and $k_1 = z_{\alpha/2}$.]

Bootstrap-t Confidence Intervals: Confidence Interval for the Mean μ (slide 26)
The distribution is known using the CLT:
$$R_n = \frac{\bar X_n - \mu}{s/\sqrt{n}} \sim t(n-1).$$
[Figure: Student t density; central area $1-\alpha$ between $-k_2 = -t_{\alpha/2,n-1}$ and $k_1 = t_{\alpha/2,n-1}$.]
If the distribution of $R_n$ is unknown, we will use resampling.

Bootstrap-t Confidence Intervals: Confidence Interval for a Parameter θ (slide 27)
The CI for the mean μ, $\big(\bar X_n - k_1\,\mathrm{SE}(\bar X_n),\; \bar X_n + k_2\,\mathrm{SE}(\bar X_n)\big)$, is based on
$$P\Big(-k_2 \le \frac{\bar X_n - \mu}{\mathrm{SE}(\bar X_n)} \le k_1\Big) \approx 1 - \alpha.$$
The general form of a confidence interval for a parameter θ,
$$\big(\hat\theta - k_1\,\mathrm{SE}(\hat\theta),\; \hat\theta + k_2\,\mathrm{SE}(\hat\theta)\big),$$
will similarly be based on
$$P\Big(-k_2 \le \frac{\hat\theta - \theta}{\mathrm{SE}(\hat\theta)} \le k_1\Big) \approx 1 - \alpha.$$

Bootstrap-t Confidence Intervals: Confidence Interval for a Parameter θ (slide 28)
The CI for a parameter θ,
$$\big(\hat\theta - k_1\,\mathrm{SE}(\hat\theta),\; \hat\theta + k_2\,\mathrm{SE}(\hat\theta)\big),$$
is based on
$$P\Big(-k_2 \le \frac{\hat\theta - \theta}{\mathrm{SE}(\hat\theta)} \le k_1\Big) \approx 1 - \alpha,$$
where $\hat\theta$ is the point estimator of θ, $\mathrm{SE}(\hat\theta) = \sqrt{\operatorname{Var}\hat\theta}$ is the standard error of $\hat\theta$, and $k_1$, $k_2$ are selected so that the coverage probability is $1-\alpha$.
Steps:
1. The standard error is estimated from the data.
2. $k_1$, $k_2$ are estimated using resampling of the data.

Bootstrap-t Confidence Intervals: The Bootstrap Method — Algorithm (slide 29)
The bootstrap-t method (Efron, 1982) is based on a studentized pivot
$$R_n = \frac{\hat\theta_n - \theta}{\hat\sigma_n}.$$
If the distribution of $R_n$ is unknown, we will use resampling:
$X_1, X_2, X_3, \ldots, X_n$ is the original i.i.d. sample from the distribution function F. Assume that $\hat F$ is an estimator of the distribution function F (parametric or non-parametric). Let $X^*_1, X^*_2, X^*_3, \ldots, X^*_n$ be a new i.i.d. sample from $\hat F$.

Bootstrap-t Confidence Intervals: The Bootstrap Method — Algorithm (slide 30)
$X^*_1, X^*_2, X^*_3, \ldots, X^*_n$ is a new i.i.d. sample from the original data (i.e. resampling with replacement). The bootstrap version of the pivot is
$$R^*_n = \frac{\hat\theta^*_n - \hat\theta_n}{\hat\sigma^*_n}.$$
Resampling is repeated, and the $R^*_n$ are sorted by size. The smallest and largest $\alpha/2 \cdot 100\%$ of the values are discarded. These cut-off points are used as the quantiles in the CI
$$\big(\hat\theta_n - k_1\,\hat\sigma_n,\; \hat\theta_n + k_2\,\hat\sigma_n\big).$$
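
A sketch of this algorithm in Python, following the derivation above (the lower cut-off of the sorted $R^*_n$ plays the role of $-k_2$, the upper cut-off the role of $k_1$; the names `bootstrap_t_ci`, `theta_hat`, `se_hat` and the defaults are mine):

```python
import numpy as np

def bootstrap_t_ci(x, theta_hat, se_hat, alpha=0.05, B=2000, seed=0):
    """Bootstrap-t (1 - alpha) CI: theta_hat(x) is the point estimator,
    se_hat(x) its standard-error estimate on a sample."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    t_n, s_n = theta_hat(x), se_hat(x)
    r_star = np.empty(B)
    for b in range(B):
        xs = rng.choice(x, size=len(x), replace=True)    # resample with replacement
        r_star[b] = (theta_hat(xs) - t_n) / se_hat(xs)   # bootstrap pivot R*_n
    # cut off alpha/2 * 100% of the smallest and largest values
    lo = np.quantile(r_star, alpha / 2)        # lower cut-off: -k2
    hi = np.quantile(r_star, 1 - alpha / 2)    # upper cut-off:  k1
    # invert P(-k2 <= (theta_hat - theta)/sigma_hat <= k1) ~ 1 - alpha
    return t_n - hi * s_n, t_n - lo * s_n
```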

Bootstrap-t Confidence Intervals: The Bootstrap Method — Example (slide 31)
Let $X_1, X_2, X_3, \ldots, X_n$ be i.i.d. random variables from the log-normal distribution with parameters μ and σ; that is, the $\ln(X_i)$ are i.i.d. $\sim N(\mu, \sigma^2)$. The log-normal pdf is
$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\Big(-\frac{(\ln x - \mu)^2}{2\sigma^2}\Big), \qquad x > 0.$$
Goal: find a confidence interval for the median $\theta = e^{\mu}$.
Point estimator: $\hat\theta = e^{\hat\mu}$, with $\hat\mu$ the sample mean of the $\ln X_i$, and variance
$$\mathrm{SE}^2(\hat\theta) = \operatorname{Var}\hat\theta = \hat\sigma_n^2 = \big(e^{\sigma^2/n} - 1\big)\,e^{2\mu + \sigma^2/n}.$$
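
With a function like the `bootstrap_t_ci` sketch above, this example looks roughly as follows (the data are simulated; plugging the sample estimates of μ and σ² into the variance formula is my reading of the slide):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=2.0, size=30)    # true median e^mu = 1

def theta_hat(x):                                   # e^{mu-hat}, mu-hat = mean of ln X_i
    return np.exp(np.mean(np.log(x)))

def se_hat(x):                                      # plug-in sqrt of (e^{s^2/n}-1) e^{2 mu-hat + s^2/n}
    n = len(x)
    m = np.mean(np.log(x))
    v = np.var(np.log(x), ddof=1)
    return np.sqrt((np.exp(v / n) - 1.0) * np.exp(2.0 * m + v / n))

print(bootstrap_t_ci(x, theta_hat, se_hat))         # bootstrap-t 95% CI for the median
```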

Bootstrap-t Confidence Intervals: The Bootstrap Method (slides 32–36)
[Figures: slides 32–34 show a histogram of the pivot $R_n = (\hat\theta_n - \theta)/\hat\sigma_n$ with the CLT (normal) approximation overlaid; slides 35–36 show a histogram of the bootstrap pivot $R^*_n = (\hat\theta^*_n - \hat\theta_n)/\hat\sigma^*_n$ with the CLT approximation overlaid. Both histograms are strongly asymmetric (cf. the quantiles on slide 37), so the normal approximation fits poorly.]

Bootstrap-t Confidence Intervals: The Bootstrap Method (slide 37)
95% CI for $e^{\mu}$: $\big(\hat\theta - k_1\,\mathrm{SE}(\hat\theta),\; \hat\theta + k_2\,\mathrm{SE}(\hat\theta)\big)$ with
$\hat\theta = 0.707$, $\mathrm{SE}(\hat\theta) = 0.1002$, $-k_1 = -19.24$, $k_2 = 4.42$,
giving the 95% CI $(-1.22, 1.15)$.
[Figure: histogram of the bootstrap pivot with central area $1-\alpha = 0.95$ between the cut-offs $-k_1$ and $k_2$ and area $\alpha/2 = 0.025$ in each tail.]

Kernel Estimators (slide 38)

Kernel Estimators: Kernel Density Estimation — Algorithm (slide 39)
Let $X_1, X_2, X_3, \ldots, X_n$ be i.i.d. (independent & identically distributed) random variables with a density function f. A kernel density estimator (of the density f) is
$$\hat f_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\Big(\frac{x - x_i}{h}\Big),$$
where K is a kernel and h is a smoothing parameter (the bandwidth). A common choice of the kernel is the Gaussian density. Selecting the bandwidth h is a non-trivial task.
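
A direct transcription of this estimator with a Gaussian kernel (a sketch; the grid-based vectorization and the names are mine):

```python
import numpy as np

def kde(grid, data, h):
    """Evaluate f-hat_h on `grid` using the Gaussian kernel."""
    grid = np.asarray(grid, dtype=float)
    data = np.asarray(data, dtype=float)
    u = (grid[:, None] - data[None, :]) / h           # (x - x_i) / h for all pairs
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # standard normal density
    return K.sum(axis=1) / (len(data) * h)            # (1/(n h)) * sum_i K((x - x_i)/h)
```

For example, `kde(np.linspace(5, 15, 200), x, h=0.4458)` produces the kind of estimate shown on the slides below.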

Kernel Estimators: Kernel Density Estimation — Algorithm (slide 40)
Selection of the bandwidth h based on $L_2$ optimality: use the h that minimizes the mean integrated squared error
$$\mathrm{MISE}(h) = E \int_{-\infty}^{\infty} \big(\hat f_h(x) - f(x)\big)^2 \, dx.$$
Sometimes h is changed adaptively.
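
MISE depends on the unknown f, so practical rules approximate its minimizer from the data. One standard rule of thumb, not given on the slide, is Silverman's (it minimizes the asymptotic MISE under a Gaussian reference density); a sketch:

```python
import numpy as np

def silverman_bandwidth(data):
    """Silverman's rule of thumb: h = 0.9 * min(s, IQR/1.34) * n^(-1/5)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    s = data.std(ddof=1)
    iqr = np.subtract(*np.percentile(data, [75, 25]))  # interquartile range
    return 0.9 * min(s, iqr / 1.34) * n ** (-0.2)
```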

Kernel Estimators (slide 41)
[Figure: left, histogram of x; right, the kernel estimate of the same data. N = 3500, Bandwidth = 0.4458.]

Kernel Estimators (slide 42)
[Figure: two kernel estimates of the same data (N = 3500): bandwidth 0.4458 gives a smooth density, while bandwidth 0.05 undersmooths and produces a noisy, spiky estimate.]

Kernel Estimators (slide 43)
[Figure: three kernel estimates of a sample of N = 3 points: bandwidth 0.1 (isolated spikes at the observations), bandwidth 2 (heavily oversmoothed), and bandwidth 0.8087 (a compromise).]
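
The bandwidth effect on this slide can be reproduced with the `kde` sketch above (the three-point sample and the grid are my choice; the bandwidths are the slide's):

```python
import numpy as np

data = np.array([0.5, 2.0, 4.0])       # hypothetical 3-point sample (N = 3)
grid = np.linspace(-1.0, 6.0, 400)
for h in (0.1, 2.0, 0.8087):           # bandwidths from the slide
    f = kde(grid, data, h)             # kde(...) from the sketch above
    print(h, f.max())                  # smaller h -> taller, spikier estimate
```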