Bootstrap Methods II, Kernel Density Estimates
Mgr. Rudolf B. Blažek, Ph.D.; prof. RNDr. Roman Kotecký, DrSc.
Department of Computer Systems; Department of Theoretical Informatics
Faculty of Information Technologies, Czech Technical University in Prague
Rudolf Blažek & Roman Kotecký, 2011
Statistics for Informatics, MI-SPI, ZS 2011/12, Lecture 24
The European Social Fund, Prague & EU: We Invest in Your Future
Classical Confidence Intervals
Confidence Interval for the Mean μ
Approximate distribution from the CLT; exact for Gaussian X_i:
    Z = (X̄_n − μ) / (σ/√n) ~ N(0, 1)
[Figure: standard normal density; central area 1 − α between −z_{α/2} and z_{α/2}, with α/2 in each tail]
Classical Confidence Intervals
Confidence Interval for the Mean μ
Exact distribution for Gaussian X_i:
    T = (X̄_n − μ) / (s/√n) ~ t(n − 1)
[Figure: t(n − 1) density; central area 1 − α between −t_{α/2,n−1} and t_{α/2,n−1}, with α/2 in each tail]
Classical Confidence Intervals
Confidence Interval for the Mean μ
    P(|X̄_n − μ| < z_{α/2} σ/√n) ≈ 1 − α,    since X̄_n ≈ N(μ, σ²/n)
[Figure: density of X̄_n; central area 1 − α between μ − z_{α/2} σ/√n and μ + z_{α/2} σ/√n, with α/2 in each tail]
Classical Confidence Intervals
Confidence Interval for the Mean μ
We have obtained
    P(|X̄_n − μ| < z_{α/2} σ/√n) ≈ 1 − α.
Therefore we can construct a confidence interval for μ:
    P(μ ∈ X̄_n ± z_{α/2} σ/√n) ≈ 1 − α.
If σ is unknown, we estimate it by s and use the Student t-distribution with n − 1 degrees of freedom:
    P(μ ∈ X̄_n ± t_{α/2,n−1} s/√n) ≈ 1 − α.
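The z-based interval above can be sketched in a few lines of Python. This is only an illustration (the function name and the data vector are made up for the example); s replaces the unknown σ, so for small n the t quantile would give a slightly wider interval.

```python
from statistics import NormalDist, mean, stdev

def z_confidence_interval(xs, alpha=0.05):
    """Approximate (1 - alpha) CI for the mean, based on the CLT:
    xbar +/- z_{alpha/2} * s / sqrt(n)."""
    n = len(xs)
    xbar = mean(xs)
    s = stdev(xs)                              # sample standard deviation
    z = NormalDist().inv_cdf(1 - alpha / 2)    # z_{alpha/2}; ~1.96 for alpha = 0.05
    half = z * s / n ** 0.5
    return (xbar - half, xbar + half)

lo, hi = z_confidence_interval([4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 5.0])
print(round(lo, 3), round(hi, 3))
```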
Classical Confidence Intervals
Student-t CI for the Mean of a Die
[Figure, repeated over six simulation runs: histogram of die rolls x (uniform on 1..6, μ = 3.5) and histogram of xbar, the average of 50 random values, with the 95% Student-t CI of each run marked:
(2.94, 3.94), (3.06, 4.02), (3.02, 4.10), (2.75, 3.77), (3.43, 4.37), (3.81, 4.75).
The last interval misses μ = 3.5: with a 95% CI, about 1 in 20 intervals miss μ = 3.5.]
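The repeated-simulation picture above can be reproduced numerically: roll a die 50 times, form the CI, and count how often it covers μ = 3.5. A minimal sketch (using the normal quantile as a stand-in for t_{0.025,49}; the seed and trial count are arbitrary):

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(42)
z = NormalDist().inv_cdf(0.975)    # ~1.96; stand-in for the t quantile

def ci_mean(xs):
    """Approximate 95% CI for the mean (CLT-based)."""
    half = z * stdev(xs) / len(xs) ** 0.5
    return mean(xs) - half, mean(xs) + half

# Repeat: roll a die 50 times, build the CI, check whether it covers mu = 3.5.
hits = 0
trials = 2000
for _ in range(trials):
    rolls = [random.randint(1, 6) for _ in range(50)]
    lo, hi = ci_mean(rolls)
    hits += lo < 3.5 < hi

print(hits / trials)   # empirical coverage, close to 0.95
```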
Bootstrap Methods (Resampling Techniques)
Statistics for Informatics, MI-SPI, ZS 2011/12, Lecture 23
Literature
Textbook: Jun Shao & Dongsheng Tu, The Jackknife and Bootstrap, Springer Series in Statistics, 1st ed., Jul 21, 1995. ISBN-10: 0387945156, ISBN-13: 978-0387945156
Introduction
Classical Approach
Random Sample → Mean & Std. Deviation → Confidence Interval based on the Gaussian Approximation
Information loss: n values reduced to 2 values
[Figure: data points on the real line with the interval (Mean − k₁ · s.d., Mean + k₂ · s.d.) around the sample mean]
Introduction
Central Limit Theorem → Gaussian Approximation
Needs a finite 2nd moment
Needs large n
[Figure: data points on the real line with a fitted Gaussian density]
Introduction
Bootstrap Resampling
Resampling: Monte Carlo sampling from the histogram:
    estimates the distribution,
    no information loss.
[Figure: histogram of the data]
Bootstrap Applications
Permutation bootstrap
    Leads to permutation tests
    Used to train change-point detection for network intrusions
Bootstrap in random processes
    Resampling of inter-arrival times
    Improves test accuracy
Bootstrap-t Confidence Intervals
The Bootstrap Method: Algorithm
Let X1, X2, X3, ..., Xn be i.i.d. (independent & identically distributed) random variables with a distribution function F. Assume that we want to estimate a parameter θ of F.
    θ̂_n ... a point estimator of the population parameter θ
    σ̂²_n ... an estimator of the variance of θ̂_n
The bootstrap-t method (Efron, 1982) is based on the studentized pivot
    R_n = (θ̂_n − θ) / σ̂_n.
If the distribution of R_n is unknown, we will use resampling.
Bootstrap-t Confidence Intervals
The Bootstrap Method: Example
For example, θ could be the mean μ of the distribution. The point estimator and its variance would then be
    X̄_n = (1/n) Σ_{i=1}^n X_i,    Var X̄_n = σ²/n   with Var X_i = σ².
The studentized pivotal quantity is
    R_n = (X̄_n − μ) / (s/√n),
and the estimate of Var X̄_n is σ̂²_n = s²/n (where s² is the sample variance).
Bootstrap-t Confidence Intervals
The Bootstrap Method: Example
For example, θ could be the mean μ of the distribution. The point estimator and its variance would then be
    X̄_n = (1/n) Σ_{i=1}^n X_i,    Var X̄_n = σ²/n   with Var X_i = σ².
The classical confidence interval is based on the CLT:
    X̄_n ≈ N(μ, σ²/n),    Z = (X̄_n − μ)/(σ/√n) ~ N(0, 1),
and, with s as the estimator of σ,
    R_n = (X̄_n − μ)/(s/√n) ~ Student-t(n − 1)   (at least approximately).
Bootstrap-t Confidence Intervals
Confidence Interval for the Mean μ
The classical confidence interval for μ is either of
    P(μ ∈ X̄_n ± z_{α/2} σ/√n) ≈ 1 − α,
    P(μ ∈ X̄_n ± t_{α/2,n−1} s/√n) ≈ 1 − α.
The CI can be rewritten as
    (X̄_n − k₁ SE(X̄_n), X̄_n + k₂ SE(X̄_n)),
where SE(X̄_n) = √(Var X̄_n) = σ/√n is the standard error of X̄_n.
Bootstrap-t Confidence Intervals
Confidence Interval for a Parameter θ
The CI for the mean μ,
    (X̄_n − k₁ SE(X̄_n), X̄_n + k₂ SE(X̄_n)),
is based on
    P(X̄_n − k₁ SE(X̄_n) ≤ μ ≤ X̄_n + k₂ SE(X̄_n))
      = P(−k₁ SE(X̄_n) ≤ μ − X̄_n ≤ k₂ SE(X̄_n))
      = P(−k₂ ≤ (X̄_n − μ)/SE(X̄_n) ≤ k₁) ≈ 1 − α.
Bootstrap-t Confidence Intervals
Confidence Interval for a Parameter θ
The CI for the mean μ,
    (X̄_n − k₁ SE(X̄_n), X̄_n + k₂ SE(X̄_n)),
is based on
    P(−k₂ ≤ (X̄_n − μ)/SE(X̄_n) ≤ k₁) ≈ 1 − α,
where
    (X̄_n − μ)/(σ/√n) ~ N(0, 1),    R_n = (X̄_n − μ)/(s/√n) ~ Student-t(n − 1)
(at least approximately). Here X̄_n is the point estimator of μ, and SE(X̄_n) = √(Var X̄_n) = σ/√n can be estimated by s/√n.
Bootstrap-t Confidence Intervals
Confidence Interval for the Mean μ
The distribution is known using the CLT:
    Z = (X̄_n − μ)/(σ/√n) ~ N(0, 1)
[Figure: standard normal density with −k₂ = −z_{α/2} and k₁ = z_{α/2}; central area 1 − α, α/2 in each tail]
Bootstrap-t Confidence Intervals
Confidence Interval for the Mean μ
The distribution is known using the CLT:
    R_n = (X̄_n − μ)/(s/√n) ~ t(n − 1)
[Figure: t(n − 1) density with −k₂ = −t_{α/2,n−1} and k₁ = t_{α/2,n−1}; central area 1 − α, α/2 in each tail]
If the distribution of R_n is unknown, we will use resampling.
Bootstrap-t Confidence Intervals
Confidence Interval for a Parameter θ
The CI for the mean μ, (X̄_n − k₁ SE(X̄_n), X̄_n + k₂ SE(X̄_n)), is based on
    P(−k₂ ≤ (X̄_n − μ)/SE(X̄_n) ≤ k₁) ≈ 1 − α.
The general form of a confidence interval for a parameter θ,
    (θ̂ − k₁ SE(θ̂), θ̂ + k₂ SE(θ̂)),
will similarly be based on
    P(−k₂ ≤ (θ̂ − θ)/SE(θ̂) ≤ k₁) ≈ 1 − α.
Bootstrap-t Confidence Intervals
Confidence Interval for a Parameter θ
The CI for a parameter θ,
    (θ̂ − k₁ SE(θ̂), θ̂ + k₂ SE(θ̂)),
is based on
    P(−k₂ ≤ (θ̂ − θ)/SE(θ̂) ≤ k₁) ≈ 1 − α,
where θ̂ is the point estimator of θ, SE(θ̂) = √(Var θ̂) is the standard error of θ̂, and k₁, k₂ are selected so that the coverage probability is 1 − α.
Steps:
    1. The standard error is estimated from the data.
    2. k₁, k₂ are estimated using resampling of the data.
Bootstrap-t Confidence Intervals
The Bootstrap Method: Algorithm
The bootstrap-t method (Efron, 1982) is based on the studentized pivot
    R_n = (θ̂_n − θ) / σ̂_n.
If the distribution of R_n is unknown, we will use resampling:
    X1, X2, X3, ..., Xn is the original i.i.d. sample from a distribution function F.
    Assume that F̂ is an estimator of the distribution function F (parametric or non-parametric).
    Let X*1, X*2, X*3, ..., X*n be a new i.i.d. sample from F̂.
Bootstrap-t Confidence Intervals
The Bootstrap Method: Algorithm
X*1, X*2, X*3, ..., X*n is a new i.i.d. sample from the original data (i.e. resampling with replacement). For each resample compute
    R*_n = (θ̂*_n − θ̂_n) / σ̂*_n
(the analogue of R_n = (θ̂_n − θ)/σ̂_n, with the resample playing the role of the data and θ̂_n the role of θ).
Resampling is repeated, and the R*_n are sorted by size. α/2 · 100% of the smallest and largest values are discarded. These cut-off points are used as the quantiles in the CI
    (θ̂_n − k₁ σ̂_n, θ̂_n + k₂ σ̂_n).
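The algorithm above can be sketched for the case θ = μ. This is an illustrative implementation, not the lecture's own code; the function name, data, seed, and resample count are made up. The lower cut-off of R* plays the role of −k₂ and the upper cut-off the role of k₁.

```python
import random
from statistics import mean, stdev

def bootstrap_t_ci(xs, alpha=0.05, n_boot=2000, seed=0):
    """Bootstrap-t CI for the mean: resample with replacement, compute
    R* = (xbar* - xbar) / se*, and use the empirical alpha/2 and 1 - alpha/2
    cut-offs of R* in place of the Student-t quantiles."""
    rng = random.Random(seed)
    n = len(xs)
    xbar = mean(xs)
    se = stdev(xs) / n ** 0.5                  # sigma_hat_n, estimated SE of xbar
    r_star = []
    for _ in range(n_boot):
        ys = rng.choices(xs, k=n)              # resample with replacement
        se_star = stdev(ys) / n ** 0.5
        if se_star > 0:                        # skip degenerate resamples
            r_star.append((mean(ys) - xbar) / se_star)
    r_star.sort()
    lo_q = r_star[int(alpha / 2 * len(r_star))]             # lower cut-off (-k2)
    hi_q = r_star[int((1 - alpha / 2) * len(r_star)) - 1]   # upper cut-off (k1)
    # P(lo_q <= (theta_hat - theta)/se <= hi_q) ~ 1 - alpha
    return (xbar - hi_q * se, xbar - lo_q * se)

data = [2.1, 0.4, 7.9, 1.3, 3.2, 0.8, 5.5, 2.7, 1.1, 4.6]
print(bootstrap_t_ci(data))
```

Note that the resulting interval need not be symmetric around the sample mean, which is exactly the point of bootstrap-t for skewed data.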
Bootstrap-t Confidence Intervals
The Bootstrap Method: Example
Let X1, X2, X3, ..., Xn be i.i.d. random variables from the log-normal distribution with parameters μ and σ; that is, ln(Xi) are i.i.d. ~ N(μ, σ²). The log-normal pdf is
    f(x) = 1/(x σ √(2π)) · exp(−(ln x − μ)²/(2σ²)),    x > 0.
Goal: find a confidence interval for the median e^μ.
Point estimator:
    θ̂ = e^{Ȳ_n},   where Ȳ_n is the sample mean of the ln(Xi),
with variance
    SE²(θ̂) = Var θ̂ = σ²_n = (e^{σ²/n} − 1) e^{2μ + σ²/n}.
Bootstrap-t Confidence Intervals
The Bootstrap Method
[Figure, built up over several slides: histogram of the bootstrap replicates R*_n = (θ̂*_n − θ̂_n)/σ̂*_n, overlaid with the CLT (normal) approximation of R_n = (θ̂_n − θ)/σ̂_n. The histogram is strongly skewed, with values roughly from −30 to 20, so the symmetric CLT approximation fits it poorly.]
Bootstrap-t Confidence Intervals
The Bootstrap Method
95% CI for e^μ: (θ̂ − k₁ SE(θ̂), θ̂ + k₂ SE(θ̂))
    θ̂ = 0.707,  SE(θ̂) = 0.1002,  −k₁ = −19.24,  k₂ = 4.42
    95% CI: (−1.22, 1.15)
[Figure: histogram of R*_n with the cut-offs −k₁ and k₂ marked; central area 1 − α = 0.95, α/2 = 0.025 in each tail]
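The log-normal median example can be reproduced in the same spirit. Everything below is a sketch under assumed parameters (μ = 0, σ = 1, n = 30, seed and replicate count arbitrary); the estimator and the SE formula follow the slides, with the sample quantities plugged in for μ and σ. The asymmetry of the R* cut-offs mirrors the skewed histogram above.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(1)
n = 30
# Hypothetical data: ln(X_i) ~ N(mu = 0, sigma = 1), so the true median is e^0 = 1.
logs = [rng.gauss(0.0, 1.0) for _ in range(n)]

def estimate(ys):
    """theta_hat = exp(mean of the logs), with the slide's variance formula
    sigma_n^2 = (e^{s^2/n} - 1) e^{2 ybar + s^2/n} (plug-in version)."""
    m = len(ys)
    ybar, s2 = mean(ys), stdev(ys) ** 2
    theta = math.exp(ybar)
    se = math.sqrt((math.exp(s2 / m) - 1) * math.exp(2 * ybar + s2 / m))
    return theta, se

theta_hat, se_hat = estimate(logs)
r_star = sorted((t - theta_hat) / s
                for t, s in (estimate(rng.choices(logs, k=n)) for _ in range(2000)))
lo_q, hi_q = r_star[50], r_star[1949]             # 2.5% and 97.5% cut-offs of R*
ci_lo, ci_hi = theta_hat - hi_q * se_hat, theta_hat - lo_q * se_hat
print((lo_q, hi_q))                               # typically asymmetric around 0
print((ci_lo, ci_hi))                             # bootstrap-t CI for the median
```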
Kernel Estimators
Kernel Estimators
Algorithm (kernel density estimate)
Let X1, X2, X3, ..., Xn be i.i.d. (independent & identically distributed) random variables with a density function f. A kernel density estimator (of the density f) is
    f̂_h(x) = (1/n) Σ_{i=1}^n K_h(x − x_i) = (1/(nh)) Σ_{i=1}^n K((x − x_i)/h),
where K is a kernel and h is a smoothing parameter.
A common choice of the kernel is the Gaussian density. The selection of the bandwidth h is a non-trivial task.
Kernel Estimators
Algorithm
Selection of the bandwidth h based on L₂ optimality: use the h that minimizes the mean integrated squared error
    MISE(h) = E ∫_{−∞}^{∞} (f̂_h(x) − f(x))² dx.
Sometimes h is changed adaptively.
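The estimator and one practical bandwidth choice can be sketched as follows. The Gaussian-kernel estimator is exactly the formula above; the rule-of-thumb bandwidth is an assumption added for illustration (it is R's bw.nrd0 default, which approximately minimizes MISE for roughly Gaussian data, and is not derived on the slides), and the data vector is made up.

```python
import math
from statistics import stdev

def gaussian_kernel(u):
    """Standard normal density, a common choice of the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """f_hat_h(x) = (1/(n h)) * sum_i K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

def rule_of_thumb_bandwidth(data):
    """h = 0.9 * min(s, IQR/1.34) * n^(-1/5)  (R's bw.nrd0; an assumed
    plug-in choice, using a crude quartile estimate)."""
    n = len(data)
    xs = sorted(data)
    iqr = xs[(3 * n) // 4] - xs[n // 4]
    return 0.9 * min(stdev(data), iqr / 1.34) * n ** (-0.2)

data = [4.2, 5.1, 4.8, 6.0, 5.5, 4.9, 5.2, 10.1, 9.8, 10.4]
h = rule_of_thumb_bandwidth(data)
print(round(h, 3), round(kde(5.0, data, h), 3))
```

Because each kernel is a density, f̂_h integrates to 1 for any h; only the smoothness of the estimate changes with h.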
Kernel Estimators
[Figure: histogram of x next to a kernel estimate of the same data; N = 3500, Bandwidth = 0.4458]
Kernel Estimators
[Figure: two kernel estimates of the same data (N = 3500), with Bandwidth = 0.4458 (smooth) and Bandwidth = 0.05 (undersmoothed, spiky)]
Kernel Estimators
[Figure: three kernel estimates of the same N = 3 data points, with Bandwidth = 0.1 (spiky), Bandwidth = 2 (oversmoothed), and Bandwidth = 0.8087 (default choice)]