Lecture 6: Exploratory data analysis; Point and interval estimation


Lecture 6: Exploratory data analysis; Point and interval estimation
Dr. Wim P. Krijnen, Lecturer Statistics
University of Groningen, Faculty of Mathematics and Natural Sciences
Johann Bernoulli Institute for Mathematics and Computer Science
October 26, 2010

Lecture overview

Exploratory data analysis
- Numerical summaries: mean, median, measures of association between variables (Pearson product-moment correlation, Spearman's rank correlation coefficient)
- Brief overview of exploratory data visualizations: histogram and density plot, quantile-quantile plot, empirical cumulative distribution function, box-and-whiskers plot
Point estimation by Maximum Likelihood
Interval estimation

Exploratory data analysis

- generates hypotheses (inductive)
- let the data speak
- think well about the data you do (and don't) have

Assumptions:
1. random samples
2. finite variance
3. population density unchanged under sampling

Numerical summaries

A sample $X_1, \dots, X_n$ (random variables) has realizations $x_1, \dots, x_n$ ($\in \mathbb{R}$).

The sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is a random variable: it has the distribution of sample means of size n (from a possibly infinite population). Its realization $\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is a fixed number without a distribution. Always $E[\bar{X}] = \mu$.

statistic    population    sample estimator (rv)                                 sample estimate (fixed)
mean         $\mu$         $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$             $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
variance     $\sigma^2$    $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2$    $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$

Determinations of copper in wholemeal flour

chem: 24 determinations of copper in wholemeal flour (ppm). A large study suggests µ = 3.68 (Venables & Ripley, 2002).

Median = middle value of the data (50% above, 50% below)
Trimmed mean = mean leaving out a percentage of the extreme data

> library(MASS)
> c(mean(chem), median(chem))
[1] 4.280417 3.385000
> x <- sort(chem, decreasing=TRUE, index.return=TRUE)
> x$x
 [1] 28.95  5.28  3.77  3.70  3.70  3.70  3.70  3.60 ...
[13]  3.37  3.10  3.03  3.03  2.90  2.80  2.70  2.50 ...
> plot(x$x)
> mean(chem, trim = 1/24)   # excludes the smallest and the largest value
[1] 3.253636
> mean(x$x[2:23])   # = 3.253636, the same trimmed mean computed directly

[Figure: plot of the sorted chem data; the single extreme value 28.95 stands apart from the bulk.]

Measures of spread of data

Range = largest minus smallest value
Sample variance: $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$
Interquartile range (IQR) = upper quartile minus lower quartile; the lower/upper quartile has 25% / 75% of the values below it

> range(chem)
[1]  2.20 28.95
> c(var(chem), var(x$x[2:23]))
[1] 28.0624042  0.4440338   # great difference!
> summary(chem)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.200   2.775   3.385   4.280   3.700  28.950
> IQR(chem)
[1] 0.925
> quantile(chem, 3/4) - quantile(chem, 1/4)
[1] 0.925

Chebyshev and empirical rules

Chebyshev's inequality: $P(|X - \mu| < k\sigma) \ge 1 - \frac{1}{k^2}$

The probability is at least $1 - 1/k^2$ that X takes a value within k standard deviations of the mean. It is general, but often imprecise.

Empirical rule for approximately normal data:
- 68% of observations within 1 standard deviation of the mean
- 95% of observations within 2 sd of the mean
- 99.7% of observations within 3 sd of the mean

> pnorm(1) - pnorm(-1)
[1] 0.6826895
> pnorm(2) - pnorm(-2)
[1] 0.9544997
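As a quick numerical illustration (added here, not in the original slides), the following sketch compares Chebyshev's lower bound with the exact normal probability for k = 1, 2, 3; it shows how conservative the bound is for normal data:

k <- 1:3
round(rbind(chebyshev.bound = 1 - 1/k^2,
            normal.exact    = pnorm(k) - pnorm(-k)), 4)
# chebyshev.bound 0.0000 0.7500 0.8889
# normal.exact    0.6827 0.9545 0.9973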

Measures of association between variables

Correlation coefficient (Pearson): measure of the strength of a linear relationship

$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{V[X]}\sqrt{V[Y]}} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sqrt{E[(X-\mu_X)^2]}\sqrt{E[(Y-\mu_Y)^2]}}$

Sample estimate:

$\hat{\rho} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$

Properties:
- $-1 \le \rho \le 1$: bounded measure of linear relationship
- if $\rho = \pm 1$ there are a and b such that $Y = aX + b$
- $\rho > 0$: X and Y increase/decrease together
- $\rho$ is not robust against outliers
- under normality $\rho$ measures stochastic dependence; $\rho = 0$ implies independence
- $\rho$ is symmetric: $\rho(X, Y) = \rho(Y, X)$
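As a small check (own example data, added for illustration), the sample formula can be verified against R's cor():

x <- c(1.2, 2.4, 3.1, 4.8, 5.0)   # hypothetical data
y <- c(2.0, 2.9, 3.8, 5.1, 4.7)
num <- sum((x - mean(x)) * (y - mean(y)))
den <- sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2))
c(by.hand = num/den, cor = cor(x, y))   # the two values agree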

Teaching demonstrations

Interactive graphical visualizations of the correlation coefficient. Minimize other screens and interactively use the Tk slider:

library(TeachingDemos)
run.cor2.examp(n=500, wait=FALSE)

Sensitivity to outliers: put a few points in a small circle, then add one far away:

put.points.demo()

Conclusion: an extreme outlier can have a large influence on a (non-robust) statistic

Spearman's rank correlation coefficient

In case of outliers: use the rank correlation coefficient

$\hat{\rho}_S = \frac{12}{n(n-1)(n+1)} \sum_{i=1}^{n} \left(\mathrm{rank}(x_i) - \frac{n+1}{2}\right)\left(\mathrm{rank}(y_i) - \frac{n+1}{2}\right)$

Example: Is there a correlation between Hunter's L measure of lightness (x) and the consumer panel scores, averaged over 80 scores, (y) for 9 lots of canned tuna?

> x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
> y <- c( 2.6,  3.1,  2.5,  5.0,  3.6,  4.0,  5.2,  2.8,  3.8)

Assessment of tuna quality (Hollander & Wolfe, 1973)

> n <- length(x)
> sumr <- sum((rank(x)-(n+1)/2)*(rank(y)-(n+1)/2))
> (rhohat <- 12 * sumr / (n*(n-1)*(n+1)))
[1] 0.6   # value of Spearman's rho
> cor.test(x, y, method = "spearman")

        Spearman's rank correlation rho

data:  x and y
S = 48, p-value = 0.0968
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.6

$\hat{\rho}_S$ has an asymptotically normal distribution (CLT), which gives the p-value.

Conclusion: H0: ρ = 0 is not rejected

Comparing Pearson's p.m.c.c. with Spearman's r.c.c.

> set.seed(110)
> x <- rnorm(15); y <- rnorm(15)   # rho = 0
> x[16] <- 10; y[16] <- 10
> c(cor.test(x,y,method = "pearson")$estimate,
+   cor.test(x,y,method = "spearman")$estimate)
      cor       rho
0.8768005 0.3676471

The Spearman rank correlation coefficient is more robust against outliers than the Pearson correlation coefficient. In case of suspicion, check for differences by computation or by plotting; here the assumption of normality does not hold.

Basic visualizations of univariate data sets

Histogram: estimates the density by presenting (relative) frequencies in consecutive intervals (bins) as heights of bars (hist)
Density plot: smooth graph representing estimated proportions per bin (plot(density(x)))
Quantile-quantile plot: plots the quantiles of the first distribution (x-coordinate) against the same quantiles of the second (theoretical) distribution; all points on the y = x line imply a perfect match (qqplot)
Empirical cumulative distribution function: step function Fn that jumps i/n at observation values, where i is the number of tied observations at that value (plot(ecdf(x)))
Box-and-whiskers plot: box between Q1 and Q3, with a line segment for the median Q2, and whiskers for the minimum and maximum (boxplot)

R code for basic visualizations

par(mfrow = c(2, 2))
x <- rnorm(100)
hist(x, freq=FALSE)
qqnorm(x); qqline(x)
plot(ecdf(x))
boxplot(x)
par(mfrow = c(1, 1))

Illustrations of the box-and-whiskers plot

Five-number summary of the data: minimum, first quartile Q1, median Q2, third quartile Q3, maximum. Box between Q1 and Q3; whiskers from the minimum to Q1 and from Q3 to the maximum.

Example: pulse measures 62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80.
min = 62, Q1 = 69, median = 74, Q3 = 77, max = 80
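A minimal sketch (added for illustration) computing the five-number summary of the pulse data in R; fivenum() reproduces the values above:

pulse <- c(62,64,68,70,70,74,74,76,76,78,78,80)
fivenum(pulse)   # 62 69 74 77 80
boxplot(pulse)   # box from Q1 to Q3, line at the median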

Outlier

Description of an outlier: a data point far away from the bulk of the data. How far?

outlier < Q1 − 1.5 · IQR   or   outlier > Q3 + 1.5 · IQR

Example: 12 pulse measures 62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80.
Q1 = 69, Q3 = 77, IQR = 77 − 69 = 8
outlier < 69 − 1.5 · 8 = 57
outlier > 77 + 1.5 · 8 = 89

Conclusion: there are no outliers

Example of an outlier

Example: radish growth in mm after 3 days: 3, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21
median = (9 + 10)/2 = 9.5
Q1 = 7, Q3 = 10, IQR = 10 − 7 = 3
outlier < Q1 − 1.5 · IQR = 7 − 1.5 · 3 = 2.5
outlier > Q3 + 1.5 · IQR = 10 + 1.5 · 3 = 14.5
The outliers 20 and 21 are plotted as small circles; see the sketch below.

Remark: there are statistical tests for outliers (see the package outliers)
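The fences can be checked in R; a small sketch (added for illustration) using the radish data:

growth <- c(3,5,5,7,7,8,9,10,10,10,10,14,20,21)
q1 <- unname(quantile(growth, 1/4)); q3 <- unname(quantile(growth, 3/4))
iqr <- q3 - q1                                   # equals IQR(growth) = 3
c(lower = q1 - 1.5*iqr, upper = q3 + 1.5*iqr)    # 2.5 and 14.5
growth[growth < 2.5 | growth > 14.5]             # 20 21
boxplot(growth)                                  # 20 and 21 appear as circles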

Series of box plots

Use a factor of m groups to produce m box plots:

> data(PlantGrowth)
> boxplot(PlantGrowth$weight ~ PlantGrowth$group)

Point and interval estimation: Notation

Sample $X_1, \dots, X_n$ (r.v.) with realizations $x_1, \dots, x_n$

Parameter            Population   Estimator      Sample statistic   Sample estimate
type of variable     fixed        random         random             fixed
Mean                 $\mu$        $\hat\mu$      $\bar{X}$          $\bar{x}$
Variance             $\sigma^2$   $\hat\sigma^2$ $S^2$              $s^2$
Standard deviation   $\sigma$     $\hat\sigma$   $S$                $s$
Proportion           $\pi$        $\hat\pi$      $p$                $p$
Intensity            $\lambda$    $\hat\lambda$  $l$                $l$

$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i, \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2, \quad s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \quad S = \sqrt{S^2}, \quad s = \sqrt{s^2}$

$p := \frac{\text{number of successes in the sample}}{\text{sample size}} = \frac{n_S}{n}, \qquad l := \frac{\text{number of counts in the sample}}{\text{sample size}} = \frac{n_C}{n}$

Maximum likelihood estimation (optional)

n observations $x_1, \dots, x_n$ from $X_1, \dots, X_n$ iid rv

Likelihood of the data given model parameters θ:

$L(\theta \mid x) = \prod_{i=1}^{n} P(X_i = x_i \mid \theta)$

The log-likelihood equals

$\ell(\theta \mid x) = \log L(\theta \mid x) = \sum_{i=1}^{n} \log P(X = x_i \mid \theta)$

$\hat{\theta}$ is the Maximum Likelihood Estimator (MLE) of θ if it maximizes the log-likelihood. $\hat{\theta}$ is a statistic: a function of $X_1, \dots, X_n$, hence a random variable.

If the sample size n is large enough, then $\ell$ has a maximum. If $\ell$ is differentiable, try to solve

$\frac{\partial}{\partial \theta_i} \ell(\theta \mid x) = 0, \quad i = 1, \dots, m$

or maximize $\ell$ numerically (mainly Newton-type algorithms).

Maximum likelihood estimation (optional)

$I(\theta) = E\left[\frac{\partial}{\partial\theta} \log f(X)\right]^2 = \int \frac{[f'(x)]^2}{f(x)}\, dx$

denotes the information number (matrix): the amount of information about θ contained in X. Asymptotically,

$\sqrt{n}(\hat{\theta} - \theta) \to N\left(0, \frac{1}{I(\theta)}\right)$

Example: the MLE of the Poisson intensity parameter is $\hat{\lambda} = \bar{X}$

$P(X = x \mid \lambda) = f(x) = \frac{\lambda^x}{x!} e^{-\lambda}$

$\log f(x) = x \log\lambda - \lambda - \log x!$

$\frac{\partial}{\partial\lambda} \log f(x) = \frac{x}{\lambda} - 1 = \frac{x - \lambda}{\lambda}$

$I(\lambda) = E\left[\frac{X - \lambda}{\lambda}\right]^2 = \frac{E(X-\lambda)^2}{\lambda^2} = \frac{1}{\lambda}$

$\sqrt{n}(\hat{\lambda} - \lambda) \to N(0, \lambda)$
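The result $\hat\lambda = \bar{X}$ can be checked numerically; a sketch (simulated data, added for illustration) maximizing the Poisson log-likelihood with optimize():

set.seed(1)
x <- rpois(50, lambda = 4)
negloglik <- function(lambda) -sum(dpois(x, lambda, log = TRUE))
optimize(negloglik, interval = c(0.01, 20))$minimum   # numerical MLE
mean(x)                                               # analytical MLE, same value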

Parameter estimation for the normal distribution (optional)

$L(\theta \mid x) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right\} = \frac{1}{(\sigma\sqrt{2\pi})^n} \exp\left\{-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{x_i - \mu}{\sigma}\right)^2\right\}$

$\ell(\theta \mid x) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$

$\frac{\partial}{\partial\mu}\ell(\theta \mid x) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$

$\frac{\partial}{\partial\sigma^2}\ell(\theta \mid x) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$

From $\frac{\partial}{\partial\mu}\ell(\theta \mid x) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$:

$\sum_{i=1}^{n} x_i - n\mu = 0 \quad\Rightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x} \quad\text{(as rv: } \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}\text{)}$

From $\frac{\partial}{\partial\sigma^2}\ell(\theta \mid x) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0$:

$\frac{n}{2\sigma^2} = \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 \quad\Rightarrow\quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \hat{\mu})^2 = \frac{n-1}{n} S^2$
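A one-line check (simulated data, added for illustration) that the MLE of σ² equals (n − 1)/n times the sample variance:

set.seed(2)
x <- rnorm(20, mean = 5, sd = 2); n <- length(x)
c(mle = mean((x - mean(x))^2), scaled.s2 = (n-1)/n * var(x))   # identical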

Desirable properties of estimators

- minimal mean squared error (MSE): $E[\hat{\theta} - \theta]^2 = V[\hat{\theta}] + (E[\hat{\theta}] - \theta)^2$
- no bias: $E[\hat{\theta}] = \theta$ (no systematic error)
- minimal variance $V[\hat{\theta}] = E(\hat{\theta} - E[\hat{\theta}])^2$
- $\hat{\theta}_1$ is more precise (efficient) than $\hat{\theta}_2$ if $V[\hat{\theta}_1] < V[\hat{\theta}_2]$
- $\hat{\theta}$ is efficient if $V[\hat{\theta}]$ is the smallest possible
- $\hat{\theta}$ is consistent if $V[\hat{\theta}] \to 0$ as $n \to \infty$ (the LLN holds!)

The MLE may be slightly biased, but it is consistent and efficient.
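These properties can be explored by simulation. A sketch (added for illustration) estimating the MSE of the sample mean and the sample median as estimators of a normal mean; the mean is the more efficient of the two:

set.seed(3)
reps <- 10000; n <- 25; mu <- 0
est.mean <- replicate(reps, mean(rnorm(n, mu)))
est.med  <- replicate(reps, median(rnorm(n, mu)))
c(mse.mean   = mean((est.mean - mu)^2),   # close to 1/n = 0.04
  mse.median = mean((est.med - mu)^2))    # close to pi/(2n), about 0.063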

Confidence interval on the mean, σ known

Z is normally distributed with mean 0 and variance 1. The probability that Z takes values between $z_{\alpha/2}$ and $z_{1-\alpha/2}$ is

$P\{z_{\alpha/2} \le Z \le z_{1-\alpha/2}\} = 1 - \alpha$

This is the basis of confidence intervals! Remember $\Phi(z \mid 0, 1) = P(Z \le z)$ and $z_{\alpha/2} = \Phi^{-1}(\alpha/2)$ = qnorm(α/2, 0, 1).

If α = 0.05, then
$z_{0.025} = \Phi^{-1}(0.025)$ = qnorm(0.025) = −1.96
$z_{0.975} = \Phi^{-1}(0.975)$ = qnorm(0.975) = 1.96
$P\{z_{\alpha/2} \le Z \le z_{1-\alpha/2}\} = P\{-1.96 \le Z \le 1.96\} = 0.95$

Confidence interval for µ

$X_1, \dots, X_n$ iid rv from a normal population with mean µ and variance σ². $E[\hat{\mu}] = E[\bar{X}] = \mu$ and $V[\hat{\mu}] = V[\bar{X}] = \sigma^2/n$, so that

$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

is normally distributed with mean 0 and variance 1. Hence

$P\left\{z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{1-\alpha/2}\right\} = 1 - \alpha$

or, equivalently, after some algebra,

$P\left\{\bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right\} = 1 - \alpha$

The interval $\bar{X} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$ contains µ in (1 − α) · 100% of repeated samples of size n.

Algebra of rewriting the interval

$P\left\{z_{\alpha/2} \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{1-\alpha/2}\right\} = 1 - \alpha$

$P\left\{z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \bar{X} - \mu \le z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right\} = 1 - \alpha$

$P\left\{z_{\alpha/2}\frac{\sigma}{\sqrt{n}} - \bar{X} \le -\mu \le z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} - \bar{X}\right\} = 1 - \alpha$

$P\left\{\bar{X} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right\} = 1 - \alpha$

$P\left\{\bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right\} = 1 - \alpha$

using that $z_{\alpha/2} = -z_{1-\alpha/2}$.

CI notation in the literature

$P\left\{\bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right\} = 1 - \alpha$

The $1-\alpha$ confidence interval I is equal to

$\left[\bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{X} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right] = \left[\hat{\mu} + \Phi^{-1}(\alpha/2 \mid 0, 1)\frac{\sigma}{\sqrt{n}},\; \hat{\mu} + \Phi^{-1}(1-\alpha/2 \mid 0, 1)\frac{\sigma}{\sqrt{n}}\right] = \left[\Phi^{-1}\left(\alpha/2 \,\Big|\, \hat{\mu}, \frac{\sigma}{\sqrt{n}}\right),\; \Phi^{-1}\left(1-\alpha/2 \,\Big|\, \hat{\mu}, \frac{\sigma}{\sqrt{n}}\right)\right]$

This confidence interval is denoted by $I_{1-\alpha}(X \mid \mu)$. The unknown population mean µ is estimated by $\hat{\mu}$ on the basis of the data $x_1, \dots, x_n$.

Three confidence intervals

True interval:
$I_{1-\alpha}(X \mid \mu, \sigma^2) = \left[\mu + z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \mu + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right] = \left[\Phi^{-1}\left(\alpha/2 \,\Big|\, \mu, \frac{\sigma}{\sqrt{n}}\right),\; \Phi^{-1}\left(1-\alpha/2 \,\Big|\, \mu, \frac{\sigma}{\sqrt{n}}\right)\right]$

Estimated interval, σ known:
$I_{1-\alpha}(X \mid \hat{\mu}, \sigma^2) = \left[\hat{\mu} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \hat{\mu} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right] = \left[\Phi^{-1}\left(\alpha/2 \,\Big|\, \hat{\mu}, \frac{\sigma}{\sqrt{n}}\right),\; \Phi^{-1}\left(1-\alpha/2 \,\Big|\, \hat{\mu}, \frac{\sigma}{\sqrt{n}}\right)\right]$

Estimated interval, σ unknown (most relevant!):
$I_{1-\alpha}(X \mid \hat{\mu}, \hat{\sigma}^2) = \left[\hat{\mu} + z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{n}},\; \hat{\mu} + z_{1-\alpha/2}\frac{\hat{\sigma}}{\sqrt{n}}\right] = \left[\Phi^{-1}\left(\alpha/2 \,\Big|\, \hat{\mu}, \frac{\hat{\sigma}}{\sqrt{n}}\right),\; \Phi^{-1}\left(1-\alpha/2 \,\Big|\, \hat{\mu}, \frac{\hat{\sigma}}{\sqrt{n}}\right)\right]$

Illustration by a simulation example from the book:

alpha <- 0.05; mu <- 10; sigma <- 2; n <- 35
set.seed(222); x <- rnorm(n, mu, sigma)
mu.hat <- mean(x); s <- sd(x)
I.mu <- c(low = qnorm(alpha/2, mu, sigma/sqrt(n)),
          high = qnorm(1-alpha/2, mu, sigma/sqrt(n)))
I.mu.hat <- c(low = qnorm(alpha/2, mu.hat, sigma/sqrt(n)),
              high = qnorm(1-alpha/2, mu.hat, sigma/sqrt(n)))
I.mu.sigma.hat <- c(low = qnorm(alpha/2, mu.hat, s/sqrt(n)),
                    high = qnorm(1-alpha/2, mu.hat, s/sqrt(n)))
round(rbind("true interval" = I.mu,
            "estimated interval, sigma known" = I.mu.hat,
            "estimated interval, sigma unknown" = I.mu.sigma.hat), 2)
                                   low  high
true interval                     9.34 10.66
estimated interval, sigma known   9.17 10.50
estimated interval, sigma unknown 9.18 10.49
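The coverage claim can itself be checked by simulation; a sketch (added here, reusing alpha, mu, sigma, and n from the code above) estimating how often the σ-known interval contains µ:

covered <- replicate(1000, {
  x <- rnorm(n, mu, sigma)
  ci <- mean(x) + qnorm(c(alpha/2, 1-alpha/2)) * sigma/sqrt(n)
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)   # close to 1 - alpha = 0.95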

Computing a CI by MLE for the geometric parameter

Sample 10 trials x until the first success occurs; estimate π and its SE by MLE; construct the CI:

library(MASS)
pi <- 0.1; alpha <- 0.05   # note: this masks the built-in constant pi
n <- 10; x <- rgeom(n, pi)
fit <- fitdistr(x, "geometric")
pihat <- fit$estimate
se <- fit$sd
> pihat + c(-1, 1) * qnorm(1-alpha/2) * se
[1] 0.04257958 0.16360598

Example MLE: CI for mean daily energy intake

Daily energy intake (Altman, 1991, p. 183) of a group of women; recommended intake 7725 kJ

library(MASS)
alpha <- 0.05
x <- c(5260,5470,5640,6180,6390,6515,6805,7515,7515,8230,8770)
fit <- fitdistr(x, "normal")
muhat <- as.numeric(fit$estimate[1])
semuhat <- fit$sd[1]
lower <- as.numeric(muhat + qnorm(alpha/2) * semuhat)
upper <- as.numeric(muhat + qnorm(1-alpha/2) * semuhat)
> round(c(muhat=muhat, lower=lower, upper=upper), 1)
 muhat  lower  upper
6753.6 6110.1 7397.2

Conclusion: we are 95% confident that the population mean is in (6110.1, 7397.2)

Example MLE: estimation using mle

Daily energy intake (Altman, 1991, p. 183) of a group of women; recommended intake 7725 kJ

library(stats4)
X <- c(5260,5470,5640,6180,6390,6515,6805,7515,7515,8230,8770)
log.L <- function(mu = 7000, sigma = 1000){
  # minus log-likelihood of the normal density
  n <- length(X)
  return(n * log(2 * pi * sigma^2)/2 + sum((X - mu)^2 / (2 * sigma^2)))
}
fit <- mle(log.L)

Example MLE: output estimates and CI

Recommended energy intake is 7725 kJ

> summary(fit)
Maximum likelihood estimation

Call:
mle(minuslogl = log.L)

Coefficients:
       Estimate Std. Error
mu    6783.757   324.1454
sigma 1074.405   224.5461

-2 log L: 147.1546

> confint(fit)
Profiling...
          2.5 %    97.5 %
mu    6049.843 7457.935
sigma  754.606 1770.258

Conclusion: we are 95% confident that the population mean energy intake is in (6049.843, 7457.935)

Remarks on confidence intervals

- the true interval, centered around µ, is fixed
- the estimated intervals (σ known or unknown), centered around $\hat{\mu}$, have random limits that converge to the true limits

Effects on the CI:
- α decreases: confidence level 1 − α increases, so the CI length increases
- n increases: standard error s/√n decreases, so the CI length decreases

Teaching demonstration of the CI; interactive graphical visualization of confidence intervals:

library(TeachingDemos)
run.ci.examp(reps = 100, method="z", n=35)

Proportions

Examples: sex ratio, success ratio, ratio of surviving patients

N population size, $N_S$ number of successes in the population; n sample size, $n_S$ number of successes in the sample

$\pi = \frac{N_S}{N}, \qquad \hat{\pi} = p = \frac{n_S}{n}$

The number of successes in the population has a binomial density. The proportion p is approximated by a normal density if $np \ge 5$ and $n(1-p) \ge 5$, where

$E[p] = \pi, \qquad V[p] = \frac{\pi(1-\pi)}{n} = \frac{\sigma^2}{n}$

By the central limit theorem, the density of p is approximated by the normal density $\phi\left(p \,\Big|\, \pi, \sqrt{\frac{\pi(1-\pi)}{n}}\right)$

CI for proportions

$Z = \frac{\hat{\pi} - \pi}{\sigma/\sqrt{n}}$ tends to normal with mean 0 and variance 1, where $\sigma = \sqrt{\pi(1-\pi)}$

$P\left\{z_{\alpha/2} \le \frac{\hat{\pi} - \pi}{\sigma/\sqrt{n}} \le z_{1-\alpha/2}\right\} = 1 - \alpha$

After some algebra:

$P\left\{\hat{\pi} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \pi \le \hat{\pi} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right\} = 1 - \alpha$

The interval $\hat{\pi} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$ contains π in (1 − α) · 100% of samples of size n:

$I_{1-\alpha}(X \mid \pi, \sigma^2) = \left[\hat{\pi} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}},\; \hat{\pi} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right] = \left[\Phi^{-1}\left(\alpha/2 \,\Big|\, \hat{\pi}, \sqrt{\tfrac{\hat{\pi}(1-\hat{\pi})}{n}}\right),\; \Phi^{-1}\left(1-\alpha/2 \,\Big|\, \hat{\pi}, \sqrt{\tfrac{\hat{\pi}(1-\hat{\pi})}{n}}\right)\right]$

Computation of the CI for proportions

$I_{1-\alpha}(X \mid \hat{\pi}, \sigma^2)$ is estimated by

c(qnorm(alpha/2, pi.hat, sqrt(pi.hat*(1-pi.hat)/n)),
  qnorm(1-alpha/2, pi.hat, sqrt(pi.hat*(1-pi.hat)/n)))

Example: 39 patients out of 215 have asthma (Altman, 1991). Confidence interval for the proportion:

n <- 215; n.s <- 39; pi.hat <- n.s/n; alpha <- 0.05
round(c(low = qnorm(alpha/2, pi.hat, sqrt(pi.hat*(1-pi.hat)/n)),
        high = qnorm(1-alpha/2, pi.hat, sqrt(pi.hat*(1-pi.hat)/n))), 3)
  low  high
0.130 0.233

Comparison of CIs for proportions

> library(Hmisc)
> round(binconf(n.s, n, method="all"), 2)
           PointEst Lower Upper
Exact          0.18  0.13  0.24
Wilson         0.18  0.14  0.24
Asymptotic     0.18  0.13  0.23

Recommendation: use Wilson (cf. L.D. Brown, T.T. Cai and A. DasGupta (2001). Interval estimation for a binomial proportion (with discussion). Statistical Science, 16:101-133).

Bootstrap

- take 1000 random samples (with replacement) from the sample
- compute $\hat{\theta}_i^*$ from each re-sample
- compute the mean of $\hat{\theta}_1^*, \dots, \hat{\theta}_{1000}^*$
- compute quantiles of $\hat{\theta}_1^*, \dots, \hat{\theta}_{1000}^*$
- compute a histogram or density from $\hat{\theta}_1^*, \dots, \hat{\theta}_{1000}^*$

Example: Daily energy intake

Daily energy intake (Altman, 1991, p. 183) of a group of women; recommended intake 7725 kJ

x <- c(5260,5470,5640,6180,6390,6515,6805,7515,7515,8230,8770)
n <- length(x)
nboot <- 1000; bs <- double(nboot)
for (i in 1:nboot){
  resample <- x[sample(1:n, replace=TRUE)]
  bs[i] <- mean(resample)   # bootstrap statistic
}
mu.0 <- 7725; x.bar <- mean(x); x.bar.boot <- mean(bs)
> round(c(mu.0=mu.0, x.bar=x.bar, x.bar.boot=x.bar.boot), 3)
      mu.0      x.bar x.bar.boot
  7725.000   6753.636   6753.562

The sample mean and the bootstrap mean are much smaller than the recommended intake.
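The quantiles of the bootstrap distribution give a percentile confidence interval; a short sketch (added for illustration) continuing the code above:

quantile(bs, c(0.025, 0.975))   # 95% percentile bootstrap CI for the mean
# compare with the normal-theory interval (6110.1, 7397.2) found earlier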

hist(bs, freq=FALSE, xlim=c(5500,8000), col="lightblue",
     main="Histogram and density curve", sub="bootstrap means")
lines(density(bs)); abline(v=7725)
mtext("7725", side=1, at=7725, cex=1)