Lecture 6 Exploratory data analysis Point and interval estimation Dr. Wim P. Krijnen Lecturer Statistics University of Groningen Faculty of Mathematics and Natural Sciences Johann Bernoulli Institute for Mathematics and Computer Science October 26, 2010
Lecture overview Exploratory data analysis Numerical summaries: mean, median, measures of association between variables (Pearson product-moment correlation, Spearman's rank correlation coefficient); brief overview of exploratory data visualizations: histogram and density plot, quantile-quantile plot, empirical cumulative distribution function, box-and-whiskers plot Point estimation by Maximum Likelihood Interval estimation 2
Exploratory data analysis generates hypotheses (inductive) let the data speak think well about the data you do (and don't) have Assumptions: 1. random samples 2. finite variance 3. population density unchanged under sampling 3
Numerical summaries sample X_1, ..., X_n (rv) has realizations x_1, ..., x_n (in R)
the sample mean X̄ = (1/n) Σ_{i=1}^n X_i is a random variable with its own distribution (the distribution of the sample mean for samples of size n, possibly infinite)
x̄ = (1/n) Σ_{i=1}^n x_i is a fixed number without a distribution; always E[X̄] = µ

parameter | population (fixed) | estimator (rv) | estimate (fixed)
mean      | µ                  | X̄ = (1/n) Σ_{i=1}^n X_i | x̄ = (1/n) Σ_{i=1}^n x_i
variance  | σ²                 | S² = 1/(n−1) Σ_{i=1}^n (X_i − X̄)² | s² = 1/(n−1) Σ_{i=1}^n (x_i − x̄)² 4
Determinations of copper in wholemeal flour chem: 24 determinations of copper in wholemeal flour (ppm) Large study suggests µ = 3.68 (Venables & Ripley, 2002) Median = middle value of data (50% >, 50% <) trimmed mean = mean leaving out a percentage of extreme data
> library(MASS)
> c(mean(chem), median(chem))
[1] 4.280417 3.385000
> x <- sort(chem, decreasing=TRUE, index.return=TRUE)
> x$x
[1] 28.95 5.28 3.77 3.70 3.70 3.70 3.70 3.60 3
[13] 3.37 3.10 3.03 3.03 2.90 2.80 2.70 2.50 2
> plot(x$x)
> mean(chem, trim = 1/24) #exclude smallest and largest
[1] 3.253636
> mean(x$x[2:23]) #= 3.253636
[Figure: sorted chem data, plot of x$x against index] 6
Measures of spread of data Range = largest minus smallest Sample variance S² = 1/(n−1) Σ_{i=1}^n (X_i − X̄)² Interquartile range (IQR) = upper quartile − lower quartile; the lower/upper quartile have 25% / 75% of the values below them
> range(chem)
[1] 2.20 28.95
> c(var(chem), var(x$x[2:23]))
[1] 28.0624042 0.4440338 #great difference!
> summary(chem)
  Min. 1st Qu. Median   Mean 3rd Qu.   Max.
 2.200   2.775  3.385  4.280   3.700 28.950
> IQR(chem)
[1] 0.925
> quantile(chem, 3/4) - quantile(chem, 1/4)
0.925 7
Chebyshev and empirical rules P(|X − µ| < kσ) ≥ 1 − 1/k² the probability is at least 1 − 1/k² that X takes a value within k standard deviations of the mean; it is general, but often imprecise Empirical rule for approximately normal data: 68% of observations within 1 standard deviation of the mean 95% of observations within 2 sd of the mean 99.7% of observations within 3 sd of the mean
> pnorm(1) - pnorm(-1)
[1] 0.6826895
> pnorm(2) - pnorm(-2)
[1] 0.9544997 8
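The two rules on this slide can be put side by side; a small sketch for k = 1, 2, 3:

```r
# Chebyshev's bound holds for any distribution with finite variance;
# the normal probabilities behind the empirical rule are much sharper.
k <- 1:3
chebyshev <- 1 - 1/k^2            # at least this much mass within k sd
normal    <- pnorm(k) - pnorm(-k) # exact mass for the normal distribution
round(rbind(k, chebyshev, normal), 3)
```

For k = 2 Chebyshev guarantees only 75%, while the normal distribution gives about 95.4%.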
Measures of association between variables Correlation coefficient: measure of strength of linear relationship (Pearson)
ρ = COV(X, Y) / √(V[X] V[Y]) = E[(X − µ_X)(Y − µ_Y)] / ( √E[(X − µ_X)²] √E[(Y − µ_Y)²] )
sample version: ρ̂ = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / ( √Σ_{i=1}^n (x_i − x̄)² √Σ_{i=1}^n (y_i − ȳ)² )
Properties −1 ≤ ρ ≤ 1; bounded measure of linear relationship if ρ = ±1 there are a and b such that Y = aX + b ρ > 0 means X and Y increase/decrease together ρ is not robust against outliers under normality ρ measures stochastic dependence and ρ = 0 implies independence ρ is symmetric: ρ(X, Y) = ρ(Y, X) 9
Teaching Demonstrations Interactive graphical visualizations of the correlation coefficient Minimize other screens and interactively use the Tk slider
library(TeachingDemos)
run.cor2.examp(n=500, wait=FALSE)
sensitivity to outliers: put a few points in a small circle, add one far away
put.points.demo()
Conclusion: an extreme outlier can have a large influence on a (non-robust) statistic 10
Spearman's rank correlation coefficient in case of outliers: use the rank correlation coefficient
ρ̂ = 12 / (n(n−1)(n+1)) Σ_{i=1}^n ( rank(x_i) − (n+1)/2 ) ( rank(y_i) − (n+1)/2 )
Example: Is there a correlation between Hunter's L measure of lightness (x) and the consumer panel scores averaged over 80 panelists (y) for 9 lots of canned tuna?
> x <- c(44.4, 45.9, 41.9, 53.3, 44.7, 44.1, 50.7, 45.2, 60.1)
> y <- c( 2.6, 3.1, 2.5, 5.0, 3.6, 4.0, 5.2, 2.8, 3.8) 11
Assessment of tuna quality (Hollander & Wolfe, 1973)
> n <- length(x)
> sumr <- sum((rank(x)-(n+1)/2)*(rank(y)-(n+1)/2))
> (rhohat <- 12 * sumr /(n*(n-1)*(n+1)))
[1] 0.6 #value of Spearman's rho
> cor.test(x,y,method = "spearman")
Spearman's rank correlation rho
data: x and y
S = 48, p-value = 0.0968
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.6
ρ̂ has an asymptotically normal distribution (CLT), which yields the p-value Conclusion: H_0: ρ = 0 is not rejected 12
Comparing Pearson's p.m.c.c. with Spearman's r.c.c.
set.seed(110)
x <- rnorm(15); y <- rnorm(15) #rho = 0
x[16] <- 10; y[16] <- 10
> c(cor.test(x,y,method = "pearson")$estimate,
+   cor.test(x,y,method = "spearman")$estimate)
      cor       rho
0.8768005 0.3676471
Spearman's rank correlation coefficient ρ̂ is more robust against outliers than Pearson's correlation coefficient; in case of suspicion, check for differences by computation or plotting: here the assumption of normality does not hold 13
Basic visualizations of univariate data sets Histogram: estimates the density by presenting (relative) frequencies in consecutive intervals (bins) as heights of bars (hist) Density plot: smooth graph representing estimated proportions per bin (plot(density(x))) Quantile-quantile plot: represents as points the quantiles of the first distribution (x-coordinate) against the same quantiles of the second (theoretical) distribution; all points on the y = x line imply a perfect match (qqplot) Empirical cumulative distribution function: step function F_n that jumps i/n at observation values, where i is the number of tied observations at that value (plot(ecdf(x))) Box-and-whiskers plot: box between Q_1 and Q_3, with a line segment for the median Q_2, whiskers for minimum and maximum (boxplot) 14
R code for basic visualizations par(mfrow = c(2, 2)) x <- rnorm(100) hist(x,freq=false) qqnorm(x); qqline(x) plot(ecdf(x)) boxplot(x) par(mfrow = c(1, 1))
Illustrations of the box-and-whiskers plot five-number summary of data: minimum, first quartile Q_1, median Q_2, third quartile Q_3, maximum. box between Q_1 and Q_3, whisker from the minimum to Q_1 and from Q_3 to the maximum Example: pulse measures 62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80. min = 62, Q_1 = 69, median = 74, Q_3 = 77, max = 80
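The five-number summary of the pulse data can be checked in R with fivenum (a small sketch; note that fivenum uses hinges, which for other data sets can differ slightly from quantile's default quartiles):

```r
pulse <- c(62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80)
fivenum(pulse)   # min, Q1, median, Q3, max: 62 69 74 77 80
boxplot(pulse)   # the corresponding box-and-whiskers plot
```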
Outlier Description of an outlier: a data point far away from the bulk of the data How far? outlier < Q_1 − 1.5·IQR or outlier > Q_3 + 1.5·IQR Example 12 pulse measures: 62, 64, 68, 70, 70, 74, 74, 76, 76, 78, 78, 80. Q_1 = 69, Q_3 = 77, IQR = 77 − 69 = 8 outlier < 69 − 1.5·8 = 57 outlier > 77 + 1.5·8 = 89 Conclusion: there are no outliers 18
Example of an outlier Example: Radish growth in mm after 3 days: 3, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21 median = (9 + 10)/2 = 9.5 Q_1 = 7, Q_3 = 10, IQR = 10 − 7 = 3 outlier < Q_1 − 1.5·IQR = 7 − 1.5·3 = 2.5 outlier > Q_3 + 1.5·IQR = 10 + 1.5·3 = 14.5 the outliers 20, 21 are plotted as small circles Remark: there are statistical tests for outliers (see library outliers) 19
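The fences for the radish data can be computed directly; a sketch of the 1.5·IQR rule from this slide:

```r
radish <- c(3, 5, 5, 7, 7, 8, 9, 10, 10, 10, 10, 14, 20, 21)
q1 <- unname(quantile(radish, 1/4))   # 7
q3 <- unname(quantile(radish, 3/4))   # 10
iqr <- q3 - q1                        # 3
fences <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)  # 2.5 and 14.5
radish[radish < fences["lower"] | radish > fences["upper"]]  # 20 21
```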
Series of box plots Use a factor of m groups to produce m box plots
> data(PlantGrowth)
> boxplot(PlantGrowth$weight ~ PlantGrowth$group)
Point and interval estimation: Notation Sample X_1, ..., X_n (r.v.) with realizations x_1, ..., x_n

Parameter          | population (fixed) | estimator (random) | estimate (fixed)
Mean               | µ                  | µ̂ = X̄             | x̄
Variance           | σ²                 | S²                 | s²
Standard deviation | σ                  | S                  | s
Proportion         | π                  | π̂                 | p
Intensity          | λ                  | λ̂                 | l

X̄ = (1/n) Σ_{i=1}^n X_i, x̄ = (1/n) Σ_{i=1}^n x_i
S² = 1/(n−1) Σ_{i=1}^n (X_i − X̄)², s² = 1/(n−1) Σ_{i=1}^n (x_i − x̄)², S = √S², s = √s²
p := (number of successes in the sample) / (sample size) = n_S / n
l := (number of counts in the sample) / (sample size) = n_C / n
Maximum likelihood estimation (optional) n observations x_1, ..., x_n from X_1, ..., X_n iid rv the likelihood of the data given model parameters θ is
L(θ|x) = Π_{i=1}^n P(X_i = x_i | θ)
the log likelihood equals
ℓ(θ|x) = log L(θ|x) = Σ_{i=1}^n log P(X = x_i | θ)
θ̂ is the Maximum Likelihood Estimator (MLE) of θ if it maximizes the log likelihood θ̂ is a statistic, a function of X_1, ..., X_n, hence a random variable If the sample size n is large enough, then ℓ has a maximum If ℓ is differentiable, try to solve ∂ℓ(θ|x)/∂θ_i = 0, i = 1, ..., m, or maximize ℓ numerically (mainly Newton-type algorithms) 22
Maximum likelihood estimation (optional)
√n (θ̂ − θ) → N(0, 1/I(θ)), where
I(θ) = E[ (∂/∂θ) log f(X) ]² = ∫ (f′(x))² / f(x) dx
denotes the information number (matrix); the amount of information about θ contained in X
Example: the MLE of the Poisson intensity parameter is λ̂ = X̄
P(X = x | λ) = f(x) = (λ^x / x!) e^{−λ}
log f(x) = log( (λ^x / x!) e^{−λ} ) = −λ + x log λ − log x!
(∂/∂λ) log f(x) = x/λ − 1 = (x − λ)/λ
I(λ) = E[ (X − λ)/λ ]² = E(X − λ)² / λ² = 1/λ
√n (λ̂ − λ) → N(0, λ) 23
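The closed-form result λ̂ = X̄ can be checked by maximizing the Poisson log-likelihood numerically; a sketch on simulated data (the true λ = 4 and n = 200 are arbitrary choices for illustration):

```r
set.seed(1)
x <- rpois(200, lambda = 4)
# minus log-likelihood of the Poisson sample as a function of lambda
negloglik <- function(l) -sum(dpois(x, l, log = TRUE))
lam.hat <- optimize(negloglik, interval = c(0.01, 20))$minimum
c(numeric.mle = lam.hat, sample.mean = mean(x))  # agree up to tolerance
```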
Parameter estimation for the normal distribution (optional)
L(θ|x) = Π_{i=1}^n (1/(σ√(2π))) exp{ −(1/2) ((x_i − µ)/σ)² }
       = (1/(σ√(2π)))^n exp{ −(1/2) Σ_{i=1}^n ((x_i − µ)/σ)² }
ℓ(θ|x) = −(n/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (x_i − µ)²
       = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (x_i − µ)²
∂ℓ(θ|x)/∂µ = (1/σ²) Σ_{i=1}^n (x_i − µ) = 0
∂ℓ(θ|x)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − µ)² = 0
Solving the first equation:
Σ_{i=1}^n (x_i − µ) = 0 ⟺ Σ_{i=1}^n x_i − nµ = 0 ⟺ µ = (1/n) Σ_{i=1}^n x_i, so µ̂ = (1/n) Σ_{i=1}^n X_i = X̄
Solving the second equation:
n/(2σ²) = (1/(2σ⁴)) Σ_{i=1}^n (x_i − µ)² ⟺ σ² = (1/n) Σ_{i=1}^n (x_i − µ)²
so σ̂² = (1/n) Σ_{i=1}^n (x_i − µ̂)² = ((n − 1)/n) s² 25
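The relation σ̂² = ((n − 1)/n) s² between the MLE and the unbiased sample variance can be verified numerically; a sketch on simulated data (mean 5, sd 3, n = 50 are arbitrary illustration choices):

```r
set.seed(2)
x <- rnorm(50, mean = 5, sd = 3)
n <- length(x)
sigma2.mle <- mean((x - mean(x))^2)          # MLE: divide by n, not n-1
c(mle = sigma2.mle, scaled.s2 = (n - 1)/n * var(x))  # identical values
```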
Desirable properties of estimators minimal mean squared error (MSE): E[θ̂ − θ]² = V[θ̂] + (E[θ̂] − θ)² no bias: E[θ̂] = θ (no systematic error) minimal variance: V[θ̂] = E(θ̂ − E[θ̂])² θ̂_1 is more precise (efficient) than θ̂_2 if V[θ̂_1] < V[θ̂_2] θ̂ is efficient if V[θ̂] is the smallest possible θ̂ is consistent if V[θ̂] → 0 as n → ∞ (LLN holds!) The MLE may be slightly biased, but it is consistent and efficient 26
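The slight bias of the MLE of σ² mentioned here can be seen in a small simulation (a sketch; the small sample size n = 5 exaggerates the effect):

```r
set.seed(3)
m <- 20000; n <- 5          # many small samples from N(0, 1), true sigma^2 = 1
est <- replicate(m, {
  x <- rnorm(n)
  c(mle = mean((x - mean(x))^2), s2 = var(x))
})
rowMeans(est)  # mle averages near (n-1)/n = 0.8, s2 averages near 1
```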
Confidence interval on the mean, σ known Z is normally distributed with mean 0 and variance 1 The probability that Z takes values between z_{α/2} and z_{1−α/2} is P{ z_{α/2} ≤ Z ≤ z_{1−α/2} } = 1 − α The basis of confidence intervals! Remember Φ(z | 0, 1) = P(Z ≤ z) and z_{α/2} = Φ^{−1}(α/2) = qnorm(α/2, 0, 1) If α = .05, then z_{0.025} = Φ^{−1}(0.025) = qnorm(0.025) = −1.96 and z_{0.975} = Φ^{−1}(0.975) = qnorm(0.975) = 1.96 P{ z_{α/2} ≤ Z ≤ z_{1−α/2} } = P{ −1.96 ≤ Z ≤ 1.96 } = 0.95 27
Confidence interval for µ X_1, ..., X_n iid rv from a normal population with mean µ, variance σ² E[µ̂] = E[X̄] = µ and V[µ̂] = V[X̄] = σ²/n, so that Z = (X̄ − µ)/(σ/√n) is normally distributed with mean 0 and variance 1
P{ z_{α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{1−α/2} } = 1 − α
or, equivalently, after some algebra,
P{ X̄ + z_{α/2} σ/√n ≤ µ ≤ X̄ + z_{1−α/2} σ/√n } = 1 − α
the interval X̄ ± z_{1−α/2} σ/√n contains µ in 100(1 − α)% (e.g. 95%) of repeated samples of size n 28
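The claim that the interval contains µ in about 95% of repeated samples can be checked by simulation; a sketch with the same µ, σ, and n as the book example on the later slides:

```r
set.seed(4)
mu <- 10; sigma <- 2; n <- 35; alpha <- 0.05
covered <- replicate(10000, {
  x <- rnorm(n, mu, sigma)
  half <- qnorm(1 - alpha/2) * sigma / sqrt(n)  # half-width, sigma known
  mean(x) - half <= mu && mu <= mean(x) + half  # does the interval cover mu?
})
mean(covered)  # close to 0.95
```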
Algebra of rewriting the interval
P{ z_{α/2} ≤ (X̄ − µ)/(σ/√n) ≤ z_{1−α/2} } =
P{ z_{α/2} σ/√n ≤ X̄ − µ ≤ z_{1−α/2} σ/√n } =
P{ −z_{1−α/2} σ/√n ≤ µ − X̄ ≤ −z_{α/2} σ/√n } =
P{ X̄ + z_{α/2} σ/√n ≤ µ ≤ X̄ + z_{1−α/2} σ/√n }
using that z_{α/2} = −z_{1−α/2} 29
CI notation in the literature
P{ X̄ + z_{α/2} σ/√n ≤ µ ≤ X̄ + z_{1−α/2} σ/√n } = 1 − α
the 1 − α confidence interval is equal to
[ X̄ + z_{α/2} σ/√n, X̄ + z_{1−α/2} σ/√n ]
= [ µ̂ + Φ^{−1}(α/2 | 0, 1) σ/√n, µ̂ + Φ^{−1}(1 − α/2 | 0, 1) σ/√n ]
= [ Φ^{−1}(α/2 | µ̂, σ/√n), Φ^{−1}(1 − α/2 | µ̂, σ/√n) ]
This confidence interval is denoted by I_{1−α}(X | µ̂) The unknown population mean µ is estimated by µ̂ on the basis of the data x_1, ..., x_n 30
Three confidence intervals
true interval:
I_{1−α}(X | µ, σ²) = [ µ + z_{α/2} σ/√n, µ + z_{1−α/2} σ/√n ]
= [ Φ^{−1}(α/2 | µ, σ/√n), Φ^{−1}(1 − α/2 | µ, σ/√n) ]
estimated interval, σ known:
I_{1−α}(X | µ̂, σ²) = [ µ̂ + z_{α/2} σ/√n, µ̂ + z_{1−α/2} σ/√n ]
= [ Φ^{−1}(α/2 | µ̂, σ/√n), Φ^{−1}(1 − α/2 | µ̂, σ/√n) ]
estimated interval, σ unknown (most relevant!):
I_{1−α}(X | µ̂, σ̂²) = [ µ̂ + z_{α/2} σ̂/√n, µ̂ + z_{1−α/2} σ̂/√n ]
= [ Φ^{−1}(α/2 | µ̂, σ̂/√n), Φ^{−1}(1 − α/2 | µ̂, σ̂/√n) ]
Illustration by a simulation example from the book: 31
alpha <- 0.05; mu <- 10; sigma <- 2; n <- 35
set.seed(222); x <- rnorm(n, mu, sigma)
mu.hat <- mean(x); s <- sd(x)
I.mu <- c(low = qnorm(alpha/2, mu, sigma/sqrt(n)),
          high = qnorm(1 - alpha/2, mu, sigma/sqrt(n)))
I.mu.hat <- c(low = qnorm(alpha/2, mu.hat, sigma/sqrt(n)),
              high = qnorm(1 - alpha/2, mu.hat, sigma/sqrt(n)))
I.mu.sigma.hat <- c(low = qnorm(alpha/2, mu.hat, s/sqrt(n)),
                    high = qnorm(1 - alpha/2, mu.hat, s/sqrt(n)))
round(rbind("true interval" = I.mu,
            "estimated interval, sigma known" = I.mu.hat,
            "estimated interval, sigma unknown" = I.mu.sigma.hat), 2)
                                   low  high
true interval                     9.34 10.66
estimated interval, sigma known   9.17 10.50
estimated interval, sigma unknown 9.18 10.49
Computing a CI by MLE for the geometric parameter Sample 10 series of trials until the first success occurs Estimate π and its SE by MLE Construct a CI
library(MASS)
pi <- 0.1; alpha <- 0.05; n <- 10
x <- rgeom(n, pi)
fit <- fitdistr(x, "geometric")
pihat <- fit$estimate
se <- fit$sd
> pihat + c(-1, 1) * qnorm(1 - alpha/2) * se
[1] 0.04257958 0.16360598
Example MLE: CI for mean daily energy intake Daily energy intake (Altman, 1991, p.183) of a group of women; recommended intake 7725 kJ
library(MASS)
alpha <- 0.05
x <- c(5260,5470,5640,6180,6390,6515,6805,7515,7515,
8230,8770)
fit <- fitdistr(x, "normal")
muhat <- as.numeric(fit$estimate[1])
semuhat <- fit$sd[1]
lower <- as.numeric(muhat + qnorm(alpha/2) * semuhat)
upper <- as.numeric(muhat + qnorm(1-alpha/2) * semuhat)
> round(c(muhat=muhat, lower=lower, upper=upper), 1)
 muhat  lower  upper
6753.6 6110.1 7397.2
Conclusion: we are 95% certain that the population mean is in (6110.1, 7397.2)
Example MLE: estimation using mle Daily energy intake (Altman, 1991, p.183) of a group of women; recommended intake 7725 kJ
library(stats4)
X <- c(5260,5470,5640,6180,6390,6515,6805,7515,7515,
8230,8770)
log.l <- function(mu = 7000, sigma = 1000){
  # minus log-likelihood of the normal density
  n <- length(X)
  return(n * log(2 * pi * sigma^2)/2 + sum((X - mu)^2 / (2 * sigma^2)))
}
fit <- mle(log.l)
35
Example MLE: output estimates and CI The recommended energy intake is 7725 kJ
> summary(fit)
Maximum likelihood estimation
Call: mle(minuslogl = log.l)
Coefficients:
      Estimate Std. Error
mu    6783.757   324.1454
sigma 1074.405   224.5461
-2 log L: 147.1546
> confint(fit)
Profiling...
          2.5 %    97.5 %
mu    6049.843 7457.935
sigma  754.606 1770.258
Conclusion: we are 95% certain that the population mean energy intake is in (6049.843, 7457.935) 36
Remarks on the confidence interval the true interval centered around µ is fixed the estimated intervals (σ known or unknown) centered around µ̂ have random limits that converge to the true interval Effects on the CI: if α decreases, the confidence level 1 − α increases and the CI length increases; if n increases, the standard error s/√n decreases and the CI length decreases Teaching demonstration of CI Interactive graphical visualization of confidence intervals:
library(TeachingDemos)
run.ci.examp(reps = 100, method="z", n=35) 37
Proportions sex ratio, success ratio, ratio of surviving patients N population size, N_S number of successes in the population n sample size, n_S number of successes in the sample
π = N_S / N, π̂ = p = n_S / n
The number of successes in the population has a binomial density; the proportion p is approximated by a normal density if np ≥ 5 and n(1 − p) ≥ 5, where
E[p] = π, V[p] = π(1 − π)/n = σ²/n
by the central limit theorem the density of p ≈ the normal density φ(p | π, √(π(1 − π)/n)) 38
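The moments E[p] = π and V[p] = π(1 − π)/n can be checked by simulating binomial samples; a sketch reusing the asthma example's n = 215 with an assumed π = 0.18:

```r
set.seed(5)
n <- 215; pi.true <- 0.18
p <- rbinom(10000, size = n, prob = pi.true) / n  # 10000 sample proportions
c(mean = mean(p), var = var(p),
  theory.var = pi.true * (1 - pi.true) / n)       # mean near 0.18, vars agree
```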
CI for proportions
Z = (p − π)/(σ/√n) tends to normal with mean 0 and variance 1, where σ = √(π(1 − π))
P{ z_{α/2} ≤ (p − π)/(σ/√n) ≤ z_{1−α/2} } = 1 − α
after some algebra
P{ p + z_{α/2} σ/√n ≤ π ≤ p + z_{1−α/2} σ/√n } = 1 − α
the interval p ± z_{1−α/2} σ/√n contains π in 100(1 − α)% of samples of size n;
I_{1−α}(X | π̂, σ̂²) = [ π̂ + z_{α/2} σ̂/√n, π̂ + z_{1−α/2} σ̂/√n ]
= [ Φ^{−1}(α/2 | π̂, √(π̂(1 − π̂)/n)), Φ^{−1}(1 − α/2 | π̂, √(π̂(1 − π̂)/n)) ] 39
Computation of the CI for proportions
I_{1−α}(X | π̂, σ̂²) = [ Φ^{−1}(α/2 | π̂, √(π̂(1 − π̂)/n)), Φ^{−1}(1 − α/2 | π̂, √(π̂(1 − π̂)/n)) ]
estimated by
c(qnorm(alpha/2, pi.hat, sqrt(pi.hat*(1-pi.hat)/n)),
  qnorm(1-alpha/2, pi.hat, sqrt(pi.hat*(1-pi.hat)/n)))
Example: 39 patients out of 215 have asthma (Altman, 1991) Confidence interval for the proportion
n <- 215; n.s <- 39; pi.hat <- n.s/n; alpha <- 0.05
round(c(low = qnorm(alpha/2, pi.hat, sqrt(pi.hat*(1-pi.hat)/n)),
        high = qnorm(1-alpha/2, pi.hat, sqrt(pi.hat*(1-pi.hat)/n))), 3)
  low  high
0.130 0.233
Comparison of CIs for proportions
> library(Hmisc)
> round(binconf(n.s, n, method="all"), 2)
           PointEst Lower Upper
Exact          0.18  0.13  0.24
Wilson         0.18  0.14  0.24
Asymptotic     0.18  0.13  0.23
Recommendation: use Wilson (cf. L.D. Brown, T.T. Cai and A. DasGupta (2001). Interval estimation for a binomial proportion (with discussion). Statistical Science, 16:101-133.) 41
Bootstrap take 1000 random samples (with replacement) from the sample compute θ̂_i from each re-sample compute the mean of θ̂_1, ..., θ̂_1000 compute quantiles of θ̂_1, ..., θ̂_1000 compute a histogram or density from θ̂_1, ..., θ̂_1000 42
Example: Daily energy intake Daily energy intake (Altman, 1991, p.183) of a group of women; recommended intake 7725 kJ
x <- c(5260,5470,5640,6180,6390,6515,6805,7515,7515,
8230,8770)
n <- length(x)
nboot <- 1000; bs <- double(nboot)
for (i in 1:nboot){
  resample <- x[sample(1:n, replace=TRUE)]
  bs[i] <- mean(resample)   #boot statistic
}
mu.0 <- 7725; x.bar <- mean(x); x.bar.boot <- mean(bs)
> round(c(mu.0=mu.0, x.bar=x.bar, x.bar.boot=x.bar.boot), 3)
      mu.0      x.bar x.bar.boot
  7725.000   6753.636   6753.562
The sample mean and the bootstrap mean are much smaller than the recommended intake 43
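The bootstrap quantiles listed among the bootstrap steps give a percentile confidence interval for the mean; a sketch on the same intake data (self-contained, so the resampling is repeated here):

```r
x <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515,
       8230, 8770)
set.seed(6)
bs <- replicate(1000, mean(sample(x, replace = TRUE)))  # bootstrap means
quantile(bs, c(0.025, 0.975))  # percentile bootstrap 95% CI for the mean
```

The resulting interval is comparable to the normal-theory interval (6110.1, 7397.2) from the fitdistr slide.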
hist(bs, freq=FALSE, xlim=c(5500,8000), col="lightblue",
  main="Histogram and density curve", sub="bootstrap means")
lines(density(bs)); abline(v=7725)
mtext("7725", side=1, at=7725, cex=1) 44