Boxplots A boxplot, or box-and-whisker plot, is a convenient way of summarising data, particularly when the data is made up of several groups.


Boxplots

A boxplot, or box-and-whisker plot, is a convenient way of summarising data, particularly when the data is made up of several groups.

[Figure: a boxplot, with the largest non-extreme value, upper quartile, median, lower quartile, smallest non-extreme value, and an extreme value labelled.]

The box extends from one quartile to the other, and the central line in the box is the median. The whiskers are drawn from the box to the most extreme observations that are no more than 1.5 × IQR from the box. (Alternatively, r × IQR can be used for other values of r.) Observations which are more extreme than this are shown separately.

Example: gross national income per capita (World Bank data, 2009) for 50 sovereign states in Europe. The two most extreme values are Monaco and Liechtenstein. [Figure: boxplot of GNI per capita in US dollars.]

Now for countries worldwide (including Europe): the most extreme values are Monaco, Liechtenstein, Norway, Luxembourg, Switzerland, Denmark and Sweden. [Figures: boxplots of GNI per capita and of log(GNI per capita).]
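The whisker rule above is easy to state in code. A minimal sketch (the helper name and the data in the test are made up for illustration):

```python
# Five-number summary and whisker limits under the 1.5 * IQR rule
# (illustrative sketch using Python's statistics module).
import statistics

def box_summary(xs, r=1.5):
    xs = sorted(xs)
    q1, med, q3 = statistics.quantiles(xs, n=4)  # lower quartile, median, upper quartile
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - r * iqr, q3 + r * iqr
    # Whiskers extend to the most extreme observations inside the fences;
    # anything outside the fences is plotted separately as an extreme point.
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    outliers = [x for x in xs if x < lo_fence or x > hi_fence]
    return (min(inside), q1, med, q3, max(inside)), outliers
```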

Parallel boxplots are often useful to show the differences between subgroups of the data.

Below: the InsectSprays data from R, insect counts for spray types A to F. [Figure: parallel boxplots of insect count by type of spray.]

Comparative boxplots of transformed GCSE scores by A-level chemistry exam score (0 = worst, 2, 4, 6, 8, 10 = best) and gender (M, F). [Figure: parallel boxplots of transformed GCSE score.]

Comparing N(0, 1) and t distributions

A t-distribution with r degrees of freedom has pdf

f(x) ∝ (1 + x^2/r)^(−(r+1)/2),  −∞ < x < ∞.

[More on t-distributions later.] Consider r = 5. [Figure: the N(0, 1) pdf and the t_5 pdf.]

Suppose we simulate data (x_1, ..., x_25) from a t_5 distribution. Using Q-Q plots we can consider the questions:

is it reasonable to assume (x_1, ..., x_25) is from a N(0, 1)?
is it reasonable to assume (x_1, ..., x_25) is from a t_5?
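The t density above can be evaluated directly; a short sketch, including the normalising constant Γ((r+1)/2)/(√(rπ) Γ(r/2)) that the proportionality sign omits:

```python
import math

def t_pdf(x, r):
    # Density of the t distribution with r degrees of freedom:
    # f(x) = c * (1 + x^2/r)^(-(r+1)/2), with normalising constant c.
    c = math.gamma((r + 1) / 2) / (math.sqrt(r * math.pi) * math.gamma(r / 2))
    return c * (1 + x * x / r) ** (-(r + 1) / 2)

def norm_pdf(x):
    # Standard normal density, for comparison.
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
```

The t_5 density is lower at the centre and higher in the tails than the N(0, 1) density, and approaches it as r grows.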

Q-Q plot of the data against a N(0, 1): sample quantiles plotted against theoretical N(0, 1) quantiles. A N(0, 1) assumption is not good, as expected. [Figure.]

Q-Q plot of the data against a t_5: a t_5 assumption is OK, as expected. [Figure.]

Normal Q-Q plots

Michelson-Morley (1879) speed of light data: the speed of light (in km/s, minus 299,000) for experiments 1 to 5, with 20 observations from each experiment. [Figure: observations by experiment number, with the true speed marked.]

Is a N(µ, σ²) distribution plausible for these observations? [Figure: normal Q-Q plot for the Michelson-Morley data.] From the plot, a normal distribution seems reasonable.
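A normal Q-Q plot just pairs the ordered sample with standard normal quantiles. A hypothetical sketch using the k/(n+1) plotting positions:

```python
# Coordinates for a normal Q-Q plot: the k-th order statistic x_(k) is
# plotted against the standard normal quantile Phi^{-1}(k/(n+1)).
from statistics import NormalDist

def normal_qq(xs):
    n = len(xs)
    nd = NormalDist()  # N(0, 1)
    theo = [nd.inv_cdf(k / (n + 1)) for k in range(1, n + 1)]
    return list(zip(theo, sorted(xs)))
```

If the points lie close to a straight line, a normal assumption is plausible; the slope and intercept estimate σ and µ.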

Below: the precip data from R, average precipitation (rainfall in inches per year) for 70 US cities. [Figure: normal Q-Q plot.] A normal assumption doesn't look good: there are problems in the lower tail.

Below: Newcomb's (1882) speed of light data. The measurements are the times (recorded as deviations from 24,800 nanoseconds) for light to travel about 7,400 m. The currently accepted time (on this scale) is 33.0. [Figure: histogram of Newcomb's data.]

This time the problems are different: two (very small) outlying observations. If these are removed, a normal assumption looks OK. [Figures: Q-Q plot of Newcomb's data; Q-Q plot after deleting the two points.]

Example: Danish fire data (Davison, 2003). Data on the times, and amounts, of major insurance claims due to fire in Denmark. [Figure: claim amount against claim index.]

Following Davison, let's consider the 254 largest claim amounts, and the interarrival times between these claims.
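The Q-Q plots that follow use exponential plotting positions, pairing the ordered values with the quantiles −log(1 − k/(n+1)) of the exponential(1) distribution. A sketch:

```python
import math

# Exponential Q-Q plot coordinates: the ordered values x_(k) against the
# exponential(1) quantiles -log(1 - k/(n+1)).  For a Pareto Q-Q plot, take
# logs of the ordered values first and plot against the same quantiles.
def exp_qq(xs):
    n = len(xs)
    theo = [-math.log(1 - k / (n + 1)) for k in range(1, n + 1)]
    return list(zip(theo, sorted(xs)))
```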

Is it reasonable to assume exponential interarrival times? See below: the interarrivals look fairly close to exponential. Is it reasonable to assume exponential claim amounts? See below: an exponential assumption is not reasonable.

[Figures: exponential Q-Q plots of the ordered interarrival times x_(k), and of the ordered claim amounts, against −log(1 − k/(n+1)).]

Is it reasonable to assume Pareto claim amounts? See below: the Pareto fits fairly well. [Figure: Pareto Q-Q plot, log(ordered claim amounts) against −log(1 − k/(n+1)).]

Multivariate normal distribution

[Multivariate/bivariate normal: recap from Part A Probability.]

(1) Let Z_1, ..., Z_p be iid N(0, 1). Then the joint pdf of Z_1, ..., Z_p is

f(z) = prod_{j=1}^{p} (2π)^(−1/2) exp(−z_j^2/2)
     = (2π)^(−p/2) exp(−(1/2) sum_{j=1}^{p} z_j^2)
     = (2π)^(−p/2) exp(−(1/2) z^T z),   z ∈ R^p.

We write Z ~ N(0, I), where 0 is a p-vector of zeroes and I is the p × p identity matrix.

(2) Now let µ be a p-vector and let Σ be a p × p symmetric, positive definite matrix. Then X = (X_1, ..., X_p)^T is said to have a multivariate normal distribution with mean vector µ and covariance matrix Σ if its pdf is

f(x) = (2π)^(−p/2) (det Σ)^(−1/2) exp(−(1/2)(x − µ)^T Σ^(−1) (x − µ)).

We write X ~ N(µ, Σ). When X ~ N(µ, Σ):

µ is the mean vector: E(X_j) = µ_j
Σ is the covariance matrix: var(X_j) = Σ_jj and cov(X_j, X_k) = Σ_jk
the marginal distribution of X_j is X_j ~ N(µ_j, Σ_jj).

This pdf reduces to the N(0, I) pdf when µ = 0 and Σ = I.

Example (bivariate normal distribution). Suppose p = 2. Let −1 < ρ < 1 and

µ = (0, 0)^T,   Σ = ( 1  ρ ; ρ  1 ).

Then the pdf of (X_1, X_2) is

f(x_1, x_2) = (1 / (2π sqrt(1 − ρ^2))) exp( −(x_1^2 − 2ρ x_1 x_2 + x_2^2) / (2(1 − ρ^2)) ).

The marginal distributions are X_1 ~ N(0, 1) and X_2 ~ N(0, 1). The quantity ρ is the correlation between X_1 and X_2. X_1 and X_2 are independent if and only if ρ = 0.

[Figures: the pdf of (X_1, X_2) for ρ = 0, 0.5, 0.9, and a contour plot for ρ = 0.5.]
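As a check on the algebra, the bivariate density can be coded directly; a sketch (at ρ = 0 it factorises into the product of two N(0, 1) densities):

```python
import math

# Bivariate normal density with standard N(0,1) margins and correlation rho,
# as in the example above.
def bvn_pdf(x1, x2, rho):
    z = x1 * x1 - 2 * rho * x1 * x2 + x2 * x2
    return math.exp(-z / (2 * (1 - rho ** 2))) / (2 * math.pi * math.sqrt(1 - rho ** 2))
```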

Opinion polls (revision)

"Smoking should be banned completely." In an ICM poll for the BBC (Feb 2006), 485 people agreed with this statement, and 501 disagreed. Let

X_i = 1 if the ith person agreed, 0 otherwise,

and assume X_1, ..., X_n are iid Bernoulli(p), with n = 986. So p̂ = x̄. By the CLT,

P( −z_{α/2} < (X̄ − p) / sqrt(p(1 − p)/n) < z_{α/2} ) ≈ 1 − α.

Estimating the variance p(1 − p)/n by p̂(1 − p̂)/n, we get an approximate 1 − α CI for p of

x̄ ± z_{α/2} sqrt(p̂(1 − p̂)/n).

For the above data, and using α = 0.05, an approximate 95% CI for p is (0.461, 0.523).

"Smoking should be banned in all public places but not completely." 645 agreed, 341 disagreed. This time the approximate 95% CI for p is (0.624, 0.684).

[Figures: chi-squared pdfs for r = 2, 4, 6 degrees of freedom; t-distribution pdfs for r = 1, 2, 5 together with the N(0, 1) pdf.]
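The intervals above are easy to reproduce; a sketch of the approximate (Wald) interval, with z = 1.96 for 95%:

```python
import math

# Approximate CI for a proportion: p_hat +/- z * sqrt(p_hat(1-p_hat)/n),
# reproducing the poll intervals above.
def wald_ci(x, n, z=1.96):
    p = x / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se
```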

Student's sleep data

Student = W. S. Gosset. Below is half of Student's sleep data (1908):

0.7, −1.6, −0.2, −1.2, −0.1, 3.4, 3.7, 0.8, 0.0, 2.0.

The data give the number of hours of sleep gained, by 10 patients, following a low dose of a drug. [The other half of the data give the sleep gained following a normal dose of the drug.] [Figure: normal Q-Q plot of the sleep data.]

A point estimate of the sleep gained is x̄ = 0.75 hours.

Treating the sample as iid N(µ, σ²), with µ and σ² unknown, a 95% CI for µ is

( x̄ ± t_{n−1}(α/2) s/sqrt(n) ) = (−0.53, 2.03)

using x̄ = 0.75, s² = 3.20, n = 10, α = 0.05, t_9(0.025) = 2.262. The value of t_9(0.025) comes from statistical tables, or from R.

Here, it would be incorrect to use a N(0, 1) distribution instead of a t_9. E.g. suppose we assume σ² = s² = 3.20 (the sample variance) and calculate the interval

( x̄ ± 1.96 sqrt(3.20/10) ) = (−0.36, 1.86).

The interval (−0.53, 2.03) obtained using the t_9 distribution is wider than the interval (−0.36, 1.86). The interval from the t_9 distribution is the correct one here. Since σ² is unknown, we need to estimate it (our estimate is s²). Since we are estimating σ², there is more uncertainty than if σ² were known, and the t_9 distribution correctly takes this uncertainty into account.
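A sketch reproducing the t-interval (the quantile t_9(0.025) = 2.262 is hard-coded from tables, as in the text):

```python
import math
import statistics

# 95% t-interval for the mean sleep gain (low-dose half of Student's data).
x = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
n = len(x)
xbar = statistics.mean(x)
s = statistics.stdev(x)          # sample standard deviation
t = 2.262                        # t_9(0.025), from tables
ci = (xbar - t * s / math.sqrt(n), xbar + t * s / math.sqrt(n))
```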

Sleep data (low dose)

Number of hours of sleep gained, by 10 patients: 0.7, −1.6, −0.2, −1.2, −0.1, 3.4, 3.7, 0.8, 0.0, 2.0.

Do the data support the conclusion that a low dose of the drug makes people sleep more, or not? We will start from the default position that the drug has no effect, and we will only reject this default position if the data contain sufficient evidence for us to reject it. So we would like to consider (i) the null hypothesis that the drug has no effect, and (ii) the alternative hypothesis that the drug makes people sleep more. We will denote the null hypothesis by H_0, and the alternative hypothesis by H_1.

Sleep data (normal dose)

The other half of the sleep data is the number of hours of sleep gained, by the same patients, following a normal dose of the drug: 1.9, 0.8, 1.1, 0.1, −0.1, 4.4, 5.5, 1.6, 4.6, 3.4. Is there evidence that a normal dose of the drug makes people sleep more than not taking a drug at all, or not?

t-test (one sample)

[Example from Dalgaard (2008).] Data on the daily energy intake (in kJ) of 11 women:

5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770.

Do these values deviate from a recommended value of 7725 kJ? We consider testing H_0 : µ = µ_0 against H_1 : µ ≠ µ_0, where µ_0 = 7725, and we make the standard assumptions for a t-test. We have

t_obs = (x̄ − µ_0) / (s / sqrt(n)) = −2.82.

The p-value is p = 2P(t ≥ |t_obs|) = 0.018. So we conclude that there is good evidence to reject the null hypothesis that the mean intake is 7725 kJ.
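The t statistic can be reproduced from the 11 intake values; a sketch (the p-value itself needs t-distribution tables or software):

```python
import math
import statistics

# One-sample t statistic for H0: mu = 7725 on the energy-intake data.
intake = [5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770]
n = len(intake)
t_obs = (statistics.mean(intake) - 7725) / (statistics.stdev(intake) / math.sqrt(n))
```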

Testing H_0 : µ = 7725 against H_− : µ < 7725, the p-value is p_− = P(t ≤ t_obs) = 0.009. Conclusion: there is good evidence to reject H_0 in favour of H_−.

Testing H_0 : µ = 7725 against H_+ : µ > 7725, the p-value is p_+ = P(t ≥ t_obs) = 0.991. Conclusion: there is no evidence to reject H_0 in favour of H_+.

t-test (two sample)

Darwin's Zea mays data: heights (in eighths of an inch) of young maize plants, crossed and self-fertilised. Are the heights of the two types of plant the same? [In fact, the plants were in pairs, one cross- and one self-fertilised plant in each pair; we ignore this pairing for now. We'll see how to deal with pairing later.] [Figure: heights by type of plant.]

Assume we have two independent samples, X_1, ..., X_m iid N(µ_X, σ²) and Y_1, ..., Y_n iid N(µ_Y, σ²), where σ² is unknown. Suppose we would like to test H_0 : µ_X = µ_Y against H_1 : µ_X ≠ µ_Y. Let

T = (X̄ − Ȳ) / (S sqrt(1/m + 1/n))

where

S² = (1/(m + n − 2)) [ sum_i (X_i − X̄)² + sum_i (Y_i − Ȳ)² ].

Assuming H_0 is true, we have T ~ t_{m+n−2}.

For the maize data, the observed value of T is

t = (x̄ − ȳ) / (s sqrt(1/m + 1/n)) = 2.44.

The alternative hypothesis (µ_X ≠ µ_Y) is two-sided, so the p-value of this test is p = 2P(t_{m+n−2} ≥ 2.44) = 0.021. Conclusion: there is good evidence to reject the null hypothesis µ_X = µ_Y.

t-test (paired)

Suppose we have pairs of RVs (X_i, Y_i), i = 1, ..., n. Let D_i = X_i − Y_i. Suppose D_1, ..., D_n are iid N(µ, σ²), with σ² unknown, and that we want to test a hypothesis about µ. We can use the test statistic

(D̄ − µ_0) / (S_D / sqrt(n))

which has a t_{n−1} distribution under H_0 : µ = µ_0. (Here, S²_D is the sample variance of the D_i.) The kind of situation where a paired test is used is when there are two measurements on the same experimental unit, e.g. in the sleep data, low and normal doses were given to the same patients.

Two sample t and paired t

Is the amount of sleep gained with a low dose the same as the amount gained with a normal dose? Using the low-dose observations (X), the normal-dose observations (Y), and the differences (D):

Two sample t-test of H_0 : µ_X = µ_Y against H_1 : µ_X ≠ µ_Y: the p-value is 0.079.
Paired t-test (of µ = 0), based on the differences D_i: the p-value is 0.0028.

The paired test uses the information that the observations are paired: i.e. we have one low and one normal dose observation per patient. The two sample test ignores this information. Prefer the paired test here. Could consider one-sided alternatives here.

Hypothesis testing and confidence intervals

For the maize data:

the 95% (equal-tail) confidence interval for µ_X − µ_Y is (3.34, 38.53) (see Sheet 2, Question 5)
when testing µ_X = µ_Y against µ_X ≠ µ_Y, the p-value is 0.021.

So, observe that (i) the p-value is less than 0.05, and (ii) the 95% confidence interval does not contain 0 (= the value of µ_X − µ_Y under H_0). (i) and (ii) both being true is not a coincidence: there is a connection between hypothesis tests and confidence intervals.
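A sketch computing both statistics on the sleep data (values as in R's sleep dataset; the corresponding p-values, roughly 0.079 and 0.0028, need t tables or software):

```python
import math
import statistics

low    = [0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, 0.8, 0.0, 2.0]
normal = [1.9,  0.8,  1.1,  0.1, -0.1, 4.4, 5.5, 1.6, 4.6, 3.4]
m, n = len(low), len(normal)

# Two-sample t statistic (pooled variance), ignoring the pairing.
sp2 = ((m - 1) * statistics.variance(low)
       + (n - 1) * statistics.variance(normal)) / (m + n - 2)
t_two = (statistics.mean(low) - statistics.mean(normal)) / math.sqrt(sp2 * (1 / m + 1 / n))

# Paired t statistic based on the differences D_i = X_i - Y_i.
d = [x - y for x, y in zip(low, normal)]
t_paired = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))
```

The paired statistic is much larger in magnitude, which is why the paired test gives the far smaller p-value.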

Insect traps

33 insect traps were set out across sand dunes, and the numbers of insects caught in a fixed time were counted (Gilchrist, 1984). The number of traps containing various numbers of the taxa Staphylinoidea were as follows.

Count      0   1   2   3   4   5   6
Frequency 10   9   5   5   1   2   1

Suppose X_1, ..., X_33 are iid Poisson(λ). Consider testing H_0 : λ = 1 against H_1 : λ = λ_1, where λ_1 > 1. The NP lemma leads to a test of the form: reject H_0 if sum x_i ≥ c. Under H_0, we have sum X_i ~ Poisson(33) exactly. However, instead of using this we can use a normal approximation. If the test has size α, then

α = P( sum X_i ≥ c | H_0 ) = P( (sum X_i − 33)/sqrt(33) ≥ (c − 33)/sqrt(33) | H_0 )

and, by the CLT, if H_0 is true then (sum X_i − 33)/sqrt(33) is approximately N(0, 1), so

α ≈ 1 − Φ( (c − 33)/sqrt(33) ).

Hence (c − 33)/sqrt(33) ≈ z_α, so c ≈ 33 + z_α sqrt(33). So we have a critical region C = {x : sum x_i ≥ 33 + z_α sqrt(33)}. Note that C does not depend on which value of λ_1 > 1 we are considering, so we actually have a UMP test of λ = 1 against λ > 1.

If α = 0.01 then c ≈ 47; if α = 0.001 then c ≈ 51. The observed value of sum x_i is 54. So in both cases the observed value of 54 is ≥ c, and in both cases we'd reject H_0.

An alternative way of thinking about this is to calculate the p-value:

p = P(we observe a value at least as extreme as 54 | H_0) = P( sum X_i ≥ 54 | H_0 ) ≈ 0.0005

which is very strong evidence for rejecting H_0. Note that a test of size α rejects H_0 if and only if p ≤ α. That is, the p-value is the smallest value of α for which H_0 would be rejected. (This is true generally, not just in this particular example.) In practice, no-one tells us a value of α; we have to judge the situation for ourselves. Our conclusion here is that there is very strong evidence for rejecting H_0.
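Both the approximate critical value and the exact Poisson p-value can be checked; a sketch (NormalDist supplies z_α):

```python
import math
from statistics import NormalDist

# Normal-approximation critical value c ~ 33 + z_alpha * sqrt(33).
def critical_value(alpha, lam=33):
    z = NormalDist().inv_cdf(1 - alpha)
    return lam + z * math.sqrt(lam)

# Exact tail probability P(sum X_i >= k0) when sum X_i ~ Poisson(lam),
# by summing the pmf up to k0 - 1.
def poisson_tail(k0, lam=33):
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for k in range(1, k0):
        term *= lam / k
        cdf += term
    return 1 - cdf
```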

Hardy-Weinberg equilibrium

In a sample from the Chinese population of Hong Kong, blood types occurred with the following frequencies (Rice, 1995):

Blood type   M    MN    N   Total
Frequency  342   500  187    1029

If gene frequencies are in Hardy-Weinberg equilibrium, then the probabilities of an individual having blood type M, MN, or N should be

P(M) = (1 − θ)²,   P(MN) = 2θ(1 − θ),   P(N) = θ².

The observed frequencies are (n_1, n_2, n_3) = (342, 500, 187), with total n = n_1 + n_2 + n_3 = 1029. The likelihood is

L(θ) ∝ [(1 − θ)²]^{n_1} [θ(1 − θ)]^{n_2} [θ²]^{n_3}

so the log-likelihood is

l(θ) = (2n_1 + n_2) log(1 − θ) + (n_2 + 2n_3) log θ + constant

from which we obtain

θ̂ = (n_2 + 2n_3) / (2n) = 0.425.

So π_1(θ̂) = (1 − θ̂)², π_2(θ̂) = 2θ̂(1 − θ̂), π_3(θ̂) = θ̂², and

Λ = 2 sum_i n_i log( n_i / (n π_i(θ̂)) ) = 0.032.

We compare Λ to a χ²_p where p = dim Θ − dim Θ_0 = (3 − 1) − 1 = 1. The value Λ = 0.032 is much less than E(χ²_1) = 1. The p-value is P(χ²_1 ≥ 0.032) = 0.86, so there is no reason to doubt the Hardy-Weinberg model. Pearson's chi-squared statistic leads to the same conclusion:

P = sum_i [n_i − n π_i(θ̂)]² / (n π_i(θ̂)) = 0.032.

Insect counts

(Bliss and Fisher, 1953.) [Example from Rice (1995).] From each of 6 apple trees in an orchard that had been sprayed, 25 leaves were selected. On each of the leaves, the number of adult female red mites was counted.

Number per leaf     0   1   2   3   4   5   6   7
Observed frequency 70  38  17  10   9   3   2   1

Does a Poisson(θ) model fit these data? As usual for a Poisson, θ̂ = x̄ = 1.147, and

π_i(θ̂) = θ̂^i e^{−θ̂} / i!,  i = 0, 1, ..., 7;   π_8(θ̂) = 1 − sum_{i=0}^{7} π_i(θ̂).

The expected frequency in cell i is n π_i(θ̂).
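A sketch reproducing the MLE and the likelihood-ratio statistic (frequencies as in Rice's example):

```python
import math

# MLE and likelihood-ratio statistic for the Hardy-Weinberg example.
n1, n2, n3 = 342, 500, 187
n = n1 + n2 + n3
theta = (n2 + 2 * n3) / (2 * n)          # maximises the log-likelihood
pi = [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]
Lam = 2 * sum(o * math.log(o / (n * p)) for o, p in zip([n1, n2, n3], pi))
```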

Some expected frequencies are very small:

Number per leaf     0     1     2     3    4    5    6     7    8+
Observed           70    38    17    10    9    3    2     1     0
Expected         47.7  54.6  31.3  12.0  3.4  0.8  0.2   0.0   0.0

The χ² approximation for the distribution of Λ applies when there are large counts. The usual rule of thumb is that the χ² approximation is good when the expected frequency in each cell is at least 5. To ensure this, we should pool some cells before calculating Λ or P. After pooling cells ≥ 3:

Number per leaf     0     1     2    ≥3
Observed           70    38    17    25
Expected         47.7  54.6  31.3  16.4

Then

Λ = 2 sum_i O_i log(O_i / E_i) = 26.6   and   P = sum_i (O_i − E_i)² / E_i = 26.6.

These are to be compared with a χ² with 4 − 1 − 1 = 2 degrees of freedom. The p-value is p = P(χ²_2 ≥ 26.6) ≈ 2 × 10⁻⁶, so there is clear evidence that a Poisson model is not suitable.

Hair and eye colour

The hair and eye colour of 592 statistics students at the University of Delaware were recorded (Snee, 1974); dataset HairEyeColor in R.

              Eye colour
Hair colour   Brown  Blue  Hazel  Green
Black            68    20     15      5
Brown           119    84     54     29
Red              26    17     14     14
Blond             7    94     10     16

Are hair colour and eye colour independent? [Figure: plot of the hair colour by eye colour table.]
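A sketch of the pooled fit (observed counts as in the Bliss and Fisher data; cells 3 and above are pooled so that no expected frequency is tiny):

```python
import math

# Goodness-of-fit statistics for the Poisson model on the mite counts,
# with cells 3 and above pooled.
counts = {0: 70, 1: 38, 2: 17, 3: 10, 4: 9, 5: 3, 6: 2, 7: 1}
n = sum(counts.values())                             # 150 leaves
theta = sum(k * f for k, f in counts.items()) / n    # MLE = sample mean

# Expected frequencies for cells 0, 1, 2 and "3 or more".
E = [n * math.exp(-theta) * theta ** k / math.factorial(k) for k in range(3)]
E.append(n - sum(E))
O = [counts[0], counts[1], counts[2], sum(f for k, f in counts.items() if k >= 3)]

Lam = 2 * sum(o * math.log(o / e) for o, e in zip(O, E))
P = sum((o - e) ** 2 / e for o, e in zip(O, E))
```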

Relation between hair and eye colour

Λ = 2 sum_{i=1}^{r} sum_{j=1}^{c} n_ij log( n_ij n / (n_i. n_.j) ) = 146.4.

dim(Θ) = 16 − 1 = 15;  dim(Θ_0) = (4 − 1) + (4 − 1) = 6.

Hence we compare Λ to a χ²_p where p = 15 − 6 = 9. The p-value is P(χ²_9 ≥ 146.4) ≈ 0. So there is overwhelming evidence of an association between hair colour and eye colour (i.e. overwhelming evidence that they are not independent). [Pearson's chi-squared statistic is P = 138.3.]

Bayesian inference

So far we have followed the classical/frequentist approach to statistics: we have treated unknown parameters as fixed constants, and we have imagined repeated sampling from our model in order to evaluate properties of estimators, interpret confidence intervals, calculate p-values, etc. We now take a different approach: in Bayesian inference, unknown parameters are treated as random variables.

In subjective Bayesian inference, probability is a measure of the strength of belief. Before any data are available, there is uncertainty about the parameter θ. Suppose uncertainty about θ is expressed as a prior pdf (or pmf) for θ. Then, once data are available, we can use Bayes' theorem to combine our prior beliefs with the data to obtain an updated, posterior, assessment of our beliefs about θ.

Example

Suppose we have a coin which we think might be a bit biased. Let θ be the probability of getting a head when we flip it.

Prior: Beta(5, 5). Data: 7 heads from 10 flips. [Figures: prior density and posterior density.]

Example (MRSA)

Let θ denote the number of MRSA infections per 1,000 bed-days in a hospital. Suppose we observe y = 20 infections in 40,000 bed-days, i.e. in 1,000N bed-days where N = 40. A simple estimate of θ is y/N = 0.5 infections per 1,000 bed-days. The MLE of θ is also θ̂ = 0.5 if we assume that y is an observation from a Poisson distribution with mean θN, so

f(y | θ) = (θN)^y e^{−θN} / y!.

However, other evidence about θ may exist. Suppose this other information, on its own, suggests plausible values of θ of about 1 per 1,000, with 95% of the support for θ lying between 0.5 and 1.7. We can use a prior distribution to describe this. A Gamma pdf is convenient here:

π(θ) = (β^α / Γ(α)) θ^{α−1} e^{−βθ}  for θ > 0.

Taking α = 10, β = 10 gives approximately the properties above.

The posterior combines the evidence from the data (i.e. the likelihood) and the other (i.e. prior) evidence. We can think of the posterior as a compromise between the likelihood and the prior. Calculated on board in lectures: the posterior is another Gamma.
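The conjugate update is one line; a sketch, assuming the Gamma(10, 10) prior that matches the stated prior mean of about 1 and 95% interval (0.5, 1.7):

```python
# A Poisson(theta * N) likelihood with a Gamma(alpha, beta) prior on theta is
# conjugate: the posterior is Gamma(alpha + y, beta + N).  Here alpha = beta = 10
# is the assumed prior, with the observed y = 20 infections and N = 40
# (thousands of bed-days).
def gamma_poisson_update(alpha, beta, y, N):
    return alpha + y, beta + N

a_post, b_post = gamma_poisson_update(10, 10, 20, 40)
posterior_mean = a_post / b_post  # a compromise between prior mean 1.0 and MLE 0.5
```

The posterior mean 0.6 sits between the prior mean 1.0 and the data estimate 0.5, closer to the data because N is large relative to β.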

[Figure: prior density, likelihood, and posterior density for the MRSA example; the posterior lies between the prior and the likelihood.]

Prior information

How do we choose a prior π(θ)?

(i) E.g. in the MRSA example, we might ask a scientific expert, who might anticipate θ around 1, say, with θ ∈ (0.5, 1.7) with probability 0.95.

(ii) We might have little prior knowledge, so we might want a prior that expresses prior ignorance. E.g. if a probability θ is unknown, we might consider the prior θ ~ U(0, 1).

(iii) In the Bernoulli/Beta and Poisson/Gamma examples (Section 4.1), the posterior was of the same form as the prior, i.e. Beta-Beta and Gamma-Gamma. This occurred because the likelihood and the prior had the same functional form: the prior and likelihood are said to be conjugate (convenient for doing calculations by hand).

Example

[Example from Carlin and Louis (2008).] Product P_0: old, standard. Product P_1: newer, more expensive. Assumptions:

the probability θ that a customer prefers P_1 has prior π(θ), which is Beta(a, b);
the number of customers X (out of n) that prefer P_1 is X ~ Binomial(n, θ).

Let's say θ ≥ 0.6 means that P_1 is a substantial improvement over P_0. So take H_1 : θ ≥ 0.6 and H_0 : θ < 0.6.

We consider 3 possible priors:

Jeffreys prior: θ ~ Beta(0.5, 0.5).
Uniform prior: θ ~ Beta(1, 1).
Sceptical prior: θ ~ Beta(2, 2), i.e. favours values of θ near 1/2.

[Figure: the three prior densities.]

Prior odds = P(H_1)/P(H_0), where

P(H_1) = ∫_{0.6}^{1} (1/B(a, b)) θ^{a−1} (1 − θ)^{b−1} dθ
P(H_0) = ∫_{0}^{0.6} (1/B(a, b)) θ^{a−1} (1 − θ)^{b−1} dθ.

Suppose we have x = 13 successes from n = 16 customers. Then (Section 4.1) the posterior π(θ | x) is Beta(x + a, n − x + b), with x = 13 and n = 16.

Posterior odds = P(H_1 | x)/P(H_0 | x), where

P(H_1 | x) = ∫_{0.6}^{1} (1/B(x + a, n − x + b)) θ^{x+a−1} (1 − θ)^{n−x+b−1} dθ
P(H_0 | x) = ∫_{0}^{0.6} (1/B(x + a, n − x + b)) θ^{x+a−1} (1 − θ)^{n−x+b−1} dθ.
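The two odds can be approximated numerically; a sketch using the trapezoid rule, with x = 13 and n = 16 as in the Carlin and Louis example (restricted to a, b ≥ 1, since the integrand is unbounded at the endpoints for the Jeffreys prior):

```python
import math

# P(theta >= 0.6) / P(theta < 0.6) for a Beta(a, b) distribution,
# by trapezoid-rule integration of the density (requires a, b >= 1).
def beta_tail(a, b, c=0.6, steps=20000):
    B = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # B(a, b)

    def f(t):
        return t ** (a - 1) * (1 - t) ** (b - 1) / B

    h = (1 - c) / steps
    s = 0.5 * (f(c) + f(1.0)) + sum(f(c + i * h) for i in range(1, steps))
    return s * h

def odds(a, b):
    p1 = beta_tail(a, b)   # P(theta >= 0.6)
    return p1 / (1 - p1)
```

With the uniform prior, for example, the prior odds are odds(1, 1) = 0.4/0.6 ≈ 0.67, and the posterior odds are odds(13 + 1, 16 − 13 + 1) ≈ 20, so the data shift the odds strongly towards H_1.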

[Figure: the three posterior densities: Beta(13.5, 3.5), Beta(14, 4), Beta(15, 5).]

[Table: prior odds, posterior odds, and Bayes factor for each of the priors Beta(0.5, 0.5), Beta(1, 1), Beta(2, 2).]

Conclusion: strong evidence for H_1.

Normal approx to posterior (1)

Prior: θ ~ U(0, 1). Bernoulli likelihood: x = 13 successes out of n = 16 trials. [Figure: exact (Beta) posterior density and its normal approximation.]

Normal approx to posterior (2)

Prior: θ ~ U(0, 1). Bernoulli likelihood: x = 13 successes out of n = 16 trials. [Figure: exact (Beta) posterior density and its normal approximation.]


More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Chapter 10. Hypothesis Testing (I)

Chapter 10. Hypothesis Testing (I) Chapter 10. Hypothesis Testing (I) Hypothesis Testing, together with statistical estimation, are the two most frequently used statistical inference methods. It addresses a different type of practical problems

More information

STT 843 Key to Homework 1 Spring 2018

STT 843 Key to Homework 1 Spring 2018 STT 843 Key to Homework Spring 208 Due date: Feb 4, 208 42 (a Because σ = 2, σ 22 = and ρ 2 = 05, we have σ 2 = ρ 2 σ σ22 = 2/2 Then, the mean and covariance of the bivariate normal is µ = ( 0 2 and Σ

More information

Math 475. Jimin Ding. August 29, Department of Mathematics Washington University in St. Louis jmding/math475/index.

Math 475. Jimin Ding. August 29, Department of Mathematics Washington University in St. Louis   jmding/math475/index. istical A istic istics : istical Department of Mathematics Washington University in St. Louis www.math.wustl.edu/ jmding/math475/index.html August 29, 2013 istical August 29, 2013 1 / 18 istical A istic

More information

Quiz 1. Name: Instructions: Closed book, notes, and no electronic devices.

Quiz 1. Name: Instructions: Closed book, notes, and no electronic devices. Quiz 1. Name: Instructions: Closed book, notes, and no electronic devices. 1.(10) What is usually true about a parameter of a model? A. It is a known number B. It is determined by the data C. It is an

More information

STAT Chapter 5 Continuous Distributions

STAT Chapter 5 Continuous Distributions STAT 270 - Chapter 5 Continuous Distributions June 27, 2012 Shirin Golchi () STAT270 June 27, 2012 1 / 59 Continuous rv s Definition: X is a continuous rv if it takes values in an interval, i.e., range

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Quantitative Introduction ro Risk and Uncertainty in Business Module 5: Hypothesis Testing

Quantitative Introduction ro Risk and Uncertainty in Business Module 5: Hypothesis Testing Quantitative Introduction ro Risk and Uncertainty in Business Module 5: Hypothesis Testing M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu October

More information

STAT 425: Introduction to Bayesian Analysis

STAT 425: Introduction to Bayesian Analysis STAT 425: Introduction to Bayesian Analysis Marina Vannucci Rice University, USA Fall 2017 Marina Vannucci (Rice University, USA) Bayesian Analysis (Part 1) Fall 2017 1 / 10 Lecture 7: Prior Types Subjective

More information

Stat 704 Data Analysis I Probability Review

Stat 704 Data Analysis I Probability Review 1 / 39 Stat 704 Data Analysis I Probability Review Dr. Yen-Yi Ho Department of Statistics, University of South Carolina A.3 Random Variables 2 / 39 def n: A random variable is defined as a function that

More information

Solutions to the Spring 2015 CAS Exam ST

Solutions to the Spring 2015 CAS Exam ST Solutions to the Spring 2015 CAS Exam ST (updated to include the CAS Final Answer Key of July 15) There were 25 questions in total, of equal value, on this 2.5 hour exam. There was a 10 minute reading

More information

Statistical Theory MT 2006 Problems 4: Solution sketches

Statistical Theory MT 2006 Problems 4: Solution sketches Statistical Theory MT 006 Problems 4: Solution sketches 1. Suppose that X has a Poisson distribution with unknown mean θ. Determine the conjugate prior, and associate posterior distribution, for θ. Determine

More information

McGill University. Faculty of Science. Department of Mathematics and Statistics. Part A Examination. Statistics: Theory Paper

McGill University. Faculty of Science. Department of Mathematics and Statistics. Part A Examination. Statistics: Theory Paper McGill University Faculty of Science Department of Mathematics and Statistics Part A Examination Statistics: Theory Paper Date: 10th May 2015 Instructions Time: 1pm-5pm Answer only two questions from Section

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics MAS 713 Chapter 8 Previous lecture: 1 Bayesian Inference 2 Decision theory 3 Bayesian Vs. Frequentist 4 Loss functions 5 Conjugate priors Any questions? Mathematical Statistics

More information

Lecture 1: Probability Fundamentals

Lecture 1: Probability Fundamentals Lecture 1: Probability Fundamentals IB Paper 7: Probability and Statistics Carl Edward Rasmussen Department of Engineering, University of Cambridge January 22nd, 2008 Rasmussen (CUED) Lecture 1: Probability

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Discrete Multivariate Statistics

Discrete Multivariate Statistics Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are

More information

A Very Brief Summary of Bayesian Inference, and Examples

A Very Brief Summary of Bayesian Inference, and Examples A Very Brief Summary of Bayesian Inference, and Examples Trinity Term 009 Prof Gesine Reinert Our starting point are data x = x 1, x,, x n, which we view as realisations of random variables X 1, X,, X

More information

Statistics Masters Comprehensive Exam March 21, 2003

Statistics Masters Comprehensive Exam March 21, 2003 Statistics Masters Comprehensive Exam March 21, 2003 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. 1 2 3 4 5 6 7 8 9 10 11 12 2. Write your answer

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

(1) Introduction to Bayesian statistics

(1) Introduction to Bayesian statistics Spring, 2018 A motivating example Student 1 will write down a number and then flip a coin If the flip is heads, they will honestly tell student 2 if the number is even or odd If the flip is tails, they

More information

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible

More information

MATH20802: STATISTICAL METHODS EXAMPLES

MATH20802: STATISTICAL METHODS EXAMPLES MATH20802: STATISTICAL METHODS EXAMPLES 1 1. If X N(µ, σ 2 ) show that its mgf is M X (t) = exp ( µt + σ2 t 2 2 2. If X 1 N(µ 1, σ 2 1 ) and X 2 N(µ 2, σ 2 2 ) are independent then show that ax 1 + bx

More information

Lecture 7: Confidence interval and Normal approximation

Lecture 7: Confidence interval and Normal approximation Lecture 7: Confidence interval and Normal approximation 26th of November 2015 Confidence interval 26th of November 2015 1 / 23 Random sample and uncertainty Example: we aim at estimating the average height

More information

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics DS-GA 100 Lecture notes 11 Fall 016 Bayesian statistics In the frequentist paradigm we model the data as realizations from a distribution that depends on deterministic parameters. In contrast, in Bayesian

More information

Statistical Theory MT 2007 Problems 4: Solution sketches

Statistical Theory MT 2007 Problems 4: Solution sketches Statistical Theory MT 007 Problems 4: Solution sketches 1. Consider a 1-parameter exponential family model with density f(x θ) = f(x)g(θ)exp{cφ(θ)h(x)}, x X. Suppose that the prior distribution has the

More information

Introduction to Applied Bayesian Modeling. ICPSR Day 4

Introduction to Applied Bayesian Modeling. ICPSR Day 4 Introduction to Applied Bayesian Modeling ICPSR Day 4 Simple Priors Remember Bayes Law: Where P(A) is the prior probability of A Simple prior Recall the test for disease example where we specified the

More information

STAT 418: Probability and Stochastic Processes

STAT 418: Probability and Stochastic Processes STAT 418: Probability and Stochastic Processes Spring 2016; Homework Assignments Latest updated on April 29, 2016 HW1 (Due on Jan. 21) Chapter 1 Problems 1, 8, 9, 10, 11, 18, 19, 26, 28, 30 Theoretical

More information

Probability and Statistics Notes

Probability and Statistics Notes Probability and Statistics Notes Chapter Five Jesse Crawford Department of Mathematics Tarleton State University Spring 2011 (Tarleton State University) Chapter Five Notes Spring 2011 1 / 37 Outline 1

More information

Lecture 10: Generalized likelihood ratio test

Lecture 10: Generalized likelihood ratio test Stat 200: Introduction to Statistical Inference Autumn 2018/19 Lecture 10: Generalized likelihood ratio test Lecturer: Art B. Owen October 25 Disclaimer: These notes have not been subjected to the usual

More information

Confidence Intervals for Normal Data Spring 2014

Confidence Intervals for Normal Data Spring 2014 Confidence Intervals for Normal Data 18.05 Spring 2014 Agenda Today Review of critical values and quantiles. Computing z, t, χ 2 confidence intervals for normal data. Conceptual view of confidence intervals.

More information

STATISTICS SYLLABUS UNIT I

STATISTICS SYLLABUS UNIT I STATISTICS SYLLABUS UNIT I (Probability Theory) Definition Classical and axiomatic approaches.laws of total and compound probability, conditional probability, Bayes Theorem. Random variable and its distribution

More information

Lecture 3. Univariate Bayesian inference: conjugate analysis

Lecture 3. Univariate Bayesian inference: conjugate analysis Summary Lecture 3. Univariate Bayesian inference: conjugate analysis 1. Posterior predictive distributions 2. Conjugate analysis for proportions 3. Posterior predictions for proportions 4. Conjugate analysis

More information

ECON Fundamentals of Probability

ECON Fundamentals of Probability ECON 351 - Fundamentals of Probability Maggie Jones 1 / 32 Random Variables A random variable is one that takes on numerical values, i.e. numerical summary of a random outcome e.g., prices, total GDP,

More information

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes Neyman-Pearson paradigm. Suppose that a researcher is interested in whether the new drug works. The process of determining whether the outcome of the experiment points to yes or no is called hypothesis

More information

Introduction to Bayesian Methods

Introduction to Bayesian Methods Introduction to Bayesian Methods Jessi Cisewski Department of Statistics Yale University Sagan Summer Workshop 2016 Our goal: introduction to Bayesian methods Likelihoods Priors: conjugate priors, non-informative

More information

Statistical Concepts

Statistical Concepts Statistical Concepts Ad Feelders May 19, 2015 1 Introduction Statistics is the science of collecting, organizing and drawing conclusions from data. How to properly produce and collect data is studied in

More information

Continuous random variables

Continuous random variables Continuous random variables Continuous r.v. s take an uncountably infinite number of possible values. Examples: Heights of people Weights of apples Diameters of bolts Life lengths of light-bulbs We cannot

More information

Random Variables and Their Distributions

Random Variables and Their Distributions Chapter 3 Random Variables and Their Distributions A random variable (r.v.) is a function that assigns one and only one numerical value to each simple event in an experiment. We will denote r.vs by capital

More information

18 Bivariate normal distribution I

18 Bivariate normal distribution I 8 Bivariate normal distribution I 8 Example Imagine firing arrows at a target Hopefully they will fall close to the target centre As we fire more arrows we find a high density near the centre and fewer

More information

STAT 414: Introduction to Probability Theory

STAT 414: Introduction to Probability Theory STAT 414: Introduction to Probability Theory Spring 2016; Homework Assignments Latest updated on April 29, 2016 HW1 (Due on Jan. 21) Chapter 1 Problems 1, 8, 9, 10, 11, 18, 19, 26, 28, 30 Theoretical Exercises

More information

9 Bayesian inference. 9.1 Subjective probability

9 Bayesian inference. 9.1 Subjective probability 9 Bayesian inference 1702-1761 9.1 Subjective probability This is probability regarded as degree of belief. A subjective probability of an event A is assessed as p if you are prepared to stake pm to win

More information

Statistics 135 Fall 2008 Final Exam

Statistics 135 Fall 2008 Final Exam Name: SID: Statistics 135 Fall 2008 Final Exam Show your work. The number of points each question is worth is shown at the beginning of the question. There are 10 problems. 1. [2] The normal equations

More information

Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama

Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama Instructions This exam has 7 pages in total, numbered 1 to 7. Make sure your exam has all the pages. This exam will be 2 hours

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Part 2: One-parameter models

Part 2: One-parameter models Part 2: One-parameter models 1 Bernoulli/binomial models Return to iid Y 1,...,Y n Bin(1, ). The sampling model/likelihood is p(y 1,...,y n ) = P y i (1 ) n P y i When combined with a prior p( ), Bayes

More information

MATH4427 Notebook 4 Fall Semester 2017/2018

MATH4427 Notebook 4 Fall Semester 2017/2018 MATH4427 Notebook 4 Fall Semester 2017/2018 prepared by Professor Jenny Baglivo c Copyright 2009-2018 by Jenny A. Baglivo. All Rights Reserved. 4 MATH4427 Notebook 4 3 4.1 K th Order Statistics and Their

More information

First Year Examination Department of Statistics, University of Florida

First Year Examination Department of Statistics, University of Florida First Year Examination Department of Statistics, University of Florida August 19, 010, 8:00 am - 1:00 noon Instructions: 1. You have four hours to answer questions in this examination.. You must show your

More information

This exam contains 13 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam.

This exam contains 13 pages (including this cover page) and 10 questions. A Formulae sheet is provided with the exam. Probability and Statistics FS 2017 Session Exam 22.08.2017 Time Limit: 180 Minutes Name: Student ID: This exam contains 13 pages (including this cover page) and 10 questions. A Formulae sheet is provided

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

HPD Intervals / Regions

HPD Intervals / Regions HPD Intervals / Regions The HPD region will be an interval when the posterior is unimodal. If the posterior is multimodal, the HPD region might be a discontiguous set. Picture: The set {θ : θ (1.5, 3.9)

More information

Compute f(x θ)f(θ) dθ

Compute f(x θ)f(θ) dθ Bayesian Updating: Continuous Priors 18.05 Spring 2014 b a Compute f(x θ)f(θ) dθ January 1, 2017 1 /26 Beta distribution Beta(a, b) has density (a + b 1)! f (θ) = θ a 1 (1 θ) b 1 (a 1)!(b 1)! http://mathlets.org/mathlets/beta-distribution/

More information

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences FINAL EXAMINATION, APRIL 2013

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences FINAL EXAMINATION, APRIL 2013 UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences FINAL EXAMINATION, APRIL 2013 STAB57H3 Introduction to Statistics Duration: 3 hours Last Name: First Name: Student number:

More information

Random vectors X 1 X 2. Recall that a random vector X = is made up of, say, k. X k. random variables.

Random vectors X 1 X 2. Recall that a random vector X = is made up of, say, k. X k. random variables. Random vectors Recall that a random vector X = X X 2 is made up of, say, k random variables X k A random vector has a joint distribution, eg a density f(x), that gives probabilities P(X A) = f(x)dx Just

More information

Normal (Gaussian) distribution The normal distribution is often relevant because of the Central Limit Theorem (CLT):

Normal (Gaussian) distribution The normal distribution is often relevant because of the Central Limit Theorem (CLT): Lecture Three Normal theory null distributions Normal (Gaussian) distribution The normal distribution is often relevant because of the Central Limit Theorem (CLT): A random variable which is a sum of many

More information

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

More information

Hypothesis Testing Chap 10p460

Hypothesis Testing Chap 10p460 Hypothesis Testing Chap 1p46 Elements of a statistical test p462 - Null hypothesis - Alternative hypothesis - Test Statistic - Rejection region Rejection Region p462 The rejection region (RR) specifies

More information

Summary of Chapters 7-9

Summary of Chapters 7-9 Summary of Chapters 7-9 Chapter 7. Interval Estimation 7.2. Confidence Intervals for Difference of Two Means Let X 1,, X n and Y 1, Y 2,, Y m be two independent random samples of sizes n and m from two

More information

Lecture 11. Multivariate Normal theory

Lecture 11. Multivariate Normal theory 10. Lecture 11. Multivariate Normal theory Lecture 11. Multivariate Normal theory 1 (1 1) 11. Multivariate Normal theory 11.1. Properties of means and covariances of vectors Properties of means and covariances

More information