Biostatistics Quantitative Data


Biostatistics: Quantitative Data
Descriptive Statistics; Statistical Models; One-sample and Two-sample Tests; Introduction to SAS-ANALYST; t- and Rank-Tests using ANALYST
Thomas Scheike

Quantitative Data

This course focuses on the analysis of quantitative data, which is encountered in many areas of experimental research. Data may roughly be grouped into three types:

Quantitative data: sperm concentration (mill/ml), height in cm, level of hormones (measured on a continuous scale).
Qualitative data: sex, race, occupation, groupings of quantitative data (high/medium/low).
Survival data: length of waiting time for some event. For some individuals the event is never recorded; these individuals are censored, which calls for special methods.

We concentrate on quantitative data and describe:

Descriptive techniques (histograms, scatter plots, means, standard deviations, quantiles, percentiles, ...).
Non-parametric methods. These are based on the ranks of the data and may be used for one-sample tests, two-sample tests (paired and unpaired), one-way analysis of variance, and measures of association (Spearman correlation).
Regression techniques for normally distributed residuals: the t-test (paired and unpaired), analysis of variance (one- and two-way), regression analysis, multiple regression analysis, and analysis of covariance.

We do not discuss how to deal with repeated measures, where subjects are followed and measured repeatedly. When repeated measures are encountered they may often be reduced to one summary number for each subject and then analysed by the techniques covered in this course.

Descriptive Statistics

We consider data on sperm concentration (mill/ml) for two groups of people in a study. One group are members of an association that promotes the development of organic agriculture (n=55); the other group are workers from a major Scandinavian airline carrier (n=141). How these data were collected is very important if we want to draw more general conclusions from them: the data for both groups must be representative of members of organic agriculture associations and of airline workers. This must be validated carefully, but for now we assume that it is the case. Plotting the data is the most important part of the statistical analysis.

The Histogram

The histogram is a different and better summary; it describes the distribution of the sperm concentrations for the two groups.

[Figure: histograms of sperm concentration for the organic farmers (eco) and the airline workers (sas).]

A histogram shows how the data are distributed, i.e., we can read off how many men have a sperm count lower than 100 mill/ml, say. For the airline workers this is 110 of 141 men, and 35 of the 55 organic farmers are under 100 mill/ml. The histogram is made by grouping the sperm concentrations and choosing the height of each bar such that

height × width = proportion in the group;

if all bars have the same width this rescaling is not important. A difficulty is to decide the width of the bars. Here are two different histograms of the same data drawn with different bar widths.

[Figure: two histograms of the concentration data with different bar widths.]
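The density-scaled histogram is easy to reproduce in R, which is what the axis labels of the course figures suggest was used. The original data file is not available here, so the two groups below are simulated stand-ins; only the scaling idea (bar height × bar width = proportion in the bar) carries over.

## Simulated stand-ins for the two groups of sperm concentrations (mill/ml);
## the real 'eco' and 'sas' vectors come from the course data set.
set.seed(1)
eco <- rlnorm(55,  meanlog = 4.2, sdlog = 0.96)   # hypothetical organic farmers
sas <- rlnorm(141, meanlog = 3.9, sdlog = 1.08)   # hypothetical airline workers

## Density-scaled histograms: height x width of each bar = proportion in the bar,
## so histograms with different bar widths remain comparable.
par(mfrow = c(1, 2))
hist(eco, freq = FALSE, main = "Organic farmers", xlab = "sperm concentration")
hist(sas, freq = FALSE, main = "Airline",         xlab = "sperm concentration")

## Counting how many men fall below 100 mill/ml:
sum(eco < 100)
sum(sas < 100)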

The Histogram

The histogram describes the variability of the data, and we can approximate the chance that a data point falls below some limit, above some limit, or between two limits by calculating the area of the histogram over the appropriate region:

area = chance (number in region / total number).

[Figure: density histogram of the concentration data.]

What is the probability of seeing a sperm concentration less than 40, say, for a randomly chosen man among the men in the study?

Percentiles

To describe the histogram we may find the data value for which 50% of the data are above or equal to it and 50% are below or equal to it; this is the median. After ordering the data by size, the median is the value in the middle; for an even number of data points it is the average of the two middle values:

1 4 6 8 9        median = 6
1 4 6 7 8 9      median = (6 + 7)/2 = 6.5

Similarly, the 25% percentile (quantile) is the data point for which at least 25% of the data points have a lower or equal value and at least 75% have a higher or equal value:

1 4 6 8 9        25% percentile = 4
1 4 6 7 8 9      25% percentile = 4

Find an approximate median in the histogram.
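These small examples can be checked directly in R. Note that R's quantile() offers several percentile definitions (its type argument); type = 1 matches the "at least 25% below or equal" rule used above, while the default interpolates and can differ slightly for small samples.

x1 <- c(1, 4, 6, 8, 9)
x2 <- c(1, 4, 6, 7, 8, 9)

median(x1)                              # 6
median(x2)                              # (6 + 7)/2 = 6.5

quantile(x1, probs = 0.25, type = 1)    # 25% percentile = 4
quantile(x2, probs = 0.25, type = 1)    # 25% percentile = 4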

Simple Summary Statistics

We can calculate the mean (average) and standard deviation for the two groups:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \text{Variance} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \mathrm{SD} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

The mean describes the midpoint of the data, and the standard deviation the spread of the data. These numbers can always be calculated. Symmetric distributions are well characterized by them, whereas a skewed distribution is not well described by them. If a distribution does not appear symmetric one should instead compute the median and various percentiles (25% and 75%, say) or give the range of the data (smallest and largest value).

For the sperm data the concentration was 77 (77) (mean (SD)); the median and range were 56 and [0, 402], respectively. Which numbers are best suited to describe how the sperm concentration varies?

The Histogram

The histogram based on the data is an approximation to the population that the data are a representative sample from. A particularly nice histogram curve is the normal distribution:

[Figure: the normal density curve.]

which is a good approximation to many symmetric histograms. Some properties of the normal curve:

The normal curve is symmetric around its mean.
It is completely described by its mean and SD.

By saying that data are normally distributed we mean that the histogram of the data is close to (well approximated by) the normal curve. Sometimes a transformation of the data is necessary to make this true.

[Figure: histogram of the concentrations on the cube-root scale, conc^0.3, with a normal curve superimposed.]
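As a quick check of the formulas, the sketch below computes the mean, variance and SD both by hand and with R's built-in functions (which use the same n − 1 divisor). The numbers are the subset of 10 farmers' concentrations used later in the signed-rank example.

x <- c(22, 36, 55, 58, 70, 74, 80, 89, 120, 200)

n    <- length(x)
xbar <- sum(x) / n                      # mean
s2   <- sum((x - xbar)^2) / (n - 1)     # variance with the n - 1 divisor

c(mean = xbar, variance = s2, SD = sqrt(s2))
c(mean(x), var(x), sd(x))               # identical results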

The Normal Distribution

Similarly to how we use the histogram, we can use the normal curve to work out how the data are distributed. The normal curve satisfies:

50% of the area lies below the mean.
95% of the area lies between [mean − 1.96 SD, mean + 1.96 SD].
68% of the area lies between [mean − 1 SD, mean + 1 SD].
2.5% of the area lies between [−∞, mean − 1.96 SD].

There are tables of the standard normal distribution, which has mean = 0 and SD = 1, and the area between two values for any other normal curve can be found from this table by converting the values to standard scores.

Example: The heights of Danish women are approximately normal with mean 165 cm and standard deviation 30 cm. If a woman is chosen at random, what is the chance that she is shorter than 180 cm?

Standard score = (180 − 165)/30 = 0.5, i.e., 180 is 0.5 standard deviations above the mean. The chance of being less than 0.5 in a standard normal distribution is about 0.69. Is this a reasonable statistical model?

What is the chance that a randomly chosen woman is between 175 and 190 cm? Converting to standard scores gives 0.33 and 0.83.

[Figure: histogram of heights, and the cumulative distribution pnorm(x), i.e., what percentage of the distribution lies below a given value.]

The statement may formally be written as:

P(X < 0.83) = 0.80, P(X < 0.33) = 0.63, and P(0.33 < X < 0.83) = P(X < 0.83) − P(X < 0.33) = 0.80 − 0.63 = 0.17.

This is based on the following precise statement about standard scores: if Z is normal with mean µ and variance σ², then (Z − µ)/σ is standard normal.
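These probabilities are easy to check in R, using the normal model assumed in the example (mean 165 cm, SD 30 cm):

pnorm(180, mean = 165, sd = 30)            # P(height < 180) ~ 0.69
pnorm(0.5)                                 # the same answer via the standard score

## P(175 < height < 190) via the standard scores 0.33 and 0.83:
pnorm(0.83) - pnorm(0.33)                  # ~ 0.80 - 0.63 = 0.17
pnorm(190, 165, 30) - pnorm(175, 165, 30)  # the same, without rounding the scores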

Distributions

We often draw histogram curves to show how the data are distributed (how they vary). How do these two histograms differ from the normal curve?

[Figure: a standard log-normal density and a multi-modal density.]

The first distribution is right skewed, i.e., data from this distribution contain some very high values. The other curve has several modes (it is multi-modal).

Example: Suppose that the sperm concentration in the Danish population is right skewed (log-normal with meanlog = 3 and sdlog = 1). If we draw 50 men at random from this distribution we get the following numbers:

[Figure: histogram of a sample of 50 from the log-normal distribution.]

calculations give mean = 27, SD = 29, median = 17, range = [2, 250].
Drawing again gives: mean = 34, SD = 27, median = 21, range = [4, 153]
and again: mean = 53, SD = 115, median = 16, range = [2, 287]
and again: mean = 26, SD = 31, median = 20, range = [2, 258].
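The repeated-sampling experiment is easy to redo; the sketch below assumes the right-skewed population is the log-normal with meanlog = 3 and sdlog = 1 indicated in the figure label, so the individual numbers will of course differ from those on the slide.

set.seed(2)
summarise <- function(x)
  round(c(mean = mean(x), SD = sd(x), median = median(x), min = min(x), max = max(x)))

## Four independent samples of 50 men from the right-skewed population:
replicate(4, summarise(rlnorm(50, meanlog = 3, sdlog = 1)))

## The mean and SD jump around from sample to sample, while the median is
## comparatively stable - the point made in the text.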

Example cont'd: Looking at the concentrations on the log scale, the population is distributed as a normal distribution with mean 3 and SD 1:

[Figure: the normal density with mean 3 and SD 1, and the histogram of the log of a sample of 50.]

Drawing 50 men randomly from the population gives: mean = 2.9, SD = 0.99, median = 2.8, range = [0.8, 5.5].
Drawing another random sample of 50 gives: mean = 3.0, SD = 0.85, median = 3.0, range = [1.3, 5.0]
and again: mean = 2.9, SD = 1.00, median = 2.9, range = [0.4, 5.6]
and again: mean = 3.1, SD = 1.07, median = 3.1, range = [0.7, 4.9].

We conclude that for the right-skewed data the mean and SD are highly variable, whereas for the normal data the mean and SD provide a very effective summary. The median is stable for both distributions.

Descriptive Statistics: Summary

The histogram shows how the data are distributed, i.e., how they vary. The area of the histogram represents frequency. The normal distribution is a histogram curve that is a good approximation to many histograms.

The mean and standard deviation are useful summaries of how data are distributed. They should be calculated only when the data are approximately normally distributed.

The median and range are useful summaries of how data are distributed. They should be calculated when the data are not (approximately) normally distributed.

Statistical Models

When a physical quantity is measured several times we will get different results due to measurement error and biological variation. For example, measuring the height of a subject repeatedly may yield the following histogram:

[Figure: histogram of repeated height measurements.]

What we see is variation around the average height. The variation is due to both measurement error and biological variation. Based on the above histogram it appears reasonable to claim that the variation may be described by a normal distribution. We may phrase this as a statistical model:

individual measurement = overall mean + noise.

If we call the individual measurements Y_i (the observed data), the overall mean µ (unknown), and the noise ǫ_i, we have

$$Y_i = \mu + \epsilon_i.$$

This is a statistical model that describes how the observed measurements arise. The model claims that the individual observations vary around a fixed value (µ) and that the variation is ǫ_i. A model contains two parts: a systematic part, which is of scientific interest, and a random variation part, which is due to biological variation and measurement error. To complete the specification of the model we also specify how the random variation ǫ_i varies; we do this by specifying its distribution. It is assumed that ǫ_i ∼ N(0, σ²), i.e., it is normal with mean 0 and variance σ².

Estimation in Statistical Models

In a statistical model one wishes to learn primarily about the parameters of the model. However, to understand what can be learned about them one must also study the variability present. Consider the statistical model

$$Y_i = \mu + \epsilon_i, \quad i = 1, \ldots, 200,$$

where the ǫ_i ∼ N(0, σ²) are independent noise terms. We want to know µ and σ. We may estimate these quantities by the sample average and standard deviation,

$$\bar{y} = \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} Y_i \qquad \text{and} \qquad \mathrm{SD} = \hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{y})^2}.$$

Looking at ȳ and using the statistical model we get that

$$\bar{y} = \hat{\mu} = \mu + \frac{1}{n}\sum_{i=1}^{n} \epsilon_i.$$

The last term is an average of independent N(0, σ²) noise terms, and mathematical arguments show that it is distributed as N(0, σ²/n). So we have described exactly what is known about µ in µ̂ by finding its distribution, N(µ, σ²/n). One way to think about this is that we have a description of how the sample average would vary if we repeated the sampling. The variance of the average is n times smaller than the variance of the individual noise terms.

[Figure: normal densities.]
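A small simulation illustrates the statement that the sample average varies as N(µ, σ²/n); the values of µ, σ and n below match the sperm example that follows, so the sketch is purely illustrative.

set.seed(3)
mu <- 3.9; sigma <- 0.95; n <- 200

## Repeat the sampling many times and look at how the sample average varies.
ybar <- replicate(2000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(ybar)         # close to mu
sd(ybar)           # close to sigma/sqrt(n)
sigma / sqrt(n)    # theoretical SD of the average, about 0.07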

Sperm analysis

Scientific interest lies in the level of sperm concentration in the Danish population. We have a representative sample from the population, and we wish to see whether the level in Denmark equals what the WHO considers the minimum level (20 mill/ml). A sample of 200 Danish men looks like this:

[Figure: histogram for the log of the sample of 200 men.]

The log-transformed data appear to follow a normal distribution. A statistical model is now proposed to describe how the population varies, containing a systematic part (µ), the average log(sperm concentration) in the population, and a random variation part ǫ_i, which is independent normal random variation N(0, σ²):

$$Y_i = \mu + \epsilon_i, \quad i = 1, \ldots, 200.$$

We do not know µ and σ. We may estimate these quantities by the sample average and standard deviation:

$$\bar{y} = \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} Y_i = 3.9 \qquad \text{and} \qquad \mathrm{SD} = \hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{y})^2} = 0.95.$$

This means that our best guess is that the population has mean 3.9 and that the level of random variation is described by a normal distribution with standard deviation 0.95.

Sperm analysis, cont'd

Drawing the best guess at how the population is distributed against the histogram

[Figure: histogram of the log concentrations with the fitted normal curve (histogram and normal approximation).]

we see that the histogram and the normal curve approximate each other well, so the statistical model is validated. This means that we have a reasonable description of the level of random variation and a reasonable description of the systematic variation.

We wish to investigate whether the data are consistent with the null hypothesis H_0: µ = log(20); if this is not so, we are left with the alternative H_A: µ ≠ log(20). Being consistent with the null hypothesis is, in statistical terms, equivalent to checking whether the data could arise when the null distribution is true. The null hypothesis claims that the data are distributed around log(20), and if we use the description of the variation found above, the data should arise as a random sample from the left-hand curve:

[Figure: the distribution under the null (centred at log(20)) and, as the right-hand curve, the normal approximation to the observed data.]

Formally we write

$$Y_i = \log(20) + \epsilon_i, \quad i = 1, \ldots, 200, \qquad \epsilon_i \sim N(0, 0.95^2).$$

Sperm analysis, cont'd

The question now is: how well does this fit with the average we found in our data, 3.9? The sample average is distributed as N(µ, σ²/n), so if H_0 is true the sample average varies around log(20) with standard deviation σ/√n, which we estimate as 0.95/√200 ≈ 0.07. Thus our guess at how the average varies under the null is N(log(20), 0.07²).

[Figure: the distribution of the mean under the null, centred at log(20), next to the normal approximation to the data.]

How well does this fit with the data?

Sperm analysis, the t-test

To summarize further how the observed sample average compares with the null hypothesis, we calculate how many standard errors it is away from the null value:

$$T = \frac{\bar{y} - \log(20)}{\mathrm{SD}/\sqrt{n}} \approx 13,$$

which is t-distributed with n − 1 = 199 degrees of freedom (p < 0.0001). We define SEM = SD/√n, the standard error of the mean. A t-distribution varies slightly more than a normal distribution

[Figure: t-distributions with 199, 19 and 9 degrees of freedom compared with the standard normal curve.]

because we had only a variable guess at the SD of the population. Note that the t-test is of the form

$$T = \frac{\text{observed} - \text{expected}}{\text{standard error of observed}}.$$

We now calculate the chance of getting a test statistic as extreme as or more extreme than the observed one. The chance is computed under the null H_0 (the p-value). The smaller this chance, the more evidence against the null. If the p-value is less than 5% we reject the null (at the 5% level).
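The same calculation can be sketched in R from the rounded summary statistics quoted above (ȳ = 3.9, SD = 0.95, n = 200); with the raw data one would simply call t.test(y, mu = log(20)).

ybar <- 3.9; SD <- 0.95; n <- 200
SEM  <- SD / sqrt(n)                                  # standard error of the mean

T <- (ybar - log(20)) / SEM                           # number of SEMs above log(20)
p <- 2 * pt(abs(T), df = n - 1, lower.tail = FALSE)   # two-sided p-value
c(T = T, p = p)                                       # T ~ 13, p far below 0.0001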

Statistical Models

The random variation in a statistical model is described by a distribution, often a normal distribution. The random variation may consist of several components depending on the context. Different sources may be:

Measurement error.
Inter-individual variation.
Intra-individual variation.
Variation over time.

Statistical Models, Summary

The recipe for a statistical analysis:

A scientific hypothesis is formulated.
Make graphs of the data to get a feel for the data and their variability.
A statistical model is proposed and validated. The systematic variation contains the parameters about which the scientific hypotheses are formulated; the random variation is described as normal, N(0, σ²).
Inference about the parameters is drawn in the statistical model.

The random variation is not the object of interest, but we must nevertheless specify a reasonable model for it in order to understand correctly how much can be learned about the systematic part of the variation.

One-sample Comparisons, the t-test

Consider the 55 ecological farmers and the 141 airline workers:

[Figure: histograms of the raw concentrations for the organic farmers and the airline workers.]

We now wish to investigate whether the sperm level for the group of ecological farmers equals the level 40 mill/ml (found in the literature). A statistical model is

$$Y_i = \mu + \epsilon_i, \quad i = 1, \ldots, 55,$$

where the ǫ_i ∼ N(0, σ²) are independent noise terms. We know that the data are approximately normal when considered on a log scale

[Figure: histograms of log(eco) and log(sas).]

and therefore investigate the scientific hypothesis on this scale. Estimate µ and σ by the sample average and sample standard deviation:

$$\bar{y} = \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} Y_i = 4.2, \qquad \mathrm{SD} = \hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{y})^2} = 0.96.$$

The t-test

The one-sample t-test considers the hypothesis H_0: µ = log(40) versus H_A: µ ≠ log(40). The null claims that what we see is a sample from a population that varies symmetrically around log(40). The t-test for H_0 is

$$T = \frac{\bar{y} - \log(40)}{\mathrm{SEM}} = \frac{0.51}{0.14} = 3.6,$$

which should be looked up in the t-distribution with 55 − 1 = 54 degrees of freedom, where SEM = SD/√n. We get a p-value of 0.001. Thus, if the null were true and we drew 55 men from the population, we would get an average as different from log(40) as the observed one, or more so, with a chance of 0.001. We conclude that the sperm level is significantly higher than 40 mill/ml in the population of ecological farmers.

A 95% confidence interval, i.e. the mean values we cannot reject by a 5% test, is

$$(\hat{\mu} - 1.96\,\mathrm{SD}/\sqrt{n},\; \hat{\mu} + 1.96\,\mathrm{SD}/\sqrt{n}) = (4.2 - 1.96 \cdot 0.14,\; 4.2 + 1.96 \cdot 0.14) = (3.9, 4.4).$$

This is the range of values for the mean of the log sperm concentration that we believe in.
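In R the corresponding call would be a one-sample t.test on the log scale; eco here stands for the vector of the 55 farmers' concentrations, which is not reproduced in these notes, so the code is a sketch.

res <- t.test(log(eco), mu = log(40))   # test of H0: mu = log(40) on the log scale

res$statistic      # in the lecture the real data give T ~ 3.6
res$p.value        # ~ 0.001
res$conf.int       # 95% confidence interval for the mean log concentration
exp(res$conf.int)  # back-transformed to mill/ml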

A Non-parametric One-sample Test, the Signed-Rank Test

Non-parametric techniques avoid the assumption of normally distributed residuals and instead ask questions about the median of the population. Still looking at the ecological farmers, we now take a subset of 10 men:

22 36 55 58 70 74 80 89 120 200

and wish to test whether they vary symmetrically around 40 mill/ml. We do not specify a detailed statistical model but test, for the differences from 40,

H_0: the distribution is symmetric around 0 versus H_A: the distribution is not symmetric (skewed, for example).

We make a Wilcoxon one-sample test, a signed-rank test. Subtracting 40 from each of the sperm levels we get

−18 −4 15 18 30 34 40 49 80 160

with ranks (ordering by absolute size and assigning average ranks to ties)

3.5 1 2 3.5 5 6 7 8 9 10

We check whether the sum of the ranks of the negative values is as big as that of the positive values, as it should be under symmetry. The rank sum of the negative numbers is 1 + 3.5 = 4.5. We look this up in a statistical table; the p-value satisfies 0.01 < p < 0.02. Doing the test on all the data gives a p-value of 0.001.

One may use a normal approximation to compute the p-value, i.e., compute µ = n(n + 1)/4 and σ = √(n(n + 1)(2n + 1)/24), and

Z = (T − µ)/σ

for n > 20. For smaller values of n, use a table.

Two-sample Comparisons, the t-test

Consider the 55 ecological farmers and the 141 airline workers on a log scale:

[Figure: histograms of log(eco) and log(sas).]

One may want to know whether these two groups could really be varying around the same level, so that the differences we see are due to random variation. We start by proposing a statistical model in which we can answer the question:

$$Y_{i,j} = \mu_i + \epsilon_{i,j}, \quad i = 1, 2, \quad j = 1, \ldots, n_i,$$

where the ǫ_{i,j} ∼ N(0, σ_i²) are independent noise terms. The histograms of the data show that the model is a good description of the data on the log scale. Estimating the mean and variability in the two populations underlying the samples gives

µ̂₁ = 3.9, σ̂₁² = 1.08 and µ̂₂ = 4.2, σ̂₂² = 0.90.
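The same signed-rank test in R (the tied absolute differences 18 and −18 make R fall back on a normal approximation with a warning; the table-based result quoted above is 0.01 < p < 0.02):

x <- c(22, 36, 55, 58, 70, 74, 80, 89, 120, 200)

wilcox.test(x, mu = 40)     # one-sample Wilcoxon signed-rank test of symmetry around 40

## The building blocks, matching the hand calculation above:
d <- x - 40
r <- rank(abs(d))           # ranks of the absolute differences, average ranks for ties
sum(r[d < 0])               # rank sum of the negative differences: 4.5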

Two-sample Comparisons, the t-test

To carry out a two-sample t-test we first need to check whether the variability is the same in the two groups. We test H_0: σ₁ = σ₂ versus H_A: σ₁ ≠ σ₂ using the test statistic

$$F = \frac{\max(\hat{\sigma}_1^2, \hat{\sigma}_2^2)}{\min(\hat{\sigma}_1^2, \hat{\sigma}_2^2)} = 1.27,$$

which we look up in the F distribution with (140, 54) degrees of freedom (p = 0.32). So we accept the hypothesis of equal variances. Now we can calculate a combined estimate of the variability,

$$\mathrm{SD}^2 = \frac{(n_1 - 1)\hat{\sigma}_1^2 + (n_2 - 1)\hat{\sigma}_2^2}{(n_1 - 1) + (n_2 - 1)} = 1.02.$$

With the combined variability estimate SD we can proceed to the two-sample t-test of H_0: µ₁ = µ₂ versus H_A: µ₁ ≠ µ₂ (writing group 2, the farmers, minus group 1, the airline workers),

$$T = \frac{\bar{y}_2 - \bar{y}_1}{\mathrm{SD}\sqrt{(1/n_1) + (1/n_2)}} = 2.82,$$

which we look up in the t-distribution with n₁ + n₂ − 2 = f₁ + f₂ degrees of freedom (p = 0.006). We conclude that the ecological farmers have a significantly higher sperm level than the airline workers.

A 95% confidence interval for the difference in means between the two groups is given by

$$(\bar{y}_2 - \bar{y}_1 - 1.96\,\mathrm{SED},\; \bar{y}_2 - \bar{y}_1 + 1.96\,\mathrm{SED}) \approx (0.3 - 2 \cdot 0.1,\; 0.3 + 2 \cdot 0.1) = (0.1, 0.5),$$

where SED = SD√((1/n₁) + (1/n₂)).

Non-parametric Two-sample Comparison, the Rank Test

The non-parametric rank test is also called the Wilcoxon-Mann-Whitney test. Consider two groups of data as before. We now wish to test whether the distributions of the two populations could be equal, or whether this must be rejected by a test. The statistical model is:

Y_{i,j} follows an arbitrary distribution F_i(·).
All data points are independent.

In this non-parametric model we wish to test H_0: the distributions are the same versus H_A: the distributions are not the same. We calculate a test statistic as follows:

Pool all the data and assign ranks.
Sum the ranks of the smallest group.
Look the rank sum up in a statistical table to get the p-value.

The sum of ranks, T, for the ecological farmers is 6342 (the total sum of ranks is 19306, and the expected sum under H_0 is 19306 · (55/196) = 5417.5), which results in a p-value of 0.0096 (computer program).

One may use a normal approximation to compute the p-value, i.e., compute µ = n₁(n₁ + n₂ + 1)/2 (= 5417.5) and σ = √(n₂ µ/6) (≈ 357), and

Z = (T − µ)/σ

for n₁, n₂ > 10. For smaller values, use a table.
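The whole two-sample sequence can be sketched in R; eco and sas again stand for the two concentration vectors (not reproduced here), and zeros are removed before taking logs as in the course figures.

leco <- log(eco[eco > 0]); lsas <- log(sas[sas > 0])

var.test(leco, lsas)                  # F-test of equal variances (F ~ 1.27, p ~ 0.32)
t.test(leco, lsas, var.equal = TRUE)  # pooled two-sample t-test (p ~ 0.006-0.008)
t.test(leco, lsas)                    # Welch version, if equal variances are in doubt

wilcox.test(eco, sas)                 # Wilcoxon-Mann-Whitney rank-sum test (p ~ 0.01)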

Paired Comparisons

When data are paired, the two measurements are often not independent:

Measuring the right and left biceps.
Growth before and after treatment.
Heights of men and women when sampled as couples.

With only two correlated measurements the data may still be analysed by simple techniques. A correct analysis is obtained by making a one-sample analysis of the differences: the differences between the before and after measurements are independent among subjects. Therefore one should simply test whether the differences vary around 0, by either a t-test or a signed-rank test.

When investigating the effect of some drug that prevents sunburn, say, we could apply the sun lotion to one arm and placebo to the other. The difference between the arms may then be ascribed to the lotion. The difference is a measure that is corrected for inter-individual variation, which may be large. (A small sketch of such a paired analysis is given after the summary below.)

Summary

Make graphs of the data.

One-sample test: When the variation is approximately normal, the t-test may be used to test a hypothesis about the mean of the underlying population. The p-value provided is only valid if the variation is approximately normal. A nice summary of the data is provided by the confidence interval for the mean. When the data are not normally distributed and interest is concentrated on inference rather than estimates, the signed-rank test may be used. This test is always valid, but no confidence intervals are provided. Right-skewed data may be transformed towards normality by transformations such as √x, x^(1/3), or log(x).

Two-sample test: Two groups of data may be compared by the t-test when the variation is approximately normal and the variance of the residual variation is equal in the two groups. A nice summary of the difference between the groups is given by the confidence interval for the difference between the means. When the data are not normally distributed and interest is concentrated on inference rather than estimates, the rank test may be used. This test is always valid, but no confidence intervals are provided.

Paired data are handled by one-sample techniques applied to the differences between the pairs.
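A minimal sketch of such a paired analysis in R, with invented before/after measurements purely for illustration:

before <- c(12.1, 9.8, 11.4, 10.2, 13.5, 9.9, 12.8, 11.1)   # hypothetical data
after  <- c(13.0, 10.4, 11.9, 10.1, 14.2, 10.9, 13.1, 11.9)

d <- after - before                    # the differences are independent across subjects

t.test(d)                              # one-sample t-test: do the differences vary around 0?
t.test(after, before, paired = TRUE)   # identical test, phrased as a paired t-test
wilcox.test(d)                         # signed-rank alternative when d is clearly non-normal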

Statistical Analysis using Analyst (SAS)

Analyst is a Windows-based application in the SAS statistical software. SAS is activated by clicking

start → statistik → SAS

in the lower left-hand corner. Analyst is activated via

solutions → analysis → Analyst

Commands will be presented as we need them for the various analyses; remember that the focus is on the statistical analyses rather than on how one does this and that.

We consider data on sperm concentration (mill/ml) for two groups of people in a study. One group are members of an association that promotes the development of organic agriculture (n=55), and the other group are workers from a major Scandinavian airline carrier (n=141). The data set contains the following variables:

obs      observation number
abstime  length of abstinence in days
age      age of subject
s1e2     group indicator
conc     sperm concentration (mill/ml)
volume   volume of sperm sample (ml)

The data are loaded with

file → open...

from n:\human\oeko, which is a SAS data set. Doing this, the data will appear in the data table; it consists of a record for each subject with the variables described above. To make your own new variables when you work with the data you must create your own version of the data set. You do this by saving your own version under a new name:

File → Save...

and then typing a new name, e.g. oeko12 if you are in front of machine 12.

Data Manipulations

A little bit of data manipulation is needed. New transformed variables are constructed by setting the data table in edit mode,

edit → mode → edit

and then

data → transform → compute...

Now type the new variable name (e.g. conc3) and an expression that defines the new variable in the box below the equality sign (e.g. conc**.3333). A new variable called conc3, equal to the concentrations on the cube-root scale, is defined and appears in the data table.

To group a continuous variable according to its value and define a classification variable based on it:

data → transform → recode ranges...

In the recode dialog give the column name (volume) and the name of the new grouped version (gvol) and click ok. In the next window give the bounds 0,3; 3,4; and 4,15 for the first three groups, name them (1, 2, 3) in the rightmost column, and click ok.

To delete a variable, highlight the column in the data table and use

edit → delete

Alternatively, one may apply one of the standard transformations (such as √conc) after highlighting the column one wishes to transform, via

data → transform

To make a variable that can be used for the one-sample test (e.g. dl40 = lconc − log(40)):

data → transform → compute...

Now type the new variable name (dl40) and the expression that defines the new variable in the box below the equality sign: log(conc)-log(40).

To construct a subset of the data, e.g. one of the two groups for a separate analysis:

data → filter → subset data...

In the subset dialog you can apply a Where clause to the data (click s1e2, eq, and constant value followed by 1 to select s1e2=1, the airline workers; use 2 for the ecological farmers).
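For readers working outside ANALYST, the same derived variables can be sketched in R, assuming a data frame oeko with the columns listed above:

oeko$conc3 <- oeko$conc^(1/3)                   # concentrations on the cube-root scale
oeko$lconc <- log(oeko$conc)                    # log scale (only meaningful for conc > 0)
oeko$dl40  <- oeko$lconc - log(40)              # variable for the one-sample test
oeko$gvol  <- cut(oeko$volume, breaks = c(0, 3, 4, 15),
                  labels = c(1, 2, 3))          # grouped version of volume

sas <- oeko$conc[oeko$s1e2 == 1]                # airline workers' concentrations
eco <- oeko$conc[oeko$s1e2 == 2]                # ecological farmers' concentrations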

Histograms

To make a histogram of the concentration (conc):

graphs → histogram...

Select conc as the analysis variable and s1e2 as the class variable (the classification variable); if the class variable is omitted, no classification variable will be used. Clicking ok then does the job.

[Figure: histograms of conc for the organic farmers (eco) and the airline workers (sas).]

Simple Descriptive Statistics

To compute means, standard deviations, variances, medians and percentiles as well as the range:

statistics → descriptive → distributions...

Select conc as the analysis variable and s1e2 as the class variable (the classification variable). Clicking ok then does the job.

To examine the normality of a variable one may draw the normal distribution on the same plot: click fit in the distributions dialog, select normal and ok in the fit dialog, and then click ok in the distributions dialog.

Output:

----------------------------- S1E2=1 ------------------------------
Univariate Procedure

Variable=CONC

Moments
N            141        Sum Wgts   141
Mean         69.16461   Sum        9752.21
Std Dev      70.17659   Variance   4924.753
Skewness     2.172157   Kurtosis   5.780222
USS          1363973    CSS        689465.5
CV           101.4631   Std Mean   5.909935
T:Mean=0     11.70311   Pr>|T|     0.0001
Num ^= 0     139        Num > 0    139
M(Sign)      69.5       Pr>=|M|    0.0001
Sgn Rank     4865       Pr>=|S|    0.0001

Quantiles(Def=5)
100% Max   402      99%   358
 75% Q3     91      95%   209
 50% Med    48      90%   158
 25% Q1     23      10%    12
  0% Min     0       5%   3.3
                     1%     0
Range      402
Q3-Q1       68
Mode        12

Extremes
Lowest   Obs     Highest   Obs
0        (40)    233       (92)
0        (1)     284       (102)
0.75     (67)    308       (32)
1.88     (60)    358       (104)
2.3      (132)   402       (69)

----------------------------- S1E2=2 ------------------------------
Univariate Procedure

Variable=CONC

Moments
N            55         Sum Wgts   55
Mean         99.04727   Sum        5447.6
Std Dev      86.39382   Variance   7463.891
Skewness     1.339362   Kurtosis   1.118476
USS          942620.1   CSS        403050.1
CV           87.22483   Std Mean   11.64934
T:Mean=0     8.502394   Pr>|T|     0.0001
Num ^= 0     54         Num > 0    54
M(Sign)      27         Pr>=|M|    0.0001
Sgn Rank     742.5      Pr>=|S|    0.0001

Quantiles(Def=5)
100% Max   354      99%   354
 75% Q3    138      95%   297
 50% Med    69      90%   259
 25% Q1     33      10%    15
  0% Min     0       5%   9.1
                     1%     0
Range      354
Q3-Q1      105
Mode        69

Extremes
Lowest   Obs     Highest   Obs
0        (40)    264       (32)
5.5      (15)    264       (33)
9.1      (35)    297       (47)
11       (42)    322       (14)
14       (10)    354       (51)

One-sample T-test and Signed Rank Test

We wish to examine whether the hypothesis that the sperm level varies around 40 mill/ml can be statistically rejected or validated. To make a one-sample t-test, first transform to the log scale to obtain approximate normality and then compute a new variable dl40 = lconc − log(40) (see above). Now,

statistics → descriptive → distributions...

selecting the variable dl40 and with class variable equal to s1e2 does the job.

Output:

----------------------------- S1E2=1 ------------------------------
Univariate Procedure

Variable=DL40

Moments
N            139        Sum Wgts   139
Mean         0.091883   Sum        12.7718
Std Dev      1.080798   Variance   1.168125
Skewness     -0.79219   Kurtosis   1.421361
USS          162.3748   CSS        161.2013
CV           1176.271   Std Mean   0.091672
T:Mean=0     1.002305   Pr>|T|     0.3180
Num ^= 0     139        Num > 0    79
M(Sign)      9.5        Pr>=|M|    0.1265
Sgn Rank     816        Pr>=|S|    0.0862

Quantiles(Def=5)
100% Max   2.307573    99%   2.191654
 75% Q3    0.832909    95%   1.704748
 50% Med   0.182322    90%   1.373716
 25% Q1   -0.51083     10%  -1.20397
  0% Min  -3.97656      5%  -1.63476
                         1%  -3.05761
Range      6.284134
Q3-Q1      1.343735
Mode      -1.20397

Extremes
Lowest      Obs     Highest    Obs
-3.97656    (67)    1.762159   (92)
-3.05761    (60)    1.960095   (102)
-2.85597    (132)   2.04122    (32)
-2.69563    (111)   2.191654   (104)
-2.56395    (49)    2.307573   (69)

Missing Value    .
Count            2
% Count/Nobs     1.42

----------------------------- S1E2=2 ------------------------------
Univariate Procedure

Variable=DL40

Moments
N            54         Sum Wgts   54
Mean         0.541879   Sum        29.26144
Std Dev      0.958596   Variance   0.918905
Skewness     -0.50364   Kurtosis   -0.00368
USS          64.55813   CSS        48.70198
CV           176.9023   Std Mean   0.130448
T:Mean=0     4.153971   Pr>|T|     0.0001
Num ^= 0     54         Num > 0    41
M(Sign)      14         Pr>=|M|    0.0002
Sgn Rank     428.5      Pr>=|S|    0.0001

Quantiles(Def=5)
100% Max   2.180417    99%   2.180417
 75% Q3    1.238374    95%   2.004853
 50% Med   0.545227    90%   1.867949
 25% Q1    0.09531     10%  -0.85567
  0% Min  -1.98413      5%  -1.29098
                         1%  -1.98413
Range      4.164549
Q3-Q1      1.143064
Mode       0.545227

Extremes
Lowest      Obs    Highest    Obs
-1.98413    (15)   1.88707    (32)
-1.48061    (35)   1.88707    (33)
-1.29098    (42)   2.004853   (47)
-1.04982    (10)   2.085672   (14)
-0.98083    (29)   2.180417   (51)

Missing Value    .
Count            1
% Count/Nobs     1.82

One-sample T-test

Alternatively, one may use a special menu designed especially for the one-sample t-test:

statistics → hypothesis tests → One-sample t-test...

selecting the variable lconc and entering the mean we wish to test as 4. Note that the t-test should be carried out only for the group of ecological farmers, say, and that the active data set therefore should contain only this group. To make the test it is necessary to construct a new data set consisting of the group of interest, as done in the data manipulation section above.

Output:

One Sample T Test for a Mean

Sample Statistics for LCONC
N      Mean    Std. Dev.    Std. Error
-------------------------------------------------
193    3.91    1.07         0.08

Hypothesis Test
Null hypothesis:  Mean of LCONC = 4
Alternative:      Mean of LCONC ^= 4

t Statistic    Df     Prob > |t|
---------------------------------
-1.217         192    0.2249

To make the t-test for the two groups separately, you can specify under the variables button that you want it done for the two groups, by giving s1e2 as the by variable.

Two-sample T-test for Means (un-paired data)

To compare the concentrations for the two groups:

statistics → hypothesis tests → Two-sample t-test for means...

selecting the variable lconc and the group variable s1e2.

Output:

Two Sample T Test for the Means of LCONC within S1E2

Sample Statistics
Group    N      Mean        Std. Dev.    Std. Error
--------------------------------------------------
1        139    3.780763    1.0808       0.0917
2        54     4.230758    0.9586       0.1304

Hypothesis Test
Null hypothesis:  Mean 1 - Mean 2 = 0
Alternative:      Mean 1 - Mean 2 ^= 0

If Variances Are    t statistic    Df        Pr > |t|
----------------------------------------------------
Equal               -2.677         191       0.0081
Not Equal           -2.822         108.14    0.0057

It is useful to supplement the analysis with some plots; try for example the plots button and select one of the plots. The conclusions are based on an assumption of equal variances, and this should be validated. The output may indicate that this is the case, but if in doubt one can carry out a test that shows how serious the deviation from equal variances is.

Two-sample Test for Variances (un-paired data)

To compare the variances of the concentrations for the two groups:

statistics → hypothesis tests → Two-sample test for variances...

selecting the variable lconc and the group variable s1e2.

Output:

Two Sample Test for Variances of LCONC within S1E2

Sample Statistics
S1E2 Group    N      Mean      Std. Dev.    Variance
--------------------------------------------------
1             139    3.7808    1.0808       1.1681
2             54     4.2308    0.9586       0.9189

Hypothesis Test
Null hypothesis:  Variance 1 / Variance 2 = 1
Alternative:      Variance 1 / Variance 2 ^= 1

                 - Degrees of Freedom -
F        Numer.    Denom.    Pr > F
----------------------------------------------
1.27     138       53        0.3203

Two-Sample Rank Test (Wilcoxon-Mann-Whitney)

The two-sample rank test can more generally be considered a special case of the Kruskal-Wallis test, which tests whether k groups have the same distribution. To carry out the two-sample rank test:

statistics → ANOVA → non-parametric one-way ANOVA...

selecting the variable conc and the group variable s1e2.

Output:

Wilcoxon Scores (Rank Sums) for Variable CONC
Classified by Variable S1E2

                    Sum of       Expected      Std Dev       Mean
S1E2    N           Scores       Under H0      Under H0      Score
1       141         12964.0      13888.5000    356.782425    91.943262
2       55          6342.0       5417.5000     356.782425    115.309091

Average Scores Were Used for Ties

Wilcoxon 2-Sample Test (Normal Approximation)
(with Continuity Correction of .5)
S = 6342.00    Z = 2.58981    Prob > |Z| = 0.0096

T-Test Approx. Significance = 0.0103

Kruskal-Wallis Test (Chi-Square Approximation)
CHISQ = 6.7144    DF = 1    Prob > CHISQ = 0.0096

Exercise I

Rather than considering the concentration, we shall now consider the volume of each sperm sample as the parameter of interest. We wish to compare the ecological farmers and the airline workers. A volume of 3 ml is considered normal; investigate further whether the two groups are normal in this respect.

3) Without doing any computer work, make a strategy for how such an analysis can and should be carried out. What descriptive plots and statistics are needed? What hypotheses are formulated and tested? How will you validate the necessary assumptions for the suggested analysis?

4) Do the analyses, make the plots, and so on. Remember to interpret the results according to the subject matter.