Biostatistics Quantitative Data


Descriptive Statistics
Statistical Models
One-sample and Two-sample Tests
Introduction to SAS-ANALYST
T- and Rank-Tests using ANALYST

Thomas Scheike

Quantitative Data

This course focuses on the analysis of quantitative data, which are encountered in many areas of experimental research. Data may roughly be divided into three groups:

Quantitative data: sperm concentration (mill/ml), height in cm, level of hormones (measured on a continuous scale).
Qualitative data: sex, race, work, groupings of quantitative data (high/medium/low).
Survival data: length of waiting time until some event. For some individuals, however, the event is never recorded. These individuals are censored, and this makes particular methods necessary.

We will concentrate on quantitative data and describe:

Descriptive techniques (histograms, scatter plots, means, standard deviations, quantiles, percentiles, ...).
Non-parametric methods. These are based on the ranks of the data and may be used for one-sample tests, two-sample tests (paired and unpaired), one-way analysis of variance and computation of measures of association (Spearman correlation).
Regression analysis techniques for normally distributed residuals. These techniques include the t-test (paired and unpaired), analysis of variance (one- and two-way), regression analysis, multiple regression analysis and analysis of covariance.

We do not, however, discuss how to deal with repeated measures, where subjects are followed and measured repeatedly. When repeated measures are encountered they can often be reduced to a single summary number for each subject and then analysed by the techniques dealt with in this course.

Descriptive Statistics

We consider data on sperm concentration (mill/ml) for two groups of men in a study. One group consists of members of an association that promotes the development of organic agriculture (n=55), and the other group consists of workers from a major Scandinavian airline carrier (n=141). How these data were collected is very important if we want to draw more general conclusions from them. The data for both groups must be representative of members of organic agriculture associations and of airline workers, respectively. This must be validated very carefully, but for now we assume that it is the case. Drawing the data is the most important part of the statistical analysis.

The Histogram

The histogram is a different and better summary: it describes the distribution of the sperm concentrations for the two groups.

[Figure: histograms of sperm concentration (conc) for the organic farmers (eco) and the airline workers (sas).]

A histogram shows how the data are distributed, i.e., we can find out how many men have a sperm concentration lower than 100 mill/ml, say. For the airline workers this is 110 of 141 men, and for the organic farmers it is 35 of 55.

The histogram is made by grouping the sperm concentrations and then choosing the height of each bar such that

height x width = proportion of the observations in the group.

If all bars have the same width this is not important. A difficulty is deciding the width of the bars. Here are two different histograms:

[Figure: two histograms of conc with different bar widths; vertical axis: density.]

The Histogram

The histogram describes the variability of the data. We can approximate the chance that a data point is below some limit, above some limit or between two limits by calculating the area of the histogram over the appropriate range: the area equals the chance (the proportion of the observations).

[Figure: histogram of conc; vertical axis: density.]

What is the probability of seeing a sperm concentration of less than 40, say, for a man chosen at random among the men in the study?

Percentiles

[Figure: histogram of conc.]

To describe the histogram we may find the data value for which 50 % of the data are above or equal to it and 50 % are below or equal to it. This is the median. After ordering the data by size, the median is the value in the middle of the data. For an even number of data points the median is the average of the two middle values, for example

median = (6 + 7)/2 = 6.5.

Similarly, the 25 % percentile (quantile) is the data point for which at least 25 % of the data points have a lower or equal value and at least 75 % have a higher or equal value. In the example, the 25 % percentile is 4.

Can you find an approximate median in the histogram?
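The median and percentile rules above can be sketched in Python. This is only an illustration and not part of the original slides: the ten data values are hypothetical (chosen so that the median is (6 + 7)/2 = 6.5 and the 25 % percentile is 4), and `percentile` is an assumed helper name that applies the "at least p below or equal, at least 1 - p above or equal" rule, averaging when two neighbouring values both qualify.

```python
import math
import statistics


def percentile(data, p):
    """Empirical percentile: a value with at least a share p of the data
    below or equal to it and at least 1 - p above or equal to it.
    When two neighbouring values both qualify, their average is used."""
    xs = sorted(data)
    n = len(xs)
    pos = n * p
    j = math.floor(pos + 1e-9)
    if abs(pos - j) < 1e-9:          # n*p is a whole number: two candidates
        lower = xs[max(j - 1, 0)]
        upper = xs[min(j, n - 1)]
        return (lower + upper) / 2
    return xs[j]                     # the (j+1)-th smallest value


# Hypothetical data, not taken from the study.
values = [12, 1, 7, 4, 9, 3, 6, 10, 5, 8]

median = percentile(values, 0.50)
q25 = percentile(values, 0.25)
print("median:", median)
print("25 % percentile:", q25)
print("statistics.median agrees:", statistics.median(values) == median)
```

Other software may interpolate between neighbouring values in a different way, so percentiles of small samples can differ slightly between programs.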

Simple Summary Statistics

We can calculate the mean (average) and the standard deviation for the two groups:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \text{Variance} = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad SD = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

The mean describes the midpoint of the data, and the standard deviation the spread of the data. These numbers can always be calculated. Symmetric distributions are well characterized by them, whereas a skewed distribution is not well described.

[Figure: density of the normal distribution.]

If a distribution does not appear symmetric one should instead compute the median and various percentiles (25 % and 75 %, say) or give the range of the data (smallest and largest value). For the sperm data the sperm concentration was 77 (77) (mean (SD)), and the median and range were 56 and [0, 402], respectively. Which numbers are best suited to describe how the sperm concentration varies?

The Histogram

The histogram based on the data is an approximation of the population from which the data are a representative sample. A particularly nice histogram curve is the normal distribution,

[Figure: density of the normal distribution.]

which is a good approximation to many symmetric histograms. Some properties of the normal curve are:

The normal curve is symmetric around its mean.
It is completely described by its mean and SD.

By saying that data are normally distributed we mean that the histogram of the data is close to (well approximated by) the normal curve. Sometimes a transformation of the data is necessary to make this true.

[Figure: histograms of conc and of conc^0.3; vertical axis: density.]
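The mean, variance and SD formulas can be written out directly. The sketch below is an illustration under stated assumptions: the ten concentrations are hypothetical and deliberately right skewed, and `mean_variance_sd` is an assumed helper name. Python's `statistics` module is used only to cross-check the hand-written formulas.

```python
import math
import statistics


def mean_variance_sd(xs):
    """Sample mean, variance (divisor n - 1) and standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return mean, variance, math.sqrt(variance)


# Hypothetical, right-skewed concentrations (mill/ml); not the study data.
conc = [5, 12, 20, 35, 48, 56, 77, 110, 180, 402]

mean, variance, sd = mean_variance_sd(conc)
print("mean:", mean, "SD:", round(sd, 1))
print("median:", statistics.median(conc))
print("range:", (min(conc), max(conc)))
```

For skewed data like these the mean lies well above the median, which is why the median and range are the more informative summary.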

The Normal Distribution

In the same way as we use the histogram, we can use the normal curve to work out how the data are distributed. The normal curve satisfies:

50 % of the area is below the mean.
95 % of the area is in [mean - 1.96 SD, mean + 1.96 SD].
68 % of the area is in [mean - 1 SD, mean + 1 SD].
2.5 % of the area is below mean - 1.96 SD.

There are tables of the standard normal distribution, which has mean 0 and SD 1, and the area between two values for any other normal curve can be found from this table by converting the values to standard scores.

[Figure: cumulative distribution function of the standard normal distribution, pnorm(x).]

Example: The heights of Danish women are approximately normal with mean 165 cm and standard deviation 30 cm. If a woman is chosen at random, what is the chance that she is shorter than 180 cm?

Standard score = (180 - 165)/30 = 0.5, i.e., 180 is 0.5 standard deviations above the mean. The chance of being less than 0.5 in a standard normal distribution is 0.69. Is it a reasonable statistical model?

What is the chance that a randomly chosen woman is between 175 and 190 cm? Converting to standard scores gives (190 - 165)/30 = 0.83 and (175 - 165)/30 = 0.33.

[Figure: histogram of height; vertical axis: density.]

The figure gives the cumulative distribution, i.e., what percentage of the distribution is below a given value. The statement may formally be written as P(X < 0.83) = 0.80, P(X < 0.33) = 0.63 and

P(0.33 < X < 0.83) = P(X < 0.83) - P(X < 0.33) = 0.17.

This is based on the following precise statement about standard scores: if $Z$ is normal with mean $\mu$ and variance $\sigma^2$, then $(Z - \mu)/\sigma$ is standard normal.
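Instead of a printed table, the standard-score calculation can be checked with `statistics.NormalDist` from Python's standard library. The sketch below uses the mean of 165 cm and SD of 30 cm stated in the example, and the variable names are chosen here for illustration only.

```python
from statistics import NormalDist

# The model stated in the example: heights ~ N(165, 30^2).
heights = NormalDist(mu=165, sigma=30)

# Chance of being shorter than 180 cm, directly and via the standard score.
z_180 = (180 - 165) / 30
p_below_180 = heights.cdf(180)

# Chance of being between 175 and 190 cm, directly ...
p_between = heights.cdf(190) - heights.cdf(175)

# ... and via standard scores looked up in the standard normal distribution.
standard = NormalDist()  # mean 0, SD 1
p_between_z = standard.cdf((190 - 165) / 30) - standard.cdf((175 - 165) / 30)

print("standard score of 180 cm:", z_180)
print("P(height < 180):", round(p_below_180, 2))
print("P(175 < height < 190):", round(p_between, 2))
```

Both routes give the same probability, which is exactly the statement that converting to standard scores loses nothing.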

Distributions

We often draw histogram curves to show how the data are distributed (how they vary). How do these two histograms differ from the normal curve?

[Figure: density of the standard log-normal distribution.]

The first distribution is right skewed, i.e., data from this distribution contain some very high values.

[Figure: histogram of a multi-modal distribution.]

The other curve has several modes (it is multi-modal).

Example: Suppose that the sperm concentration in the Danish population is right skewed:

[Figure: density of the log-normal distribution with meanlog=3, sdlog=1.]

If we draw 50 men at random from this distribution we get the following numbers:

[Figure: histogram for the sample.]

Calculations give mean=27, SD=29, median=17, range=2,250. Drawing again gives mean=34, SD=27, median=21, range=4,153, and again: mean=53, SD=115, median=16, range=2,287, and again: mean=26, SD=31, median=20, range=2,

Example, continued: Looking at the concentrations on a log scale, the population is distributed as follows:

[Figure: density of the normal distribution with mean=3, sd=1.]

Drawing 50 men randomly from the population gives the following histogram:

[Figure: histogram for the log of the sample.]

Calculations give mean=2.9, SD=0.99, median=2.8, range=[0.8,5.5]. Drawing another random sample of 50 gives mean=3.0, SD=0.85, median=3.0, range=[1.3,5.0], and again: mean=2.9, SD=1.00, median=2.9, range=[0.4,5.6], and again: mean=3.1, SD=1.07, median=3.1, range=[0.7,4.9].

We conclude that for the right-skewed data the mean and SD are highly variable, whereas for the normal data the mean and SD provide a very effective summary. The median is stable for both distributions.

Descriptive Statistics: Summary

The histogram shows how the data are distributed, i.e., how they vary.
The area of the histogram represents frequency.
The normal distribution is a histogram curve that is a good approximation to many histograms.
The mean and standard deviation are useful summaries of how data are distributed. They should be used only when the data are approximately normally distributed.
The median and range are useful summaries of how data are distributed. They should be used when the data are not (approximately) normally distributed.
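The repeated-sampling experiment in the example can be imitated with a small simulation. This is a sketch under assumptions that are not in the slides: the seed, the number of repetitions (200) and the helper names are arbitrary choices. The population parameters (meanlog=3, sdlog=1, samples of 50) are taken from the example.

```python
import random
import statistics

random.seed(1)  # arbitrary seed, only to make the sketch reproducible


def summaries(sample):
    """Mean, SD and median of one sample."""
    return (statistics.mean(sample),
            statistics.stdev(sample),
            statistics.median(sample))


def repeat(draw, n=50, repetitions=200):
    """Draw `repetitions` samples of size n and summarize each of them."""
    return [summaries([draw() for _ in range(n)]) for _ in range(repetitions)]


def relative_spread(results, index):
    """SD of a summary across samples, relative to its average value."""
    values = [r[index] for r in results]
    return statistics.stdev(values) / statistics.mean(values)


raw = repeat(lambda: random.lognormvariate(3, 1))  # right-skewed scale
logged = repeat(lambda: random.gauss(3, 1))        # log scale

for name, results in (("raw scale", raw), ("log scale", logged)):
    print(name,
          "| relative spread of mean:", round(relative_spread(results, 0), 3),
          "| of SD:", round(relative_spread(results, 1), 3))
```

On the raw scale the sample mean and SD move around much more from sample to sample, relative to their size, than they do on the log scale.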

Statistical Models

When a physical quantity is measured several times we get different results due to measurement error and biological variation. For example, measuring the height of a subject may yield the following histogram:

[Figure: histogram of repeated measurements.]

What we see is variation around the average height. The variation is due to both measurement error and biological variation. Based on the histogram it appears reasonable to claim that the variation may be described by a normal distribution. We may phrase this as a statistical model:

Individual measurement = overall mean + noise

If we call the individual measurements $Y_i$ (the observed data), the overall mean $\mu$ (unknown) and the noise $\epsilon_i$, we have

$$Y_i = \mu + \epsilon_i.$$

This is a statistical model that describes how the observed measurements arise. The model claims that the individual observations vary around a fixed value ($\mu$) and that the variation is $\epsilon_i$. A model contains two parts: a systematic part, which is of scientific interest, and a random variation part, which is due to biological variation and measurement error. To complete the specification of the model we also specify how the random variation $\epsilon_i$ varies. We do this by specifying its distribution. It is assumed that $\epsilon_i \sim N(0, \sigma^2)$, i.e., it is normal with mean 0 and variance $\sigma^2$.

Estimation in Statistical Models

In a statistical model one primarily wishes to learn about the parameters of the model. However, to understand what can be learned about them one must also study the variability present. Consider the statistical model

$$Y_i = \mu + \epsilon_i, \qquad i = 1, \ldots, 200,$$

where the $\epsilon_i \sim N(0, \sigma^2)$ are independent noise terms. We want to know $\mu$ and $\sigma$. We may estimate these quantities by the sample average and the sample standard deviation:

$$\bar{y} = \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} Y_i \qquad \text{and} \qquad SD = \hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{y})^2}.$$

Looking at $\bar{y}$ and using the statistical model we get

$$\bar{y} = \hat{\mu} = \mu + \frac{1}{n}\sum_{i=1}^{n} \epsilon_i.$$

The last term is an average of independent $N(0, \sigma^2)$ noise terms, and mathematical arguments give that it is distributed as $N(0, \sigma^2/n)$. So we have described exactly what is known about $\mu$ through $\hat{\mu}$ by finding its distribution, $N(\mu, \sigma^2/n)$. One way to think about this is that we have a description of how the sample average varies if we repeat the sampling. The variance of the average is $n$ times smaller than the variance of the individual noise terms.

[Figure: normal densities.]
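The claim that the sample average varies as $N(\mu, \sigma^2/n)$ can be checked by simulation. The following is a minimal sketch, assuming hypothetical parameter values (mu = 3, sigma = 1, n = 25), an arbitrary seed and 2000 repetitions, none of which come from the slides.

```python
import random
import statistics

random.seed(2)  # arbitrary seed for reproducibility

mu, sigma, n = 3.0, 1.0, 25   # hypothetical parameter values
repetitions = 2000

# Repeat the sampling many times and keep the average of each sample.
means = []
for _ in range(repetitions):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.mean(sample))

observed_var = statistics.variance(means)
theoretical_var = sigma ** 2 / n   # variance of the average under the model

print("average of the sample averages:", round(statistics.mean(means), 3))
print("variance of the sample averages:", round(observed_var, 4))
print("sigma^2 / n:", theoretical_var)
```

The simulated variance of the averages should lie close to sigma^2 / n, i.e., n times smaller than the variance of a single observation.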

Sperm analysis

There is scientific interest in the level of sperm concentration in the Danish population. We have a representative sample from the population. We wish to see whether the level in Denmark is equal to what WHO considers the minimum level (20 mill/ml). A sample of 200 Danish men looks like this:

[Figure: histogram for the log of the sample.]

The log-transformed data appear to follow a normal distribution. A statistical model is now proposed to describe how the population varies. It contains a systematic part ($\mu$), which is the average log(sperm concentration) in the population, and a random variation part $\epsilon_i$, which is independent normal random variation $N(0, \sigma^2)$:

$$Y_i = \mu + \epsilon_i, \qquad i = 1, \ldots, 200.$$

We do not know $\mu$ and $\sigma$. We may estimate these quantities by the sample average and standard deviation:

$$\bar{y} = \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} Y_i = 3.9 \qquad \text{and} \qquad SD = \hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{y})^2} = 0.95.$$

This means that our best guess is that the population has mean 3.9 and that the level of random variation is described by a normal distribution with standard deviation 0.95.

Sperm analysis, continued

Drawing the best guess at how the population is distributed on top of the histogram:

[Figure: histogram and normal approximation.]

We see that the histogram and the normal curve agree well, so the statistical model is validated. This means that we have a reasonable description of the level of random variation and a reasonable description of the systematic variation.

We wish to investigate whether the data are consistent with the null hypothesis $H_0: \mu = \log(20)$. If they are not, we are left with the alternative $H_A: \mu \neq \log(20)$. In statistical terms, being consistent with the null hypothesis means that the data could have arisen when the null hypothesis is true.

The null hypothesis claims that the data are distributed around log(20), and if we use the description of the variation found above, the data should arise as a random sample from the left-hand curve:

[Figure: distribution under the null (left) and normal approximation to the observed data (right).]

The right-hand curve is the normal approximation to the data. Formally we write

$$Y_i = \log(20) + \epsilon_i, \qquad i = 1, \ldots, 200, \qquad \epsilon_i \sim N(0, 0.95^2).$$

Sperm analysis, continued

The question now is: how well does this fit with the average of 3.9 that we found in our data? The sample average is distributed as $N(\mu, \sigma^2/n)$, so if $H_0$ is true the sample average varies around log(20) with a standard deviation of $\sigma/\sqrt{n}$ (which we estimate as $\hat{\sigma}/\sqrt{n} = 0.95/14.1 \approx 0.07$). Thus our guess at how the average varies under the null is $N(\log(20), 0.07^2)$.

[Figure: distribution of the mean under the null, centred at log(20).]

How well does this fit with the data?

Sperm analysis: the t-test

To further summarize how the observed sample average compares with the null hypothesis we can calculate how many standard errors it lies from the null value:

$$T = \frac{\bar{y} - \log(20)}{SD/\sqrt{n}} = \frac{3.9 - 3.0}{0.067} \approx 13.5,$$

which is t-distributed with $n - 1 = 199$ degrees of freedom (p < 0.0001). We define SEM = SD/$\sqrt{n}$, the standard error of the mean. A t-distribution varies slightly more than a normal distribution,

[Figure: t-distributions with 199, 19 and 9 degrees of freedom, each compared with the standard normal density.]

because we have only a variable guess at the SD of the population. Note that the t-test has the form

$$T = \frac{\text{observed} - \text{expected}}{\text{standard error of observed}}.$$

We now calculate the chance of getting a test statistic as extreme as, or more extreme than, the observed one. This chance is computed under the null $H_0$ and is called the p-value. The smaller this chance is, the more evidence there is against the null. If the p-value is less than 5 % we reject the null (at the 5 % level).
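The t statistic can be recomputed from the summary numbers. The sketch below is illustrative only: `one_sample_t` is an assumed helper name, and the inputs are the rounded estimates quoted in the text (mean 3.9, SD 0.95, n = 200), so the result reflects that rounding.

```python
import math


def one_sample_t(mean, sd, n, mu0):
    """t statistic, standard error of the mean and degrees of freedom
    for testing H0: mu = mu0 from summary statistics."""
    sem = sd / math.sqrt(n)
    t_stat = (mean - mu0) / sem
    return t_stat, sem, n - 1


# Rounded summary values from the text; the null value is log(20).
t_stat, sem, df = one_sample_t(mean=3.9, sd=0.95, n=200, mu0=math.log(20))

print("SEM:", round(sem, 3))
print("T:", round(t_stat, 2), "on", df, "degrees of freedom")
```

The p-value itself requires the t-distribution, which is not in Python's standard library, so it is left to a table or a statistics package.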

Statistical Models

The random variation in a statistical model is described by a distribution, often a normal distribution. The random variation may consist of several components depending on the context. Different sources may be:

Measurement error.
Inter-individual variation.
Intra-individual variation.
Variation over time.

Statistical Models, Summary

The recipe when doing a statistical analysis:

A scientific hypothesis is formulated.
Graphs of the data are made to get a feel for the data and their variability.
A statistical model is proposed and validated. Its systematic variation contains the parameters about which the scientific hypotheses are formulated. Its random variation is described as normal, $N(0, \sigma^2)$.
Inference about the parameters is drawn in the statistical model.

The random variation is not the object of interest, but we must nevertheless specify a model for it that appears reasonable in order to understand correctly how much can be learned about the systematic part of the variation.

One-sample Comparisons: the t-test

Consider the 55 organic farmers and the 141 airline workers:

[Figure: histograms of conc for the organic farmers (eco) and the airline workers (sas).]

We now wish to investigate whether the sperm level is equal to 40 mill/ml (a level found in the literature) for the group of organic farmers. A statistical model is

$$Y_i = \mu + \epsilon_i, \qquad i = 1, \ldots, 55,$$

where the $\epsilon_i \sim N(0, \sigma^2)$ are independent noise terms. We know that the data are approximately normal when considered on a log scale,

[Figure: histograms of log(eco[eco > 0]) and log(sas[sas > 0]).]

and therefore investigate the scientific hypothesis on this scale. We estimate $\mu$ and $\sigma$ by the sample average and the sample standard deviation,

$$\bar{y} = \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} Y_i = 4.2 \qquad \text{and} \qquad SD = \hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{y})^2}.$$

The t-test

The one-sample t-test concerns the hypothesis $H_0: \mu = \log(40)$ versus $H_A: \mu \neq \log(40)$. The null claims that what we see is a sample from a population that varies symmetrically around log(40). The t-test statistic for $H_0$ is

$$T = \frac{\bar{y} - \log(40)}{SEM} = 0.51/0.14 = 3.6,$$

which should be looked up in the t-distribution with 54 = 55 - 1 degrees of freedom, where SEM = SD/$\sqrt{n}$. We get a p-value below 0.001. Thus, if the null were true and we drew 55 men from the population, the chance of getting an average as different from log(40) as the observed one, or more different, would be equal to this p-value. We conclude that the sperm level is significantly higher than 40 mill/ml in the population of organic farmers.

A 95 % confidence interval, consisting of the mean values that cannot be rejected by a 5 % test, is

$$(\hat{\mu} - 1.96\, SD/\sqrt{n},\ \hat{\mu} + 1.96\, SD/\sqrt{n}) = (3.9, 4.4).$$

This is the range of values for the mean of the log sperm concentration that we believe in.
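The link between the t-test and the confidence interval can be illustrated from raw data. The sketch below makes several assumptions that are not in the slides: the ten concentrations are hypothetical, `mean_ci` is an assumed helper name, and the normal quantile 1.96 is used as in the formula above (for a sample this small the corresponding t quantile would give a somewhat wider interval).

```python
import math
import statistics


def mean_ci(sample, z=1.96):
    """Approximate 95 % confidence interval for the mean: mean +/- z * SEM."""
    n = len(sample)
    mean = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)
    return mean - z * sem, mean + z * sem, mean, sem


# Hypothetical sperm concentrations (mill/ml); not the study data.
conc = [38, 45, 32, 48, 52, 60, 69, 80, 97, 150]
log_conc = [math.log(x) for x in conc]

low, high, mean, sem = mean_ci(log_conc)
t_stat = (mean - math.log(40)) / sem

print("mean of log(conc):", round(mean, 2), "SEM:", round(sem, 2))
print("95 % CI:", (round(low, 2), round(high, 2)))
print("T for H0: mu = log(40):", round(t_stat, 2))
```

With this construction the null value log(40) lies inside the interval exactly when the absolute value of T is at most 1.96, which is the sense in which the interval collects the mean values that a 5 % test cannot reject.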

A Non-parametric One-sample Test: the signed-rank test

Non-parametric techniques avoid the assumption of normally distributed residuals and instead ask questions about the median of the population. We are still looking at the organic farmers. We now take a subset of 10 men and wish to test whether they vary symmetrically around 40 mill/ml. We do not specify a detailed statistical model but want to test

$H_0$: the distribution of the differences is symmetric around 0 versus $H_A$: the distribution is not symmetric around 0 (skewed, for example).

We carry out a Wilcoxon one-sample test, a signed-rank test. We subtract 40 from each of the sperm levels, order the differences by absolute size and assign them ranks. We then check whether the sum of the ranks of the negative values is as big as the sum of the ranks of the positive values, as it should be under symmetry. The sum of the ranks of the negative numbers is 4.5. We look this up in a statistical table. The p-value is between 0.01 and 0.02. The test can also be carried out on all the data.

One may use a normal approximation to compute the p-value, i.e., compute $\mu = n(n + 1)/4$ and $\sigma = \sqrt{n(n + 1)(2n + 1)/24}$, and $Z = (T - \mu)/\sigma$, for $n > 20$. For smaller values of $n$ use a table.

Two-sample Comparisons: the t-test

Consider the 55 organic farmers and the 141 airline workers on a log scale:

[Figure: histograms of log(eco[eco > 0]) and log(sas[sas > 0]).]

One may want to know whether these two groups could really be varying around the same level, so that the differences we see are due to random variation. We start by proposing a statistical model in which we can answer the question:

$$Y_{i,j} = \mu_i + \epsilon_{i,j}, \qquad i = 1, 2, \quad j = 1, \ldots, n_i,$$

where the $\epsilon_{i,j} \sim N(0, \sigma_i^2)$ are independent noise terms. The histograms of the data show that the model is a good description of the data on the log scale. Estimating the mean and the variability in the two populations underlying the samples gives $\hat{\mu}_1 = 3.9$, $\hat{\sigma}_1^2 = 1.08$ and $\hat{\mu}_2 = 4.2$, with $\hat{\sigma}_2^2$ estimated in the same way.
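The signed-rank calculation described earlier in this section can be sketched as follows. The ten concentrations are hypothetical (the values used in the slides are not available) and were chosen so that the negative differences have rank sum 4.5. The helper names are assumptions, and dropping differences that are exactly zero is a common convention assumed here rather than something the slides state.

```python
def average_ranks(values):
    """Ranks 1..n of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1   # average of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def signed_rank_sums(sample, centre):
    """Rank sums of the negative and the positive differences from `centre`."""
    diffs = [x - centre for x in sample if x != centre]  # drop zero differences
    ranks = average_ranks([abs(d) for d in diffs])
    negative = sum(r for d, r in zip(diffs, ranks) if d < 0)
    positive = sum(r for d, r in zip(diffs, ranks) if d > 0)
    return negative, positive


# Hypothetical concentrations for 10 men (mill/ml); not the study data.
conc = [38, 45, 32, 48, 52, 60, 69, 80, 97, 150]
negative, positive = signed_rank_sums(conc, 40)

print("sum of ranks of negative differences:", negative)
print("sum of ranks of positive differences:", positive)
```

Under symmetry around 40 both sums should be close to n(n + 1)/4 = 27.5, so a negative rank sum as small as 4.5 speaks against the null.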

Two-sample Comparisons: the t-test

To carry out a two-sample t-test we first need to check whether the variability is the same in the two groups. We test $H_0: \sigma_1 = \sigma_2$ versus $H_A: \sigma_1 \neq \sigma_2$ using the test statistic

$$F = \frac{\max(\hat{\sigma}_1^2, \hat{\sigma}_2^2)}{\min(\hat{\sigma}_1^2, \hat{\sigma}_2^2)} = 1.27,$$

which we look up in the F-distribution with (140, 54) degrees of freedom (p=0.32). So we accept the hypothesis of equal variances. Now we can calculate a combined (pooled) estimate of the variability,

$$SD^2 = \frac{(n_1 - 1)\hat{\sigma}_1^2 + (n_2 - 1)\hat{\sigma}_2^2}{(n_1 - 1) + (n_2 - 1)}.$$

With the combined variability estimate SD we can proceed to the two-sample t-test of $H_0: \mu_1 = \mu_2$ versus $H_A: \mu_1 \neq \mu_2$,

$$T = \frac{\bar{y}_1 - \bar{y}_2}{SD\sqrt{(1/n_1) + (1/n_2)}} = 2.82,$$

which we look up in the t-distribution with $n_1 + n_2 - 2 = f_1 + f_2$ degrees of freedom (p=0.006). We conclude that the organic farmers have a significantly higher sperm level than the airline workers. A 95 % confidence interval for the difference in means between the two groups is given by

$$(\bar{y}_1 - \bar{y}_2 - 1.96\, SED,\ \bar{y}_1 - \bar{y}_2 + 1.96\, SED),$$

where $SED = SD\sqrt{(1/n_1) + (1/n_2)}$.

Non-parametric Two-sample Comparisons: the rank test

The non-parametric rank test is also called the Wilcoxon-Mann-Whitney test. Consider two groups of data as before. We now wish to test whether the distributions of the two populations could be equal, or whether this must be rejected. The statistical model is:

$Y_{i,j}$ has an arbitrary distribution $F_i(\cdot)$.
All data points are independent.

In this non-parametric model we wish to test $H_0$: the distributions are the same versus $H_A$: the distributions are not the same. We calculate a test statistic as follows:

Pool all the data and assign ranks.
Sum the ranks of the smallest group.
Look the sum of ranks up in a statistical table to get the p-value.

The sum of ranks, $T$, for the organic farmers is 6342 (the total sum of ranks is 19306, so the expected sum under $H_0$ is 19306 x (55/196) = 5417.5). The p-value is obtained from a computer program.

One may use a normal approximation to compute the p-value, i.e., compute $\mu = n_1(n_1 + n_2 + 1)/2$ (5417.5) and $\sigma = \sqrt{n_2\,\mu/6}$ (357), and $Z = (T - \mu)/\sigma$, for $n_1, n_2 > 10$. For smaller values, use a table.
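The normal approximation for the rank sum can be evaluated exactly as the formulas are written. The sketch below uses the group sizes and rank sum quoted in the text (n1 = 55, n2 = 141, T = 6342). The function name is an assumption, and no correction for ties or continuity is applied, so a statistics package may report slightly different numbers.

```python
import math
from statistics import NormalDist


def rank_sum_normal_approx(t, n1, n2):
    """Normal approximation for the rank sum t of the group of size n1."""
    mu = n1 * (n1 + n2 + 1) / 2        # expected rank sum under H0
    sigma = math.sqrt(n2 * mu / 6)     # SD of the rank sum under H0
    z = (t - mu) / sigma
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return mu, sigma, z, p_two_sided


mu, sigma, z, p = rank_sum_normal_approx(t=6342, n1=55, n2=141)

print("expected rank sum:", mu)
print("SD of rank sum:", round(sigma, 1))
print("Z:", round(z, 2), "two-sided p:", round(p, 4))
```

The expected rank sum is simply the total rank sum, (n1 + n2)(n1 + n2 + 1)/2, multiplied by the share n1/(n1 + n2) of observations in the group.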

Paired Comparisons

When data are paired the two measurements are often not independent:

Measuring the right and the left biceps.
Growth before and after treatment.
Heights of men and women when sampled as couples.

With only two correlated measurements, the data may nevertheless be analysed by simple techniques. A correct analysis is obtained by making a one-sample analysis of the differences, because the differences between the before and after measurements are independent between subjects. Therefore one should simply test whether the differences vary around 0, by either a t-test or a signed-rank test.

When investigating the effect of a lotion that prevents sunburn, say, we could apply the lotion to one arm and placebo to the other. The difference between the arms may be ascribed to the lotion. The difference is a measure that is corrected for inter-individual variation, which may be large.

Summary

Make graphs of the data.

One-sample test:

When the variation is approximately normal, the t-test may be used to test a hypothesis about the mean of the underlying population. The p-value provided is valid only if the variation is approximately normal. A nice summary of the data is provided by the confidence interval for the mean.
When the data are not normally distributed and interest is concentrated on inference rather than estimates, the signed-rank test may be used. This test is always valid. No confidence intervals may be given.
Right-skewed data may be transformed to approximate normality by transformations like $\sqrt{x}$, $x^{1/3}$ and $\log x$.

Two-sample test:

Two groups of data may be compared by the t-test when the variation is approximately normal and the variance of the residual variation is equal in the two groups. A nice summary of the difference between the groups is given by the confidence interval for the difference between the means.
When the data are not normally distributed and interest is concentrated on inference rather than estimates, the rank test may be used. This test is always valid. No confidence intervals may be given.

Paired data are handled by one-sample techniques applied to the differences within the pairs.
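The paired analysis described above reduces to a one-sample t-test on the within-pair differences. The following is a minimal sketch with hypothetical before and after measurements on eight subjects. The numbers and variable names are illustrative assumptions only.

```python
import math
import statistics

# Hypothetical paired measurements on the same eight subjects.
before = [12.1, 14.3, 11.8, 15.0, 13.2, 12.7, 14.9, 13.5]
after = [13.0, 14.1, 12.9, 16.2, 13.1, 13.9, 15.8, 14.4]

# One difference per subject: these are independent between subjects.
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
mean_diff = statistics.mean(diffs)
sem = statistics.stdev(diffs) / math.sqrt(n)
t_stat = mean_diff / sem          # tests H0: the differences vary around 0
df = n - 1

print("mean difference:", round(mean_diff, 3))
print("T:", round(t_stat, 2), "on", df, "degrees of freedom")
```

Treating the two columns as independent groups would ignore the pairing and mix the large between-subject variation into the comparison.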

Statistical Analysis using Analyst (SAS)

Analyst is a Windows-based application in the SAS statistical software. SAS is started by clicking

Start → Statistik → SAS

in the lower left-hand corner. Analyst is then started via

Solutions → Analysis → Analyst

Commands will be presented as we need them for the various analyses. Remember that the focus is on the statistical analyses rather than on how one does this and that.

We consider data on sperm concentration (mill/ml) for two groups of men in a study. One group consists of members of an association that promotes the development of organic agriculture (n=55), and the other group consists of workers from a major Scandinavian airline carrier (n=141).

The data set contains the following variables:

obs: observation number.
abstime: length of abstinence in days.
age: age of subject.
s1e2: group indicator.
conc: sperm concentration (mill/ml).
volume: volume of sperm sample (ml).

The data are loaded via

File → Open...

from n:\human\oeko, which is a SAS data set. The data will then appear in the data table. It consists of a record for each subject with the variables described above. To make your own new variables when you work with the data you must create your own version of the data. You do this by saving the data under a new name:

File → Save

Now type a new name, e.g. oeko12 if you are in front of machine 12.

Data Manipulations

A little bit of data manipulation is needed. New transformed variables are constructed by putting the data table in edit mode,

Edit → Mode → Edit

and then

Data → Transform → Compute...

Now type the new variable name (e.g. conc3) and an expression that defines the new variable in the box below the equality sign (e.g. conc**.3333). A new variable called conc3, equal to the concentration on the cube-root scale, is now defined and appears in the data table.

Alternatively, one may apply one of the standard transformations of conc, after highlighting the column one wishes to transform, via

Data → Transform

To make a variable that can be used for the one-sample test (e.g. dl40=lconc-log(40)):

Data → Transform → Compute...

Now type the new variable name (dl40) and the expression that defines the new variable, log(conc)-log(40), in the box below the equality sign.

Data Manipulations

To group a continuous variable according to its value and to define a classification variable based on it:

Data → Transform → Recode Ranges...

In the recode dialog give the column name (volume) and the name of the new grouped version (gvol) and click OK. In the next window give the bounds 0,3; 3,4 and 4,15 for the three groups, name them (1,2,3) in the rightmost column and click OK.

To delete a variable, highlight the column in the data table and choose

Edit → Delete

To construct a subset of the data, e.g., the subset of organic farmers for a specific analysis of this group:

Data → Filter → Subset Data...

In the subset dialog you can apply a Where clause to the data (click s1e2, then eq, then constant value followed by 1 to select s1e2=1, the airline workers).
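For readers who want to see what these menu operations compute, here is a rough Python equivalent. It is a sketch only, not SAS code: the four records are hypothetical, the treatment of the range boundaries (lower bound included) is an assumption, and log(0) is treated as a missing value.

```python
import math

# Hypothetical records with the variables used above; not the study data.
records = [
    {"obs": 1, "s1e2": 1, "conc": 48.0, "volume": 2.5},
    {"obs": 2, "s1e2": 1, "conc": 12.0, "volume": 3.0},
    {"obs": 3, "s1e2": 2, "conc": 69.0, "volume": 4.2},
    {"obs": 4, "s1e2": 2, "conc": 0.0, "volume": 3.6},
]


def volume_group(volume):
    """Recode volume into groups 1, 2, 3 using the bounds 0-3, 3-4 and 4-15.
    Including the lower bound of each range is an assumption."""
    if 0 <= volume < 3:
        return 1
    if 3 <= volume < 4:
        return 2
    if 4 <= volume <= 15:
        return 3
    return None


for r in records:
    r["conc3"] = r["conc"] ** (1 / 3)                            # cube-root scale
    r["lconc"] = math.log(r["conc"]) if r["conc"] > 0 else None  # log(0) is missing
    r["dl40"] = r["lconc"] - math.log(40) if r["lconc"] is not None else None
    r["gvol"] = volume_group(r["volume"])

# Subset corresponding to the Where clause s1e2 = 1.
subset = [r for r in records if r["s1e2"] == 1]

for r in records:
    print(r["obs"], round(r["conc3"], 2), r["dl40"], r["gvol"])
```

The concentration of 0 has no logarithm, so its dl40 value is missing and the observation drops out of any analysis on the log scale.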

Histograms

To make a histogram of the concentration (conc):

Graphs → Histogram...

Select conc as the analysis variable and s1e2 as the class variable (the classification variable). If the class variable is omitted, no classification variable is used. Clicking OK then does the job.

[Figure: histograms of conc for the organic farmers (eco) and the airline workers (sas).]

Simple Descriptive Statistics

To compute means, standard deviations, variances, medians and percentiles as well as the range:

Statistics → Descriptive → Distributions...

Select conc as the analysis variable and s1e2 as the class variable (the classification variable). Clicking OK then does the job.

To examine the normality of a variable one may draw the density of a normal distribution on the same plot. To do this, click Fit in the Distributions dialog, select Normal and click OK in the Fit dialog before clicking OK in the Distributions dialog.

Output:

Univariate Procedure, Variable=CONC, first S1E2 group
Moments: N 141, Sum Wgts 141, M(Sign) 69.5, Sgn Rank 4865
Quantiles (Def=5): 50% Med 48, 10% 12, 5% 3.3, 1% 0, 0% Min 0
Range 402, Q3-Q1 68, Mode 12
Extremes, lowest value (obs): 0 (40), 0 (1), 0.75 (67), 1.88 (60), 2.3 (132)
Extremes, highest value (obs): 233 (92), 284 (102), 308 (32), 358 (104), 402 (69)

Univariate Procedure, Variable=CONC, second S1E2 group
Moments: N 55, Sum Wgts 55, Num ^= 0 54, Num > 0 54, M(Sign) 27
Quantiles (Def=5): 50% Med 69, 10% 15, 5% 9.1, 1% 0, 0% Min 0
Range 354, Q3-Q1 105, Mode 69
Extremes, lowest value (obs): 0 (40), 5.5 (15), 9.1 (35), 11 (42), 14 (10)
Extremes, highest value (obs): 264 (32), 264 (33), 297 (47), 322 (14), 354 (51)

One-sample T-test and Signed Rank Test

We wish to examine whether the hypothesis that the sperm level varies around 40 mill/ml can be statistically rejected. To make a one-sample t-test, first transform to the log scale to obtain approximate normality and then compute a new variable dl40=lconc-log(40) (see above). Now

Statistics → Descriptive → Distributions...

selecting the variable dl40 with class variable s1e2 does the job. In the output, the entry T:Mean=0 gives the t-test and the entry Sgn Rank gives the signed-rank test.

Output:

Univariate Procedure, Variable=DL40, first S1E2 group
Moments: N 139, Sum Wgts 139, Num > 0 79, M(Sign) 9.5, Sgn Rank 816
Extremes, lowest (obs): (67), (60), (132), (111), (49)
Extremes, highest (obs): (92), (102), (32), (104), (69)
Missing Value: Count 2

Univariate Procedure, Variable=DL40, second S1E2 group
Moments: N 54, Sum Wgts 54, Num ^= 0 54, Num > 0 41, M(Sign) 14
Extremes, lowest (obs): (15), (35), (42), (10), (29)
Extremes, highest (obs): (32), (33), (47), (14), (51)
Missing Value: Count 1

One-sample T-test

Alternatively one may use a menu that has been designed especially for the one-sample t-test:

Statistics → Hypothesis Tests → One-Sample t-test...

Select the variable lconc and enter the mean we wish to test, here 4. Note that the t-test should be carried out for one group only, the organic farmers, say, and that the active data set should therefore contain only this group. To make the test it is necessary to construct a new data set that consists of the group of interest, as done in the data manipulation section above. To make the t-test for both groups, you can instead specify this under the Variables button by giving s1e2 as the By variable.

Output:

One Sample T Test for a Mean
Sample Statistics for LCONC: N, Mean, Std. Dev., Std. Error
Hypothesis Test: Null hypothesis: Mean of LCONC = 4; Alternative: Mean of LCONC ^= 4
t Statistic, Df, Prob > t

Two-sample T-test for Means (unpaired data)

To compare the concentrations for the two groups:

Statistics → Hypothesis Tests → Two-Sample t-test for Means...

Select the variable lconc and the group variable s1e2.

Output:

Two Sample T Test for the Means of LCONC within S1E2
Sample Statistics: Group, N, Mean, Std. Dev., Std. Error
Hypothesis Test: Null hypothesis: Mean 1 - Mean 2 = 0; Alternative: Mean 1 - Mean 2 ^= 0
If Variances Are Equal / Not Equal: t statistic, Df, Pr > t

It is useful to supplement the analysis with some plots. Try for example the Plots button and select one of the plots. The conclusions are based on an assumption of equal variances, and this should be validated. The output may indicate that this is the case, but if in doubt one can carry out a test that shows how serious the deviation from equal variances is.

Two-sample Test for Variances (unpaired data)

To compare the variances of the two groups:

Statistics → Hypothesis Tests → Two-Sample Test for Variances...

Select the variable lconc and the group variable s1e2.

Output:

Two Sample Test for Variances of LCONC within S1E2
Sample Statistics: S1E2 Group, N, Mean, Std. Dev., Variance
Hypothesis Test: Null hypothesis: Variance 1 / Variance 2 = 1; Alternative: Variance 1 / Variance 2 ^= 1
F, Degrees of Freedom (Numer., Denom.), Pr > F
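The quantities behind the equal-variance row of the t-test output and the variance-ratio test can be computed by hand. The sketch below is an illustration with two small hypothetical groups of log concentrations. The helper name `pooled_two_sample` is an assumption, and p-values are omitted because the t- and F-distributions are not available in Python's standard library.

```python
import math
import statistics


def pooled_two_sample(x, y):
    """Variance ratio, pooled variance, t statistic and degrees of freedom
    for comparing the means of two independent groups."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)
    f_ratio = max(v1, v2) / min(v1, v2)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    sed = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    t_stat = (statistics.mean(x) - statistics.mean(y)) / sed
    return f_ratio, pooled_var, t_stat, n1 + n2 - 2


# Hypothetical log concentrations for two groups; not the study data.
group1 = [3.1, 4.0, 3.6, 4.4, 2.9, 3.8, 4.1]
group2 = [4.2, 4.6, 3.9, 4.8, 4.4, 5.0]

f_ratio, pooled_var, t_stat, df = pooled_two_sample(group1, group2)

print("F ratio:", round(f_ratio, 2))
print("pooled variance:", round(pooled_var, 3))
print("T:", round(t_stat, 2), "on", df, "degrees of freedom")
```

The pooled variance always lies between the two group variances, weighted towards the larger group.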

22 Two-Sample Rank Sum Test

The two-sample (Wilcoxon) rank sum test can more generally be considered as a special case of the Kruskal-Wallis test, which tests whether k groups have the same distribution. To carry out the two-sample rank sum test:

statistics > ANOVA > non-parametric one-way ANOVA...

selecting the variable conc and the group variable s1e2.

Output: Wilcoxon Scores (Rank Sums) for Variable CONC Classified by Variable S1E2
Columns: S1E2, N, Sum of Scores, Expected Under H0, Std Dev Under H0, Mean Score
Average Scores Were Used for Ties
Wilcoxon 2-Sample Test (Normal Approximation) (with Continuity Correction of .5): S =, Z =, Prob > |Z| =
T-Test Approx. Significance =
Kruskal-Wallis Test (Chi-Square Approximation): CHISQ =, DF = 1, Prob > CHISQ =

Exercise-I

Rather than considering the concentration, we shall now consider the volume of each sperm sample as the parameter of interest. We wish to compare the ecological farmers and the airline workers. A volume of 3 ml is considered normal. Investigate further whether the two groups are normal in this respect.

3) Without doing any computer work, make a strategy for how such an analysis can and should be carried out. What descriptive plots and statistics are needed? What hypotheses are formulated and tested? How will you validate the necessary assumptions for the suggested analysis?

4) Do the analyses, make the plots and so on. Remember to interpret the results according to the subject matter.
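The entries of the Wilcoxon two-sample output above (sum of scores, expectation and standard deviation under H0, Z with continuity correction, and the Kruskal-Wallis chi-square) can be sketched as follows. This is a minimal illustration, not the SAS implementation: the rank sum is taken for the first group passed in (an assumption), ties receive average ranks, and the function names and data are hypothetical.

```python
import math
from statistics import NormalDist

def midranks(values):
    """Rank values 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(x, y):
    """Return (S, expected, sd, z, two-sided p, Kruskal-Wallis chi-square)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks = midranks(list(x) + list(y))
    s = sum(ranks[:n1])                    # rank sum of the first group
    expected = n1 * (n + 1) / 2
    mean_rank = (n + 1) / 2
    ss = sum((r - mean_rank) ** 2 for r in ranks)
    sd = math.sqrt(n1 * n2 / (n * (n - 1)) * ss)   # accounts for ties
    diff = s - expected
    # Continuity correction of 0.5, applied towards zero.
    corrected = diff - math.copysign(0.5, diff) if diff != 0 else 0.0
    z = corrected / sd
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    chisq = (diff / sd) ** 2               # Kruskal-Wallis statistic for k = 2, DF = 1
    return s, expected, sd, z, p, chisq

# Hypothetical concentrations (mill/ml), for illustration only.
s, expected, sd, z, p, chisq = rank_sum_test([12, 35, 35, 80], [35, 60, 90, 110, 150])
print(f"S = {s}, Z = {z:.4f}, Prob > |Z| = {p:.4f}, CHISQ = {chisq:.4f}")
```

The last line of `rank_sum_test` shows why the two-sample test is a special case of the Kruskal-Wallis test: with two groups, the Kruskal-Wallis chi-square equals the square of the uncorrected Z statistic.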

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Background to Statistics

Background to Statistics FACT SHEET Background to Statistics Introduction Statistics include a broad range of methods for manipulating, presenting and interpreting data. Professional scientists of all kinds need to be proficient

More information

An Analysis of College Algebra Exam Scores December 14, James D Jones Math Section 01

An Analysis of College Algebra Exam Scores December 14, James D Jones Math Section 01 An Analysis of College Algebra Exam s December, 000 James D Jones Math - Section 0 An Analysis of College Algebra Exam s Introduction Students often complain about a test being too difficult. Are there

More information

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown

Nonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown Nonparametric Statistics Leah Wright, Tyler Ross, Taylor Brown Before we get to nonparametric statistics, what are parametric statistics? These statistics estimate and test population means, while holding

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

PSY 307 Statistics for the Behavioral Sciences. Chapter 20 Tests for Ranked Data, Choosing Statistical Tests

PSY 307 Statistics for the Behavioral Sciences. Chapter 20 Tests for Ranked Data, Choosing Statistical Tests PSY 307 Statistics for the Behavioral Sciences Chapter 20 Tests for Ranked Data, Choosing Statistical Tests What To Do with Non-normal Distributions Tranformations (pg 382): The shape of the distribution

More information

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Chapter Fifteen Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Copyright 2010 Pearson Education, Inc. publishing as Prentice Hall 15-1 Internet Usage Data Table 15.1 Respondent Sex Familiarity

More information

Solutions exercises of Chapter 7

Solutions exercises of Chapter 7 Solutions exercises of Chapter 7 Exercise 1 a. These are paired samples: each pair of half plates will have about the same level of corrosion, so the result of polishing by the two brands of polish are

More information

Using SPSS for One Way Analysis of Variance

Using SPSS for One Way Analysis of Variance Using SPSS for One Way Analysis of Variance This tutorial will show you how to use SPSS version 12 to perform a one-way, between- subjects analysis of variance and related post-hoc tests. This tutorial

More information

Distribution-Free Procedures (Devore Chapter Fifteen)

Distribution-Free Procedures (Devore Chapter Fifteen) Distribution-Free Procedures (Devore Chapter Fifteen) MATH-5-01: Probability and Statistics II Spring 018 Contents 1 Nonparametric Hypothesis Tests 1 1.1 The Wilcoxon Rank Sum Test........... 1 1. Normal

More information

Data analysis and Geostatistics - lecture VII

Data analysis and Geostatistics - lecture VII Data analysis and Geostatistics - lecture VII t-tests, ANOVA and goodness-of-fit Statistical testing - significance of r Testing the significance of the correlation coefficient: t = r n - 2 1 - r 2 with

More information

The Chi-Square Distributions

The Chi-Square Distributions MATH 03 The Chi-Square Distributions Dr. Neal, Spring 009 The chi-square distributions can be used in statistics to analyze the standard deviation of a normally distributed measurement and to test the

More information

Analysis of 2x2 Cross-Over Designs using T-Tests

Analysis of 2x2 Cross-Over Designs using T-Tests Chapter 234 Analysis of 2x2 Cross-Over Designs using T-Tests Introduction This procedure analyzes data from a two-treatment, two-period (2x2) cross-over design. The response is assumed to be a continuous

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

The Chi-Square Distributions

The Chi-Square Distributions MATH 183 The Chi-Square Distributions Dr. Neal, WKU The chi-square distributions can be used in statistics to analyze the standard deviation σ of a normally distributed measurement and to test the goodness

More information

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE THE ROYAL STATISTICAL SOCIETY 004 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE PAPER II STATISTICAL METHODS The Society provides these solutions to assist candidates preparing for the examinations in future

More information

UNIVERSITY OF MASSACHUSETTS Department of Biostatistics and Epidemiology BioEpi 540W - Introduction to Biostatistics Fall 2004

UNIVERSITY OF MASSACHUSETTS Department of Biostatistics and Epidemiology BioEpi 540W - Introduction to Biostatistics Fall 2004 UNIVERSITY OF MASSACHUSETTS Department of Biostatistics and Epidemiology BioEpi 50W - Introduction to Biostatistics Fall 00 Exercises with Solutions Topic Summarizing Data Due: Monday September 7, 00 READINGS.

More information

Relating Graph to Matlab

Relating Graph to Matlab There are two related course documents on the web Probability and Statistics Review -should be read by people without statistics background and it is helpful as a review for those with prior statistics

More information

ANOVA - analysis of variance - used to compare the means of several populations.

ANOVA - analysis of variance - used to compare the means of several populations. 12.1 One-Way Analysis of Variance ANOVA - analysis of variance - used to compare the means of several populations. Assumptions for One-Way ANOVA: 1. Independent samples are taken using a randomized design.

More information

Chapter 7. Inference for Distributions. Introduction to the Practice of STATISTICS SEVENTH. Moore / McCabe / Craig. Lecture Presentation Slides

Chapter 7. Inference for Distributions. Introduction to the Practice of STATISTICS SEVENTH. Moore / McCabe / Craig. Lecture Presentation Slides Chapter 7 Inference for Distributions Introduction to the Practice of STATISTICS SEVENTH EDITION Moore / McCabe / Craig Lecture Presentation Slides Chapter 7 Inference for Distributions 7.1 Inference for

More information

unadjusted model for baseline cholesterol 22:31 Monday, April 19,

unadjusted model for baseline cholesterol 22:31 Monday, April 19, unadjusted model for baseline cholesterol 22:31 Monday, April 19, 2004 1 Class Level Information Class Levels Values TRETGRP 3 3 4 5 SEX 2 0 1 Number of observations 916 unadjusted model for baseline cholesterol

More information

One-Way ANOVA. Some examples of when ANOVA would be appropriate include:

One-Way ANOVA. Some examples of when ANOVA would be appropriate include: One-Way ANOVA 1. Purpose Analysis of variance (ANOVA) is used when one wishes to determine whether two or more groups (e.g., classes A, B, and C) differ on some outcome of interest (e.g., an achievement

More information

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS

T.I.H.E. IT 233 Statistics and Probability: Sem. 1: 2013 ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS ESTIMATION AND HYPOTHESIS TESTING OF TWO POPULATIONS In our work on hypothesis testing, we used the value of a sample statistic to challenge an accepted value of a population parameter. We focused only

More information

Chapter 6. The Standard Deviation as a Ruler and the Normal Model 1 /67

Chapter 6. The Standard Deviation as a Ruler and the Normal Model 1 /67 Chapter 6 The Standard Deviation as a Ruler and the Normal Model 1 /67 Homework Read Chpt 6 Complete Reading Notes Do P129 1, 3, 5, 7, 15, 17, 23, 27, 29, 31, 37, 39, 43 2 /67 Objective Students calculate

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

SPSS Guide For MMI 409

SPSS Guide For MMI 409 SPSS Guide For MMI 409 by John Wong March 2012 Preface Hopefully, this document can provide some guidance to MMI 409 students on how to use SPSS to solve many of the problems covered in the D Agostino

More information

Hotelling s One- Sample T2

Hotelling s One- Sample T2 Chapter 405 Hotelling s One- Sample T2 Introduction The one-sample Hotelling s T2 is the multivariate extension of the common one-sample or paired Student s t-test. In a one-sample t-test, the mean response

More information

Inferences About the Difference Between Two Means

Inferences About the Difference Between Two Means 7 Inferences About the Difference Between Two Means Chapter Outline 7.1 New Concepts 7.1.1 Independent Versus Dependent Samples 7.1. Hypotheses 7. Inferences About Two Independent Means 7..1 Independent

More information

9/2/2010. Wildlife Management is a very quantitative field of study. throughout this course and throughout your career.

9/2/2010. Wildlife Management is a very quantitative field of study. throughout this course and throughout your career. Introduction to Data and Analysis Wildlife Management is a very quantitative field of study Results from studies will be used throughout this course and throughout your career. Sampling design influences

More information

20 Hypothesis Testing, Part I

20 Hypothesis Testing, Part I 20 Hypothesis Testing, Part I Bob has told Alice that the average hourly rate for a lawyer in Virginia is $200 with a standard deviation of $50, but Alice wants to test this claim. If Bob is right, she

More information

Statistics: revision

Statistics: revision NST 1B Experimental Psychology Statistics practical 5 Statistics: revision Rudolf Cardinal & Mike Aitken 29 / 30 April 2004 Department of Experimental Psychology University of Cambridge Handouts: Answers

More information

22s:152 Applied Linear Regression. Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA)

22s:152 Applied Linear Regression. Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA) 22s:152 Applied Linear Regression Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA) We now consider an analysis with only categorical predictors (i.e. all predictors are

More information

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2

Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2 Lecture 2: Basic Concepts and Simple Comparative Experiments Montgomery: Chapter 2 Fall, 2013 Page 1 Random Variable and Probability Distribution Discrete random variable Y : Finite possible values {y

More information

INTRODUCTION TO ANALYSIS OF VARIANCE

INTRODUCTION TO ANALYSIS OF VARIANCE CHAPTER 22 INTRODUCTION TO ANALYSIS OF VARIANCE Chapter 18 on inferences about population means illustrated two hypothesis testing situations: for one population mean and for the difference between two

More information

MATH Notebook 3 Spring 2018

MATH Notebook 3 Spring 2018 MATH448001 Notebook 3 Spring 2018 prepared by Professor Jenny Baglivo c Copyright 2010 2018 by Jenny A. Baglivo. All Rights Reserved. 3 MATH448001 Notebook 3 3 3.1 One Way Layout........................................

More information

Continuous random variables

Continuous random variables Continuous random variables A continuous random variable X takes all values in an interval of numbers. The probability distribution of X is described by a density curve. The total area under a density

More information

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics Exploring Data: Distributions Look for overall pattern (shape, center, spread) and deviations (outliers). Mean (use a calculator): x = x 1 + x

More information

1 Introduction to Minitab

1 Introduction to Minitab 1 Introduction to Minitab Minitab is a statistical analysis software package. The software is freely available to all students and is downloadable through the Technology Tab at my.calpoly.edu. When you

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

Nonparametric Statistics

Nonparametric Statistics Nonparametric Statistics Nonparametric or Distribution-free statistics: used when data are ordinal (i.e., rankings) used when ratio/interval data are not normally distributed (data are converted to ranks)

More information

Non-parametric tests, part A:

Non-parametric tests, part A: Two types of statistical test: Non-parametric tests, part A: Parametric tests: Based on assumption that the data have certain characteristics or "parameters": Results are only valid if (a) the data are

More information

Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p.

Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p. Preface p. xi Introduction and Descriptive Statistics p. 1 Introduction to Statistics p. 3 Statistics, Science, and Observations p. 5 Populations and Samples p. 6 The Scientific Method and the Design of

More information

Lecture Slides. Elementary Statistics. by Mario F. Triola. and the Triola Statistics Series

Lecture Slides. Elementary Statistics. by Mario F. Triola. and the Triola Statistics Series Lecture Slides Elementary Statistics Tenth Edition and the Triola Statistics Series by Mario F. Triola Slide 1 Chapter 13 Nonparametric Statistics 13-1 Overview 13-2 Sign Test 13-3 Wilcoxon Signed-Ranks

More information

Lecture Slides. Section 13-1 Overview. Elementary Statistics Tenth Edition. Chapter 13 Nonparametric Statistics. by Mario F.

Lecture Slides. Section 13-1 Overview. Elementary Statistics Tenth Edition. Chapter 13 Nonparametric Statistics. by Mario F. Lecture Slides Elementary Statistics Tenth Edition and the Triola Statistics Series by Mario F. Triola Slide 1 Chapter 13 Nonparametric Statistics 13-1 Overview 13-2 Sign Test 13-3 Wilcoxon Signed-Ranks

More information

Topic 23: Diagnostics and Remedies

Topic 23: Diagnostics and Remedies Topic 23: Diagnostics and Remedies Outline Diagnostics residual checks ANOVA remedial measures Diagnostics Overview We will take the diagnostics and remedial measures that we learned for regression and

More information

Nonparametric tests. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 704: Data Analysis I

Nonparametric tests. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 704: Data Analysis I 1 / 16 Nonparametric tests Timothy Hanson Department of Statistics, University of South Carolina Stat 704: Data Analysis I Nonparametric one and two-sample tests 2 / 16 If data do not come from a normal

More information

LOOKING FOR RELATIONSHIPS

LOOKING FOR RELATIONSHIPS LOOKING FOR RELATIONSHIPS One of most common types of investigation we do is to look for relationships between variables. Variables may be nominal (categorical), for example looking at the effect of an

More information

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College 1-Way ANOVA MATH 143 Department of Mathematics and Statistics Calvin College Spring 2010 The basic ANOVA situation Two variables: 1 Categorical, 1 Quantitative Main Question: Do the (means of) the quantitative

More information

Analysis of variance (ANOVA) Comparing the means of more than two groups

Analysis of variance (ANOVA) Comparing the means of more than two groups Analysis of variance (ANOVA) Comparing the means of more than two groups Example: Cost of mating in male fruit flies Drosophila Treatments: place males with and without unmated (virgin) females Five treatments

More information

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij =

K. Model Diagnostics. residuals ˆɛ ij = Y ij ˆµ i N = Y ij Ȳ i semi-studentized residuals ω ij = ˆɛ ij. studentized deleted residuals ɛ ij = K. Model Diagnostics We ve already seen how to check model assumptions prior to fitting a one-way ANOVA. Diagnostics carried out after model fitting by using residuals are more informative for assessing

More information

Unit 1 Review of BIOSTATS 540 Practice Problems SOLUTIONS - Stata Users

Unit 1 Review of BIOSTATS 540 Practice Problems SOLUTIONS - Stata Users BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 1 of 16 Unit 1 Review of BIOSTATS 540 Practice Problems SOLUTIONS - Stata Users #1. The following table lists length of

More information

Dover- Sherborn High School Mathematics Curriculum Probability and Statistics

Dover- Sherborn High School Mathematics Curriculum Probability and Statistics Mathematics Curriculum A. DESCRIPTION This is a full year courses designed to introduce students to the basic elements of statistics and probability. Emphasis is placed on understanding terminology and

More information

Correlation and Regression

Correlation and Regression Correlation and Regression Dr. Bob Gee Dean Scott Bonney Professor William G. Journigan American Meridian University 1 Learning Objectives Upon successful completion of this module, the student should

More information

Wilcoxon Test and Calculating Sample Sizes

Wilcoxon Test and Calculating Sample Sizes Wilcoxon Test and Calculating Sample Sizes Dan Spencer UC Santa Cruz Dan Spencer (UC Santa Cruz) Wilcoxon Test and Calculating Sample Sizes 1 / 33 Differences in the Means of Two Independent Groups When

More information

I i=1 1 I(J 1) j=1 (Y ij Ȳi ) 2. j=1 (Y j Ȳ )2 ] = 2n( is the two-sample t-test statistic.

I i=1 1 I(J 1) j=1 (Y ij Ȳi ) 2. j=1 (Y j Ȳ )2 ] = 2n( is the two-sample t-test statistic. Serik Sagitov, Chalmers and GU, February, 08 Solutions chapter Matlab commands: x = data matrix boxplot(x) anova(x) anova(x) Problem.3 Consider one-way ANOVA test statistic For I = and = n, put F = MS

More information

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics

DETAILED CONTENTS PART I INTRODUCTION AND DESCRIPTIVE STATISTICS. 1. Introduction to Statistics DETAILED CONTENTS About the Author Preface to the Instructor To the Student How to Use SPSS With This Book PART I INTRODUCTION AND DESCRIPTIVE STATISTICS 1. Introduction to Statistics 1.1 Descriptive and

More information

Lectures 5 & 6: Hypothesis Testing

Lectures 5 & 6: Hypothesis Testing Lectures 5 & 6: Hypothesis Testing in which you learn to apply the concept of statistical significance to OLS estimates, learn the concept of t values, how to use them in regression work and come across

More information

Pooled Variance t Test

Pooled Variance t Test Pooled Variance t Test Tests means of independent populations having equal variances Parametric test procedure Assumptions Both populations are normally distributed If not normal, can be approximated by

More information

SPSS LAB FILE 1

SPSS LAB FILE  1 SPSS LAB FILE www.mcdtu.wordpress.com 1 www.mcdtu.wordpress.com 2 www.mcdtu.wordpress.com 3 OBJECTIVE 1: Transporation of Data Set to SPSS Editor INPUTS: Files: group1.xlsx, group1.txt PROCEDURE FOLLOWED:

More information

HYPOTHESIS TESTING II TESTS ON MEANS. Sorana D. Bolboacă

HYPOTHESIS TESTING II TESTS ON MEANS. Sorana D. Bolboacă HYPOTHESIS TESTING II TESTS ON MEANS Sorana D. Bolboacă OBJECTIVES Significance value vs p value Parametric vs non parametric tests Tests on means: 1 Dec 14 2 SIGNIFICANCE LEVEL VS. p VALUE Materials and

More information

Module 9: Nonparametric Statistics Statistics (OA3102)

Module 9: Nonparametric Statistics Statistics (OA3102) Module 9: Nonparametric Statistics Statistics (OA3102) Professor Ron Fricker Naval Postgraduate School Monterey, California Reading assignment: WM&S chapter 15.1-15.6 Revision: 3-12 1 Goals for this Lecture

More information

22s:152 Applied Linear Regression. Take random samples from each of m populations.

22s:152 Applied Linear Regression. Take random samples from each of m populations. 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

Box-Cox Transformations

Box-Cox Transformations Box-Cox Transformations Revised: 10/10/2017 Summary... 1 Data Input... 3 Analysis Summary... 3 Analysis Options... 5 Plot of Fitted Model... 6 MSE Comparison Plot... 8 MSE Comparison Table... 9 Skewness

More information

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651 Data Analysis and Statistical Methods Statistics 65 http://www.stat.tamu.edu/~suhasini/teaching.html Suhasini Subba Rao Review In the previous lecture we considered the following tests: The independent

More information

6 Single Sample Methods for a Location Parameter

6 Single Sample Methods for a Location Parameter 6 Single Sample Methods for a Location Parameter If there are serious departures from parametric test assumptions (e.g., normality or symmetry), nonparametric tests on a measure of central tendency (usually

More information

2.830J / 6.780J / ESD.63J Control of Manufacturing Processes (SMA 6303) Spring 2008

2.830J / 6.780J / ESD.63J Control of Manufacturing Processes (SMA 6303) Spring 2008 MIT OpenCourseWare http://ocw.mit.edu 2.830J / 6.780J / ESD.63J Control of Processes (SMA 6303) Spring 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Non-parametric (Distribution-free) approaches p188 CN

Non-parametric (Distribution-free) approaches p188 CN Week 1: Introduction to some nonparametric and computer intensive (re-sampling) approaches: the sign test, Wilcoxon tests and multi-sample extensions, Spearman s rank correlation; the Bootstrap. (ch14

More information

Outline. Introduction to SpaceStat and ESTDA. ESTDA & SpaceStat. Learning Objectives. Space-Time Intelligence System. Space-Time Intelligence System

Outline. Introduction to SpaceStat and ESTDA. ESTDA & SpaceStat. Learning Objectives. Space-Time Intelligence System. Space-Time Intelligence System Outline I Data Preparation Introduction to SpaceStat and ESTDA II Introduction to ESTDA and SpaceStat III Introduction to time-dynamic regression ESTDA ESTDA & SpaceStat Learning Objectives Activities

More information

22s:152 Applied Linear Regression. There are a couple commonly used models for a one-way ANOVA with m groups. Chapter 8: ANOVA

22s:152 Applied Linear Regression. There are a couple commonly used models for a one-way ANOVA with m groups. Chapter 8: ANOVA 22s:152 Applied Linear Regression Chapter 8: ANOVA NOTE: We will meet in the lab on Monday October 10. One-way ANOVA Focuses on testing for differences among group means. Take random samples from each

More information

Chapter 7 Comparison of two independent samples

Chapter 7 Comparison of two independent samples Chapter 7 Comparison of two independent samples 7.1 Introduction Population 1 µ σ 1 1 N 1 Sample 1 y s 1 1 n 1 Population µ σ N Sample y s n 1, : population means 1, : population standard deviations N

More information

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami

Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric versus Nonparametric Statistics-when to use them and which is more powerful? Dr Mahmoud Alhussami Parametric Assumptions The observations must be independent. Dependent variable should be continuous

More information

Essential Statistics Chapter 6

Essential Statistics Chapter 6 1 Essential Statistics Chapter 6 By Navidi and Monk Copyright 2016 Mark A. Thomas. All rights reserved. 2 Continuous Probability Distributions chapter 5 focused upon discrete probability distributions,

More information

Ch. 16: Correlation and Regression

Ch. 16: Correlation and Regression Ch. 1: Correlation and Regression With the shift to correlational analyses, we change the very nature of the question we are asking of our data. Heretofore, we were asking if a difference was likely to

More information

Lecture 18: Simple Linear Regression

Lecture 18: Simple Linear Regression Lecture 18: Simple Linear Regression BIOS 553 Department of Biostatistics University of Michigan Fall 2004 The Correlation Coefficient: r The correlation coefficient (r) is a number that measures the strength

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

Chapter 15: Nonparametric Statistics Section 15.1: An Overview of Nonparametric Statistics

Chapter 15: Nonparametric Statistics Section 15.1: An Overview of Nonparametric Statistics Section 15.1: An Overview of Nonparametric Statistics Understand Difference between Parametric and Nonparametric Statistical Procedures Parametric statistical procedures inferential procedures that rely

More information

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES

BIOL Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES BIOL 458 - Biometry LAB 6 - SINGLE FACTOR ANOVA and MULTIPLE COMPARISON PROCEDURES PART 1: INTRODUCTION TO ANOVA Purpose of ANOVA Analysis of Variance (ANOVA) is an extremely useful statistical method

More information

Rank-Based Methods. Lukas Meier

Rank-Based Methods. Lukas Meier Rank-Based Methods Lukas Meier 20.01.2014 Introduction Up to now we basically always used a parametric family, like the normal distribution N (µ, σ 2 ) for modeling random data. Based on observed data

More information

Intuitive Biostatistics: Choosing a statistical test

Intuitive Biostatistics: Choosing a statistical test pagina 1 van 5 < BACK Intuitive Biostatistics: Choosing a statistical This is chapter 37 of Intuitive Biostatistics (ISBN 0-19-508607-4) by Harvey Motulsky. Copyright 1995 by Oxfd University Press Inc.

More information

4.1. Introduction: Comparing Means

4.1. Introduction: Comparing Means 4. Analysis of Variance (ANOVA) 4.1. Introduction: Comparing Means Consider the problem of testing H 0 : µ 1 = µ 2 against H 1 : µ 1 µ 2 in two independent samples of two different populations of possibly

More information

Topic 1. Definitions

Topic 1. Definitions S Topic. Definitions. Scalar A scalar is a number. 2. Vector A vector is a column of numbers. 3. Linear combination A scalar times a vector plus a scalar times a vector, plus a scalar times a vector...

More information

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur

Fundamentals to Biostatistics. Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Fundamentals to Biostatistics Prof. Chandan Chakraborty Associate Professor School of Medical Science & Technology IIT Kharagpur Statistics collection, analysis, interpretation of data development of new

More information

1 A Review of Correlation and Regression

1 A Review of Correlation and Regression 1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then

More information

Data are sometimes not compatible with the assumptions of parametric statistical tests (i.e. t-test, regression, ANOVA)

Data are sometimes not compatible with the assumptions of parametric statistical tests (i.e. t-test, regression, ANOVA) BSTT523 Pagano & Gauvreau Chapter 13 1 Nonparametric Statistics Data are sometimes not compatible with the assumptions of parametric statistical tests (i.e. t-test, regression, ANOVA) In particular, data

More information

Mathematics for Economics MA course

Mathematics for Economics MA course Mathematics for Economics MA course Simple Linear Regression Dr. Seetha Bandara Simple Regression Simple linear regression is a statistical method that allows us to summarize and study relationships between

More information
