Frequency table: Var2 (Spreadsheet1) Count Cumulative Percent Cumulative From To. Percent <x<=


Morris O’Neal’
6 years ago



A frequency distribution is a kind of probability distribution. It gives the frequency or relative frequency at which given values have been observed among the data collected. For example, for age:

[Table: Frequency table: Var2 (Spreadsheet1). Rows are age intervals From < x <= To plus a final Missing row; columns are Count, Cumulative Count, Percent and Cumulative Percent. Cell values not recoverable.]

From the above frequency distribution we can determine, for example:

- the probability of a patient falling between two given ages, which is the Percent entry for that interval;
- the probability of a patient being less than or equal to 30 years, which is the Cumulative Percent entry for the interval ending at 30.

A random variable is a quantity that can assume a number of different values such that any particular outcome is determined by chance.

Discrete random variables take on isolated values, e.g. gender (male, female), or the number of bacteria in a slide or in a volume of suspension.

Continuous random variables take on any value within a specified interval, e.g. weight, height or serum cholesterol.

A probability distribution describes the behaviour of a random variable. For discrete variables, it specifies all possible outcomes along with the probability that each will occur. For continuous variables, a probability distribution specifies the probability associated with specified ranges of values.
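The columns of the frequency table described earlier in this section (count, cumulative count, percent, cumulative percent) can be sketched in a few lines of Python. This is only an illustration: the ages and interval edges below are invented, and the helper name `frequency_table` is hypothetical, not something from the original notes.

```python
def frequency_table(values, edges):
    """Return one row per interval low < x <= high:
    (low, high, count, cumulative count, percent, cumulative percent)."""
    n = len(values)
    rows = []
    cumulative = 0
    for low, high in zip(edges[:-1], edges[1:]):
        count = sum(1 for v in values if low < v <= high)
        cumulative += count
        rows.append((low, high, count, cumulative,
                     100.0 * count / n, 100.0 * cumulative / n))
    return rows


# Hypothetical ages, invented for illustration only.
ages = [12, 18, 23, 25, 29, 30, 34, 38, 41, 47]
table = frequency_table(ages, [10, 20, 30, 40, 50])

for low, high, count, cum, pct, cum_pct in table:
    print(f"{low:>3} < x <= {high:<3} {count:>5} {cum:>5} {pct:>8.2f} {cum_pct:>8.2f}")
```

The Percent column estimates the probability of falling in one interval, and the Cumulative Percent column estimates the probability of being at or below the upper limit of that interval.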

Three theoretical probability distributions

1. The binomial distribution

Consider a random variable that can take on one of two possible values, where one of the values can be viewed as a success and the other as a failure, e.g. Y = the presence/absence of disease. Y is a Bernoulli random variable.

Let X = the number of successes out of n trials, i.e. the number of patients who are diseased out of n patients examined. The probability distribution for X is the binomial distribution, from which we can determine the following expression for the probability that X takes on a specific value x:

$$P(X = x) = \binom{n}{x}\, p^{x} (1 - p)^{n - x}$$

where p = probability of disease.

[Figure: binomial probability distributions.]

Application of the binomial probability distribution (from Pagano & Gauvreau)

Suppose we are interested in investigating the probability that a patient who has been stuck with a needle infected with hepatitis B actually develops the disease. Let Y = the disease status: Y = 1 if the individual develops hepatitis, Y = 0 if not. If 30% of the patients who are exposed to hepatitis B become infected, then P(Y = 1) = 0.30 and P(Y = 0) = 1 - 0.30 = 0.70.

Suppose we select 5 individuals from the population of patients who have been stuck with a needle infected with hepatitis B. Then X = the number of patients who develop the disease is a binomial random variable with n = 5 and p = 0.30. So we can calculate the probability that X assumes a given value x as follows:

$$P(X = x) = \binom{5}{x} (0.30)^{x} (1 - 0.30)^{5 - x}$$

In addition, the mean number of people who develop the disease in repeated samples of size 5 is $np = 5(0.30) = 1.5$ and the standard deviation is $\sqrt{np(1 - p)} = \sqrt{5(0.30)(0.70)} \approx 1.02$.

2. The Poisson distribution

- Used to model discrete events that occur infrequently in time or space.

More specifically, let X = the number of occurrences of some event of interest over a given interval, and let λ = the average number of occurrences of the event in the interval. Then

$$P(X = x) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$

The Poisson distribution is used extensively in bacteriology. In this case we are not counting the number of events in a fixed time interval or fixed one-dimensional length, but rather the number of bacteria in a two-dimensional microscopic slide of given area, or the number of bacteria in a fluid suspension of given volume.
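Returning to the hepatitis B example above (n = 5, p = 0.30), the binomial probabilities, the mean and the standard deviation can be checked with a short Python sketch. The function name `binomial_pmf` is illustrative; only the values of n and p come from the notes.

```python
from math import comb, sqrt


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


n, p = 5, 0.30  # values from the hepatitis B example
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

for x, prob in enumerate(probs):
    print(f"P(X = {x}) = {prob:.4f}")

mean = n * p                 # np
sd = sqrt(n * p * (1 - p))   # sqrt(np(1 - p))
print(f"mean = {mean:.2f}, standard deviation = {sd:.4f}")
```

The six probabilities sum to 1, as they must for a complete probability distribution over x = 0, ..., 5.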

[Figure: Poisson probability distributions.]

Example (from Armitage & Berry): distribution of counts of root nodule bacterium in a Petroff-Hausser counting chamber.

No. of bacteria per square    Number of squares (observed)
0                             34
1                             68
2                             112
3                             94
4                             55
5                             21
6                             12
7                             4
Total                         400

The mean number of organisms per square = (34×0 + 68×1 + 112×2 + 94×3 + 55×4 + 21×5 + 12×6 + 4×7)/400 = 999/400 ≈ 2.50.

Then use the Poisson probability function with λ = 2.5 to calculate the probability of a given number of bacteria, e.g.

$$P(\text{2 bacteria per square}) = P(X = 2) = \frac{e^{-2.5}\, 2.5^{2}}{2!} = 0.2565$$

To get the expected frequency, multiply this probability by the total number of squares: 400 × 0.2565 = 102.6.
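The expected-frequency calculation above can be repeated for every count with a small Python sketch. The observed counts are the ones appearing in the mean calculation; the counts for 2, 5 and 6 bacteria (112, 21, 12) are taken here as the values consistent with the stated total of 400 and mean of 2.50, which is an assumption about the garbled source. The function name `poisson_pmf` is illustrative.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson random variable with mean lam."""
    return exp(-lam) * lam**x / factorial(x)


# Observed number of squares for each count of bacteria per square.
# 112, 21 and 12 are assumed values (see the note above).
observed = {0: 34, 1: 68, 2: 112, 3: 94, 4: 55, 5: 21, 6: 12, 7: 4}

total = sum(observed.values())
mean = sum(k * squares for k, squares in observed.items()) / total
lam = round(mean, 1)  # the notes use lambda = 2.5

expected = {k: total * poisson_pmf(k, lam) for k in observed}

for k in observed:
    print(f"{k} bacteria: observed {observed[k]:>3}, expected {expected[k]:6.1f}")
```

Setting the observed and expected columns side by side is what allows the fit of the Poisson model to be judged.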

Example (from Armitage & Berry): distribution of counts of root nodule bacterium in a Petroff-Hausser counting chamber, with the observed number of squares set beside the number expected under a Poisson distribution with mean 2.5.

[Table: No. of bacteria per square; Number of squares, Observed and Expected. Expected values not recoverable.]

We note that the observed and expected frequencies correspond quite well, which is consistent with the bacteria counts following a Poisson distribution.

3. The normal distribution

- Used to model continuous variables like height, weight, serum cholesterol level, etc.
- Based on the Central Limit Theorem, which states that any variable that can be viewed as a sum of a large number of independent random increments will approximately follow a normal distribution.
- It is bell-shaped and symmetrical.
- It is characterised by its mean (µ) and standard deviation (σ).

[Figure: density plot of Cmax_S.]

A normal distribution with mean = 0 and standard deviation = 1 is called the standard normal distribution.

[Figure: probability density function of the standard normal distribution, y = normal(x, 0, 1).]

We know that 95% of the values of a variable that follows the N(0; 1) distribution lie between $-1.96$ and $1.96$.

We will be using these theoretical probability distributions to construct confidence intervals for population parameters. For the sample mean,

$$\bar{x} \sim N(\mu;\ \sigma/\sqrt{n}) \;\Rightarrow\; \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \sim N(0;\ 1)$$

However, most of the time we do not know σ and estimate it using s. In that case,

$$\frac{\bar{x} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

where $t_{n-1}$ is the Student's t-distribution with n - 1 degrees of freedom.
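The 95% statement for the standard normal distribution, and the standardisation of a sample mean when σ is known, can be checked with Python's standard library. The population values in the second half (µ = 100, σ = 15, n = 25, sample mean 106) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability that a N(0; 1) variable lies between -1.96 and 1.96.
coverage = std_normal.cdf(1.96) - std_normal.cdf(-1.96)

# Cut-off that leaves 2.5% in the upper tail.
cutoff = std_normal.inv_cdf(0.975)

print(f"P(-1.96 < Z < 1.96) = {coverage:.4f}")
print(f"97.5th percentile   = {cutoff:.4f}")

# Standardising a sample mean (hypothetical values, sigma assumed known).
mu, sigma, n = 100.0, 15.0, 25
xbar = 106.0
z = (xbar - mu) / (sigma / sqrt(n))
print(f"z = {z:.2f}")
```

When σ is replaced by the sample estimate s, the same ratio is referred to the t-distribution with n - 1 degrees of freedom instead, which the standard library does not provide.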

STATISTICAL INFERENCE

In statistics we wish to make statements about the true values and relationships for a complete population. However, it is not practical to measure the entire population. So we take random samples that are representative of the population and use the information contained in these random samples to estimate the values or relationships for the population.

If we repeat our sampling and draw another random sample from the population, we will come up with slightly different values of the measures and relationships that we are estimating. So our statistics are subject to uncertainty. We can summarize the different values that our statistics can take on, and the frequency with which they are likely to take on these values, with a probability distribution. Some of the more common statistics have known probability distributions.

Population parameter        Sample statistic    Probability distribution
Mean, µ                     $\bar{x}$           Normal
Proportion, π               p                   Binomial
Variance, σ²                s²                  Chi-square
Ratio of two variances                          F-distribution

Sample statistics are point estimates of the population parameters. Every time we draw a different random sample, we'll get a slightly different point estimate. Hence we change the point estimate to an interval estimate that will give us two cut-off points between which we are 95% certain our true population parameter will fall. We call these interval estimates CONFIDENCE INTERVALS.

The general form of a confidence interval is

(statistic - percentile × std.err; statistic + percentile × std.err)

where

- percentile = the cut-off value that demarcates the required percentile of the probability distribution; e.g. for the normal distribution, we know that 95% of the values lie between $-1.96$ and $+1.96$, so the percentile for a 95% confidence interval is 1.96;
- std.err = an estimate of the variability of the statistic, i.e. if you had to draw many random samples and calculate a sample statistic each time, how much these statistics would vary among one another.

Note: for the sample mean, std.err = $\sigma/\sqrt{n}$, i.e. the standard deviation divided by the square root of the sample size.

NOTE the difference between a standard deviation and a standard error: a standard deviation tells you how variable the data values are. A standard error measures the variability of the statistic; it is a function of the variability of the data and the sample size.

[Figure: density plot of Cmax_A.]

. ci Cmax_A

[Stata output: Variable, Obs, Mean, Std. Err. and [95% Conf. Interval] for Cmax_A. Values not recoverable.]

ci = mean ± $t_{n-1}$ × std.err

where $t_{n-1}$ refers to the Student t-distribution with n - 1 degrees of freedom.
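The general form (statistic ± percentile × std.err) can be sketched in Python as follows. This sketch uses the normal percentile (1.96 for 95%), which is the example the text gives; for a small sample the Stata `ci` command shown above uses the slightly larger Student t percentile, which the Python standard library does not supply. The data values and the function name `mean_confidence_interval` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(data, level=0.95):
    """Normal-approximation confidence interval for the mean:
    (mean - percentile * std.err, mean + percentile * std.err)."""
    n = len(data)
    xbar = mean(data)
    std_err = stdev(data) / sqrt(n)             # standard error of the mean
    percentile = NormalDist().inv_cdf(0.5 + level / 2)  # 1.96 for level = 0.95
    return xbar - percentile * std_err, xbar + percentile * std_err


# Hypothetical measurements, invented for illustration only.
values = [4.1, 5.0, 3.8, 4.6, 5.3, 4.4, 4.9, 4.2]

low, high = mean_confidence_interval(values)
print(f"mean = {mean(values):.3f}, 95% CI = ({low:.3f}; {high:.3f})")
```

Note that the interval narrows as n grows, because the standard error shrinks with the square root of the sample size while the standard deviation of the data does not.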

Hypothesis testing (from Fisher & Van Belle, pg 06)

In estimation, we start with a sample statistic and make a statement about the population parameter: a confidence interval makes a probabilistic statement about straddling the population parameter. In hypothesis testing, we start by assuming a value for a parameter, and a probability statement is made about the value of the corresponding statistic.

The basic strategy in hypothesis testing is to measure how far an observed statistic is from a hypothesized value of the parameter, or how likely an observed value for a statistic is given the hypothesized value of the parameter.

[Figure: Student's t probability density function, marking the hypothesized value, the observed value of the statistic and the p-value.]

p-value = the probability, under the null hypothesis, of observing a value as unlikely as or more unlikely than the value of the test statistic.

Hypothesis testing procedure:

- $H_0$ = a null hypothesis that specifies a real value for a parameter (usually what you wish to reject, and a statement of equivalence).
- $H_A$ = an alternative hypothesis that specifies a range of values for a parameter which will be considered when the null hypothesis is rejected.
- A test statistic derived under the null hypothesis.
- A p-value that states the probability of observing such a value for the test statistic, or one more extreme, given that the null hypothesis is true.
- A decision to reject or not reject the null hypothesis.

Decision Rules and Type I and II errors

$H_0$: µ = 30

. ttest weight=30

One-sample t test

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

[Stata output: Variable, Obs, Mean, Std. Err., Std. Dev. and [95% Conf. Interval] for weight; Degrees of freedom: 7; Ho: mean(weight) = 30; the t statistic with its p-value under each of Ha: mean < 30 (P < t), Ha: mean != 30 (P > |t|) and Ha: mean > 30 (P > t). Values not recoverable.]

[Figure: three Student's t probability density functions, one for each alternative hypothesis.]

Always use a 2-sided $H_A$ unless there are good a priori reasons to motivate a 1-sided $H_A$.
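The one-sample t statistic and its two-sided p-value can be sketched in Python without any external package. Since the standard library has no Student t-distribution, the p-value below is obtained by integrating the t density numerically with Simpson's rule; that numerical route is this sketch's own choice, not how Stata computes it. The weights and the function names are hypothetical; only the hypothesized mean of 30 comes from the notes.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_two_sided_p(t, df, steps=2000):
    """P(|T| > |t|), integrating the density over [0, |t|] by Simpson's rule.
    steps must be even."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3          # P(0 < T < |t|)
    return 1 - 2 * central


def one_sample_t(data, mu0):
    """Return (t, degrees of freedom, two-sided p-value) for Ho: mean = mu0."""
    n = len(data)
    t = (mean(data) - mu0) / (stdev(data) / sqrt(n))
    return t, n - 1, t_two_sided_p(t, n - 1)


# Hypothetical weights, invented for illustration only.
weights = [31.2, 29.5, 33.1, 30.8, 32.4, 28.9, 31.7, 30.4]
t, df, p = one_sample_t(weights, 30)
print(f"t = {t:.4f}, df = {df}, two-sided p = {p:.4f}, one-sided p = {p / 2:.4f}")
```

For a symmetric distribution such as Student's t, the one-sided p-value in the direction of the observed difference is half the two-sided one, which is why a 1-sided alternative makes rejection easier and needs prior justification.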

Comparing two means

If the data follow a normal distribution, use the t-test. The t-test has different forms depending on whether the variances in the two groups are equal or not. To compare the variances, use the F-test:

. sdtest Cmax_A, by(trt)

[Stata output: variance ratio test. Obs, Mean, Std. Err., Std. Dev. and [95% Conf. Interval] for trta, trtb and combined; Ho: sd(trta) = sd(trtb); the observed F ratio with its lower and upper tail values; p-values under Ha: sd(trta) < sd(trtb), Ha: sd(trta) != sd(trtb) and Ha: sd(trta) > sd(trtb). Values not recoverable.]

$$F_{n_1 - 1;\ n_2 - 1} = \frac{s_1^2}{s_2^2}$$

In this example the observed ratio is smaller than the cut-off value of the F-distribution, so equal variances are not rejected and the t-test with equal variances is used:

. ttest Cmax_A, by(trt)

[Stata output: two-sample t test with equal variances. Obs, Mean, Std. Err., Std. Dev. and [95% Conf. Interval] for trta, trtb, combined and diff; Ho: mean(trta) - mean(trtb) = diff = 0; the t statistic with its p-value under Ha: diff < 0, Ha: diff != 0 and Ha: diff > 0. Values not recoverable.]

$$t_{n_1 + n_2 - 2} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

where s is the pooled standard deviation of the two groups.

If the variances were not equal, we should use the t-test for unequal variances:

. ttest Cmax_A, by(trt) unequal

[Stata output: two-sample t test with unequal variances. Same layout as above, with Satterthwaite's degrees of freedom. Values not recoverable.]

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
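The three statistics used in this comparison (the variance ratio, the equal-variance t and the unequal-variance t with Satterthwaite's degrees of freedom) can be computed by hand in Python as a sketch. The Cmax values for the two treatments are invented, and the function name `two_sample_stats` is hypothetical; the null hypothesis is taken as µ1 - µ2 = 0, as in the Stata output.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_stats(a, b):
    """Variance ratio and both forms of the two-sample t statistic for Ho: mu1 - mu2 = 0."""
    n1, n2 = len(a), len(b)
    m1, m2 = mean(a), mean(b)
    v1, v2 = variance(a), variance(b)          # sample variances s1^2 and s2^2

    f_ratio = v1 / v2                          # F with (n1 - 1; n2 - 1) degrees of freedom

    # Equal variances: pooled variance, n1 + n2 - 2 degrees of freedom.
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t_equal = (m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Unequal variances: separate variances, Satterthwaite's degrees of freedom.
    se_sq = v1 / n1 + v2 / n2
    t_unequal = (m1 - m2) / sqrt(se_sq)
    df_satterthwaite = se_sq**2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return f_ratio, t_equal, n1 + n2 - 2, t_unequal, df_satterthwaite


# Hypothetical Cmax values for two treatments, invented for illustration only.
trt_a = [5.1, 6.3, 4.8, 7.0, 5.9, 6.1]
trt_b = [4.2, 5.5, 6.8, 3.9, 5.0, 4.7]

f_ratio, t_eq, df_eq, t_uneq, df_uneq = two_sample_stats(trt_a, trt_b)
print(f"F = {f_ratio:.3f}")
print(f"equal variances:   t = {t_eq:.4f}, df = {df_eq}")
print(f"unequal variances: t = {t_uneq:.4f}, df = {df_uneq:.3f}")
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ, which is consistent with the two nearly identical t values reported in the output above.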

If the data do not follow a normal distribution, use the non-parametric Mann-Whitney U-test:

. ranksum Cmax_A, by(trt)

[Stata output: two-sample Wilcoxon rank-sum (Mann-Whitney) test. obs, rank sum and expected for trta, trtb and combined; unadjusted variance, adjustment for ties and adjusted variance; Ho: Cmax_A(trt==trtA) = Cmax_A(trt==trtB); z and Prob > |z|. Values not recoverable.]

Confidence intervals and hypothesis testing

Recall that to construct a confidence interval for the mean, we use

$$\left(\bar{x} - t_{n-1,\,1-\alpha/2} \times \text{std.err};\ \ \bar{x} + t_{n-1,\,1-\alpha/2} \times \text{std.err}\right)$$

where $t_{n-1,\,1-\alpha/2}$ is the cut-off of the Student t-distribution with n - 1 degrees of freedom that leaves α/2 in each tail. And to test whether the mean differs from zero, we use the test statistic

$$t = \frac{\bar{x} - 0}{s/\sqrt{n}} = \frac{\bar{x}}{\text{std.err}}$$

The two procedures are thus related, and in fact it can be shown that if the 100(1 - α)% confidence interval includes zero, then the hypothesis test will not be significant at the α level of significance (e.g. a 95% interval and the 5% level). On the other hand, if the confidence interval lies entirely to the right or left of zero, it corresponds to a significant result.
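The quantities that the rank-sum output at the start of this section reports (rank sum, expected rank sum, unadjusted variance and z) can be sketched in Python as below. This sketch assumes there are no tied values, so it does not compute the adjustment for ties that Stata applies; the Cmax values and the function name `rank_sum_z` are hypothetical.

```python
from math import sqrt


def rank_sum_z(a, b):
    """Wilcoxon rank-sum statistic for group a with its normal approximation.
    Assumes no tied values (no tie adjustment of the variance)."""
    combined = sorted(a + b)
    if len(set(combined)) != len(combined):
        raise ValueError("tied values are not handled in this sketch")
    rank = {value: i + 1 for i, value in enumerate(combined)}

    n1, n2 = len(a), len(b)
    rank_sum = sum(rank[value] for value in a)
    expected = n1 * (n1 + n2 + 1) / 2           # expected rank sum under Ho
    var = n1 * n2 * (n1 + n2 + 1) / 12          # unadjusted variance
    z = (rank_sum - expected) / sqrt(var)
    return rank_sum, expected, var, z


# Hypothetical Cmax values for two treatments, invented for illustration only.
trt_a = [5.1, 6.3, 4.8, 7.0, 5.9, 6.1]
trt_b = [4.2, 5.5, 6.8, 3.9, 5.0, 4.7]

rank_sum, expected, var, z = rank_sum_z(trt_a, trt_b)
print(f"rank sum = {rank_sum}, expected = {expected}, variance = {var}, z = {z:.3f}")
```

Because only the ranks enter the calculation, the result does not depend on the data following a normal distribution, which is the reason the test is preferred when that assumption is doubtful.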
