STATISTICS 4 Summary Notes

1. Geometric and Exponential Distributions

GEOMETRIC (discrete)

A discrete random variable X counts the number of trials needed for an event (a "success") to occur for the first time.

P(X = x) = (1 - p)^(x-1) p,  x = 1, 2, 3, ...  where 0 < p < 1     GIVEN IN FORMULA BOOK

Conditions/assumptions:
- there is a sequence of independent trials
- only 2 outcomes, success and failure
- constant probability p of success at each trial

For P(X > a) or P(X >= a), use the sum of a GP to evaluate: S(infinity) = a / (1 - r)

Example: X ~ Geo(0.25). Find P(X > 5).
P(X > 5) = P(X = 6) + P(X = 7) + P(X = 8) + ...
         = 0.25 x 0.75^5 + 0.25 x 0.75^6 + 0.25 x 0.75^7 + ...
GP with first term a = 0.25 x 0.75^5 and common ratio r = 0.75, so
P(X > 5) = (0.25 x 0.75^5) / (1 - 0.75) = 0.75^5 = 0.237 (3 s.f.)

Mean of Geo(p) = 1/p        Variance of Geo(p) = (1 - p)/p^2 = mean x (mean - 1)
MAKE SURE YOU CAN WRITE OUT FULLY THE PROOFS FOR THESE!!!

EXPONENTIAL (continuous)

Models intervals of time between events occurring. Related to the Poisson: if the number of events occurring in a given period of time is Poisson, then the time between successive events is exponential. There is a constant probability of an event occurring per unit of time.

f(x) = lambda e^(-lambda x) for x >= 0, and f(x) = 0 for x < 0     GIVEN IN FORMULA BOOK

lambda is the average number of events per unit time, so the average time between events is 1/lambda.

MEAN = E(X) = 1/lambda      VARIANCE = 1/lambda^2     (can be shown using integration by parts)

For P(X <= b) or P(X >= b) use the distribution function:
P(X <= b) = integral from 0 to b of lambda e^(-lambda x) dx = [-e^(-lambda x)] from 0 to b = 1 - e^(-lambda b)
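The GP shortcut for the geometric tail, and the exponential distribution function, can both be checked numerically. A minimal sketch (the parameter values p = 0.25 and lambda = 0.5 are illustrative, not from any exam question):

```python
import math

p = 0.25  # X ~ Geo(0.25), as in the worked example

# Tail probability by summing the PMF term by term: P(X > 5) = P(X=6) + P(X=7) + ...
tail = sum((1 - p) ** (x - 1) * p for x in range(6, 500))

# Closed form from the GP sum a/(1 - r) with a = p(1-p)^5, r = 1 - p,
# which simplifies to (1 - p)^5
closed_form = (1 - p) ** 5
print(round(tail, 6), round(closed_form, 6))  # 0.237305 0.237305

# Exponential check: P(X <= b) should equal 1 - e^(-lambda*b)
lam, b = 0.5, 3.0
n = 100_000
dx = b / n
# midpoint Riemann sum of f(x) = lambda*e^(-lambda*x) over [0, b]
integral = sum(lam * math.exp(-lam * (i + 0.5) * dx) * dx for i in range(n))
print(round(integral, 4), round(1 - math.exp(-lam * b), 4))  # 0.7769 0.7769
```

Summing the series directly and using the GP formula agree, which is the whole point of the S(infinity) shortcut.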
Important feature: the exponential is a memoryless distribution (important for conditional probability): P(X > s + t | X > s) = P(X > t). For example, the probability that we need to wait more than 10 more seconds for the first event to occur, given that it has not happened after waiting 30 seconds, is the same as the probability that we need to wait more than 10 seconds.

2. Estimation

MAKE SURE YOU LEARN THE PROOF for E(S^2) = sigma^2.

A statistic used to estimate the value of a parameter of a population is called an estimator. The most efficient estimator is the one which
- is unbiased: its expected value = the parameter it is estimating
- has the smallest variance.

Consistent estimator: if U is an unbiased estimator for an unknown parameter theta, then U is a consistent estimator for theta if Var(U) -> 0 as n -> infinity, where n is the size of the sample. (You may need to use the formulae for Σr, Σr^2, Σr^3 - all given in the formula booklet.)

Relative efficiency of estimator A to estimator B = (1/Var(A)) / (1/Var(B)) = Var(B)/Var(A)

Worked example: A random variable X has mean mu and variance 20. A random variable Y has mean mu and variance 10.

a) Given that aX + bY is an unbiased estimator of mu, show that a + b = 1.
E(aX + bY) = mu
aE(X) + bE(Y) = mu
mu a + mu b = mu, so a + b = 1

b) The variance of aX + bY is denoted by V. Express V in the form pa^2 + qa + r.
Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) = 20a^2 + 10b^2
             = 20a^2 + 10(1 - a)^2
             = 30a^2 - 20a + 10

c) Find the values of a and b such that V takes its minimum value.
Method 1 - differentiation: dV/da = 60a - 20 = 0, so a = 1/3 and b = 2/3.
Method 2 - completing the square: V = 30(a - 1/3)^2 - 30(1/3)^2 + 10 = 30(a - 1/3)^2 + 20/3, minimised when a = 1/3, so b = 2/3.

d) A single observation is taken on each of X and Y. The values observed are 10 and 6 respectively. Use the results from c) to estimate mu.
mu_hat = (1/3)(10) + (2/3)(6) = 22/3 ≈ 7.33
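The minimisation in part c) can be checked numerically. A quick sketch, assuming the variances in the worked example are 20 (for X) and 10 (for Y):

```python
# V(a) = Var(aX + (1-a)Y) = 20a^2 + 10(1-a)^2; check it is minimised at a = 1/3
def V(a):
    return 20 * a**2 + 10 * (1 - a) ** 2

# crude grid search over a in [0, 1]
best_a = min((i / 10_000 for i in range(10_001)), key=V)
print(round(best_a, 3))    # 0.333
print(round(V(1 / 3), 4))  # minimum value V = 20/3, i.e. 6.6667
```

The grid minimum lands at a ≈ 1/3, agreeing with both the differentiation and completing-the-square methods.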
Estimator of a Population Proportion (Binomial)

From a binomial population in which p, the proportion of successes, is unknown, a random sample of size n is taken. X is the number of successes and Ps is the proportion of successes in the sample: Ps = X/n.

Ps is an unbiased estimator for p, as
E(Ps) = E(X/n) = (1/n) E(X) = (1/n)(np) = p          [mean of a binomial: E(X) = np]

Var(Ps) = Var(X/n) = (1/n^2) Var(X) = (1/n^2)(np(1 - p)) = p(1 - p)/n     [variance of a binomial: Var(X) = np(1 - p)]

Pooled estimator of a population proportion

Sample I: size n1, sample proportion Ps1.   Sample II: size n2, sample proportion Ps2.

Unbiased estimator of the population proportion: p_hat = (n1 Ps1 + n2 Ps2)/(n1 + n2), since
E((n1 Ps1 + n2 Ps2)/(n1 + n2)) = (1/(n1 + n2)) [n1 E(Ps1) + n2 E(Ps2)]
                               = (1/(n1 + n2)) (n1 p + n2 p)
                               = p

Pooled estimators of mean and variance (needed for CIs and hypothesis testing)

Sample I: size n1, mean x̄1, variance s1^2.   Sample II: size n2, mean x̄2, variance s2^2.

Mean: x̄ = (n1 x̄1 + n2 x̄2)/(n1 + n2)

Pooled variance sp^2 (given in formula booklet):
- using sample variances (the sigma_n value on the calculator): sp^2 = (n1 s1^2 + n2 s2^2)/(n1 + n2 - 2)
- using unbiased estimators of the population variances: sp^2 = ((n1 - 1)S1^2 + (n2 - 1)S2^2)/(n1 + n2 - 2)
- using summary values: sp^2 = (Σ(x_i - x̄1)^2 + Σ(x_j - x̄2)^2)/(n1 + n2 - 2)
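The three forms of the pooled variance are algebraically the same thing, since n s^2 = (n - 1)S^2 = Σ(x - x̄)^2. A small check with two hypothetical samples (the values below are illustrative only):

```python
import statistics

# Two small samples (hypothetical values, for illustration only)
sample1 = [5.1, 4.8, 5.6, 5.0, 4.9]
sample2 = [6.2, 5.7, 6.0, 5.9]
n1, n2 = len(sample1), len(sample2)

# Unbiased estimates S^2 (divisor n - 1)
S1, S2 = statistics.variance(sample1), statistics.variance(sample2)

# Pooled variance from the unbiased estimators
sp2 = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

# Same result from the summary values sum (x - xbar)^2 for each sample
ss1 = sum((x - statistics.mean(sample1)) ** 2 for x in sample1)
ss2 = sum((x - statistics.mean(sample2)) ** 2 for x in sample2)
sp2_summary = (ss1 + ss2) / (n1 + n2 - 2)

print(round(sp2, 6) == round(sp2_summary, 6))  # True - the two forms agree
```

Note that `statistics.variance` uses the n - 1 divisor (the unbiased S^2), while `statistics.pvariance` uses n (the calculator's sigma_n^2) - matching the two variance forms above.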
3. Confidence Intervals

Interpretation of a 95% CI: different samples of size n lead to different values of the estimator and hence to different 95% confidence intervals. On average, 95% of these intervals will contain the true population value.

Difference between means

Assumptions:
- a normal distribution is stated or can be assumed
- unknown population variance (assumed common to both populations)
- small samples are used

x̄1 - x̄2 ± t_c sp sqrt(1/n1 + 1/n2)

where t_c comes from the t-tables with n1 + n2 - 2 degrees of freedom (for 95%, look up 0.975) and
sp^2 = ((n1 - 1)S1^2 + (n2 - 1)S2^2)/(n1 + n2 - 2)

If the confidence interval includes 0, we can say that we are 95% confident that there is no difference between the means of the two populations.

Population variance (or standard deviation)
- uses S^2, the unbiased estimate of the population variance (not sigma_n^2)
- uses chi-squared values chi^2_L (lower) and chi^2_U (upper), so for 95% use the 0.025 and 0.975 points
- n - 1 degrees of freedom

(n - 1)S^2 / chi^2_U < sigma^2 < (n - 1)S^2 / chi^2_L

Standard deviation confidence interval: work as for the variance but square root the final answers.

Ratio of two normal population variances
- uses S^2, the unbiased estimate of the population variance
- uses the F-distribution; you must get the degrees of freedom in the correct order

Sample X: size n_x, degrees of freedom v_x = n_x - 1.   Sample Y: size n_y, degrees of freedom v_y = n_y - 1.

If looking for a 90% confidence interval use p = 0.95 (5% at each tail, but use the upper limit to find the values of F). In F_{v_x, v_y}, v_x is the numerator and v_y the denominator degrees of freedom.

(S_x^2 / S_y^2) / F_{v_x, v_y} < sigma_x^2 / sigma_y^2 < (S_x^2 / S_y^2) F_{v_y, v_x}

If the confidence interval includes 1, it is reasonable to conclude that the two population variances are equal.
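The variance CI is just two divisions once the chi-squared points are read from tables. A sketch with hypothetical data (n = 10, S^2 = 4.2 are made-up values; the chi-squared points are the standard table values for 9 degrees of freedom):

```python
import math

# Hypothetical example: n = 10 observations, unbiased variance estimate S^2 = 4.2
n = 10
S2 = 4.2
# 0.025 and 0.975 points of chi-squared with 9 df, from standard tables
chi2_lower, chi2_upper = 2.700, 19.023

var_low = (n - 1) * S2 / chi2_upper   # note: upper chi^2 point gives the LOWER limit
var_high = (n - 1) * S2 / chi2_lower
print(round(var_low, 2), round(var_high, 2))  # 95% CI for sigma^2: 1.99 14.0

# For the standard deviation, square root the endpoints
print(round(math.sqrt(var_low), 2), round(math.sqrt(var_high), 2))  # 1.41 3.74
```

The interval is wide and asymmetric about S^2 = 4.2, which is typical for variance CIs from small samples.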
4. Hypothesis Testing

For each type of test:
- state the null hypothesis H0: mu_X = mu_Y and the alternative hypothesis, e.g. H1: mu_X > mu_Y (1-tail test)
- state the significance level and the distribution used
- determine the critical value/region (sketch a graph)
- calculate the appropriate test statistic
- conclude: accept H0, or reject H0 in favour of H1

MEANS

Difference between means: two small samples of sizes n1 and n2 have mean values x̄1 and x̄2.

Test statistic: t = (x̄1 - x̄2) / (sp sqrt(1/n1 + 1/n2)), where sp^2 is the pooled sample variance.

Distribution: t-tables with n1 + n2 - 2 degrees of freedom.

H1: mu_X ≠ mu_Y  (2-tailed test)     H1: mu_X < mu_Y  (1-tailed test)     H1: mu_X > mu_Y  (1-tailed test)

Assumptions:
- the two populations are normal
- the two populations have the same variance

Remember: for a 2-tailed test, divide the significance level by 2 when finding the critical value(s) of the rejection region.

Difference between matched pairs (paired samples)

If samples can be paired exactly, the differences between the pairs of values can be tested to see if they form a distribution with zero mean, assumed to be normal. As we don't know the population variance of these differences, use the t-distribution.

Test statistic: t = (d̄ - 0) / (s_d / sqrt(n)), where d̄ is the mean of the differences of the matched pairs and s_d^2 is the unbiased estimate of the variance of the differences.

Distribution: t-distribution with n - 1 degrees of freedom (n = number of pairs used).
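The two-sample t statistic combines the pooled variance from section 2 with the formula above. A sketch with hypothetical samples (the data values are made up for illustration):

```python
import math
import statistics

# Hypothetical small samples, assumed drawn from normal populations with equal variance
x = [10.2, 9.8, 11.1, 10.5, 9.9, 10.7]
y = [9.1, 9.6, 10.0, 9.3, 9.5]
n1, n2 = len(x), len(y)

S1, S2 = statistics.variance(x), statistics.variance(y)   # unbiased estimates S^2
sp2 = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)     # pooled variance

t = (statistics.mean(x) - statistics.mean(y)) / (
    math.sqrt(sp2) * math.sqrt(1 / n1 + 1 / n2)
)
df = n1 + n2 - 2
print(round(t, 3), df)  # compare |t| with the t-table critical value on 9 df
```

The computed t is then compared with the tabulated critical value (halving the significance level first if the test is 2-tailed).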
VARIANCES (for standard deviations, always work in terms of the variance)

Tests about a SINGLE population variance

H0: sigma^2 = sigma_0^2
1-tail test:  H1: sigma^2 > sigma_0^2  or  H1: sigma^2 < sigma_0^2
2-tailed test: H1: sigma^2 ≠ sigma_0^2

Test statistic: chi^2 = (n - 1)S^2 / sigma_0^2
Distribution: chi-squared with n - 1 degrees of freedom
Assumption: the population is approximately normal

Comparison of two population variances
- can be used to check that the variances are roughly the same (one of the assumptions needed to use the t-distribution when comparing means)
- uses the ratio of the two sample variances, compared to 1
- always have the larger variance as the numerator

Numerator: n1 - 1 degrees of freedom.   Denominator: n2 - 1 degrees of freedom.

1-TAILED TESTS (rare in an exam): H1: sigma_1^2 > sigma_2^2. Test statistic F = S1^2 / S2^2. Rejection region: F > F_{v1, v2}.

2-TAILED TESTS: H1: sigma_1^2 ≠ sigma_2^2. Test statistic F = S1^2 / S2^2 if S1^2 > S2^2, or F = S2^2 / S1^2 if S2^2 > S1^2. Rejection region as above, but remember to divide the significance level by 2.

Example: A scientist records lengths of worms in fields A and B.
Field A (cm): 11.9  9.8  10.5  10.8  9.5  11.3
Field B (cm): 12.3  13.4  10.2  13.6  14.2
Assuming that these are random samples from normal populations, test at the 5% significance level whether the population variances are equal.

H0: sigma_A^2 = sigma_B^2     H1: sigma_A^2 ≠ sigma_B^2
F distribution, 2-tailed test, so look up the 0.975 point in the tables.

S_A^2 = 0.815, n_A = 6, degrees of freedom = 5
S_B^2 = 2.488, n_B = 5, degrees of freedom = 4

Test statistic: F = 2.488 / 0.815 = 3.05 (larger variance as the numerator)
Critical value: F_{4,5}(0.975) = 7.39
As 3.05 < 7.39, there is no significant evidence at the 5% level to indicate that the variances are not equal: accept H0.
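The worm-length F test can be reproduced directly (using the field A and B data values as reconstructed here):

```python
import statistics

field_a = [11.9, 9.8, 10.5, 10.8, 9.5, 11.3]
field_b = [12.3, 13.4, 10.2, 13.6, 14.2]

Sa2 = statistics.variance(field_a)  # unbiased estimate S_A^2, about 0.815
Sb2 = statistics.variance(field_b)  # S_B^2, about 2.488

F = max(Sa2, Sb2) / min(Sa2, Sb2)   # larger variance as the numerator
print(round(Sa2, 3), round(Sb2, 3), round(F, 2))  # 0.815 2.488 3.05

# From tables, F_{4,5}(0.975) = 7.39; since 3.05 < 7.39, accept H0 at the 5% level
```

The critical value still has to come from tables (or statistical software); only the test statistic is computed here.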
5. Goodness of Fit (Chi-Squared)                                    www.mathsbox.org.uk

X^2 = Σ (O_i - E_i)^2 / E_i   (sum over the k groups)

Expected frequencies must be >= 5 (combine groups if necessary).

Degrees of freedom: if there are k groups (in your X^2 calculation) and p parameters are estimated from the data, then the number of degrees of freedom is k - p - 1.

The observed X^2 is compared with one-sided chi-squared tables. If X^2 is too high, we reject the hypothesis that this is the correct model for the distribution. The formula book contains the functions for these distributions.

Binomial: use k - 2 degrees of freedom if you have estimated p from the data.
Poisson: use k - 2 degrees of freedom if you have estimated lambda from the data; you may need to make the last group "x >= k".
Geometric: you may need to make the last group "x >= k".
Uniform: also used to test for independence, e.g. if the number of customers is independent of the day of the week, then each day would have the same expected frequency.
Normal: standardise with z = (x - mu)/sigma and use tables to find the probabilities; k - 3 degrees of freedom if the mean and variance are estimated from the data; use -infinity and +infinity at the lower and upper limits to ensure all values are covered.

Worked example: Analysis of the goals scored per match by a football team gave the following results.

Goals per match (x):   0   1   2   3   4   5   6   7
Matches (f):          14  18  29  18  10   7   3   1

Test at the 5% level whether the distribution can be modelled by a Poisson distribution.

ALWAYS start with a hypothesis: H0: the distribution is Poisson. Significance level: 5%.

From the data, mean lambda = Σfx/Σf = 230/100 = 2.3, so P(X = x) = e^(-2.3) (2.3)^x / x!

x:          0      1      2      3      4     >=5  (last groups combined so that every E >= 5)
Observed:  14     18     29     18     10     11
Expected: 10.03  23.06  26.52  20.33  11.69   8.38

X^2 = 4.25, compared to chi^2 (5%) with 6 - 1 - 1 = 4 degrees of freedom: chi^2 = 9.49.
As X^2 < 9.49 we do not reject H0 and conclude that a Poisson distribution (with the same mean) is a reasonable model.
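The whole goodness-of-fit calculation for the goals-per-match example can be sketched in a few lines, pooling x >= 5 so that every expected frequency is at least 5:

```python
import math

observed = [14, 18, 29, 18, 10, 7, 3, 1]   # goals 0..7
n_matches = sum(observed)                  # 100
lam = sum(x * f for x, f in enumerate(observed)) / n_matches  # estimated mean, 2.3

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

# Expected frequencies for x = 0..4, then one pooled tail group for x >= 5
expected = [n_matches * poisson_pmf(x, lam) for x in range(5)]
expected.append(n_matches - sum(expected))
obs_pooled = observed[:5] + [sum(observed[5:])]   # [14, 18, 29, 18, 10, 11]

X2 = sum((o - e) ** 2 / e for o, e in zip(obs_pooled, expected))
df = len(obs_pooled) - 1 - 1   # k - p - 1 with p = 1 (lambda estimated from data)
print(round(lam, 1), round(X2, 2), df)  # 2.3 4.25 4
```

Since X^2 = 4.25 is below the tabulated chi^2 (5%) value of 9.49 on 4 degrees of freedom, the Poisson model is not rejected, matching the conclusion above.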