STATISTICS 1 REVISION NOTES

Statistical Model
Representing and Summarising Sample Data

Key words

Quantitative Data: data in NUMERICAL FORM, such as shoe size or height.
Qualitative Data: data in NON-NUMERICAL FORM, such as eye colour or place of birth.
Continuous Data: data you can measure, which can take any value within a given range. Examples of continuous data are height, weight and time.
Discrete Data: often data you can count, which can only take particular values. Examples of discrete data are shoe size, age in years and cost in £ and p.
Population: the entire set of data that could potentially be sampled.
Sample: a part (subset) of the population.

Box and Whisker Diagrams

MUST be drawn on graph paper.
MUST be drawn using a ruler.
Don't forget to include a scale and label the scale (including units).
IF they ask for outliers, mark these with a cross; the whiskers then go out to the largest or smallest NON-OUTLIER piece of data in the distribution.
If they DON'T ask for outliers, don't try to find any.

When asked to compare distributions, focus on MEDIAN, IQR, SKEW and one other factor you haven't looked at, such as range. Start by STATING THE OBVIOUS (e.g. distribution A has the largest median), then explain what this means IN CONTEXT (e.g. this means that distribution A spends longer watching TV per week).
Histograms

Remember that frequency is proportional to area.

$\text{Frequency density} = \dfrac{\text{frequency}}{\text{class width}}$

If the data is presented like this:

Time (t)        Frequency
20 < t ≤ 30     4
30 < t ≤ 40     12
40 < t ≤ 50     15

The LOWER BOUND of the first group is 20.
The UPPER BOUND of the first group is 30.
The CLASS WIDTH will be 10.
The FREQUENCY DENSITY will be 0.4 (4 ÷ 10).

If the data is presented like this:

Time            Frequency
20-29           4
30-39           12
40-49           15

The LOWER BOUND of the first group is 19.5.
The UPPER BOUND of the first group is 29.5.
The CLASS WIDTH will be 10.
The FREQUENCY DENSITY will be 0.4 (4 ÷ 10).

Finding Quartiles, Deciles and Percentiles

When the data is just a set of numbers or in a stem and leaf diagram:

Median: the halfway point (remember an even number of pieces of data implies TWO middle values, an odd number implies ONE).
Lower Quartile: the median of the FIRST HALF of the data. Use the median to split the data in half. If there is one middle number, the data either side of it (NOT including the median itself) forms the two halves. If there are two middle numbers, the first is the last number of the first half of the data and the second is the first number of the second half.
Upper Quartile: the median of the SECOND HALF of the data.
Deciles: divide the number of pieces of data by 10 and multiply by the decile you want. If the result is a decimal, round UP. This gives its position in the set of data.
Percentiles: follow the same procedure as for deciles, except divide by 100.

When the data is in the form of a grouped frequency table, you will be using LINEAR INTERPOLATION to find the quantile.

Procedure:
1. Find the group the particular quantile lies in (remember it is an ESTIMATE, so don't worry about whether there are one or two middle numbers for the median).
2. You can use a formula, but the easiest method is to use a ratio approach.

Let us assume we are trying to find the lower quartile for a time measured in seconds. There are 200 pieces of data, so the LQ is the 50th piece of data.
This will be in the 20-29 group, so the lower bound is 19.5 seconds and the upper bound is 29.5 seconds.

Time (secs)     Frequency
10-19           18
20-29           53
30-39           61
40-49           44
50-59           20
60-69           4

The line below represents the data from the lower to the upper bound. On the top part put the scale being measured (in this case seconds) and on the bottom part the cumulative frequency.

Time (secs):   19.5           LQ             29.5
               |--------------|--------------|
Frequency:     18             50             71

Here 18 is the sum of the frequencies up to this group and 71 is the sum of the frequencies including this group.

We can now use the fact that the proportions on the top part of the line have to be the same as those on the bottom part of the line:

$\dfrac{LQ - 19.5}{29.5 - 19.5} = \dfrac{50 - 18}{71 - 18}$

You can now solve this equation to find LQ. You can use the same method to find ANY quantile.
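The ratio approach above can be sketched in Python as a quick check on hand calculations. This is only an illustrative sketch: the function name interpolate_quantile and the variable names are made up for this example, and the class bounds (9.5 to 19.5, and so on) assume the times were rounded to the nearest second, as in the worked example.

```python
def interpolate_quantile(groups, position):
    """Estimate a quantile from a grouped frequency table by linear interpolation.

    groups   -- list of (lower_bound, upper_bound, frequency), in ascending order
    position -- which piece of data is wanted (e.g. 50 for the LQ of 200 items)
    """
    cumulative = 0  # sum of frequencies up to (not including) the current group
    for lower, upper, freq in groups:
        if cumulative + freq >= position:
            # Same proportion along the frequency line as along the value line
            fraction = (position - cumulative) / freq
            return lower + fraction * (upper - lower)
        cumulative += freq
    raise ValueError("position is larger than the total frequency")


# Time table from the notes, written with class bounds (assumed nearest second)
times = [
    (9.5, 19.5, 18),
    (19.5, 29.5, 53),
    (29.5, 39.5, 61),
    (39.5, 49.5, 44),
    (49.5, 59.5, 20),
    (59.5, 69.5, 4),
]

total = sum(freq for _, _, freq in times)            # 200 pieces of data
lower_quartile = interpolate_quantile(times, total / 4)
median = interpolate_quantile(times, total / 2)

print(round(lower_quartile, 2))  # 25.54
print(round(median, 2))          # 34.25
```

The lower quartile here is the solution of the equation above, 19.5 + 10 × 32/53. Changing the position (for example 0.9 × total for the 9th decile) gives any other quantile.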
Mean and Standard Deviation

Mean: $\bar{x} = \dfrac{\sum x}{n}$ if the data is a set of numbers, or $\bar{x} = \dfrac{\sum fx}{\sum f}$ if the data is in a table. If the table is grouped, x is the mid-point of the group.

Standard deviation: $\sqrt{\dfrac{\sum x^2}{n} - \left(\dfrac{\sum x}{n}\right)^2}$ if the data is a set of numbers, or $\sqrt{\dfrac{\sum fx^2}{\sum f} - \left(\dfrac{\sum fx}{\sum f}\right)^2}$ if the data is in a table.

$\sum fx^2$ is NOT $\sum (fx)^2$: it means work out $x^2$ FIRST, multiply these by f and then add the results.

Standard deviation and mean have the same units as each other and as the data being measured.
Variance is the same formula as standard deviation WITHOUT the square-root sign. Standard deviation is the square root of variance.

Use of coding

Coding formulas are used to make the data easier to manage. Coding affects both mean AND standard deviation.
If the coding formula used is, for example, $Y = \dfrac{X - a}{b}$, then you find the mean and standard deviation of the coded data. Now make X the subject: $X = bY + a$. This is the formula you will use to UNDO the effects of the coding on the mean.
The standard deviation is NOT affected by adding or subtracting numbers (provided this is done to ALL the data), so when correcting the standard deviation just perform the multiplication/division parts.

Skew

POSITIVE SKEW           SYMMETRICAL             NEGATIVE SKEW
Q3 - Q2 > Q2 - Q1       Q3 - Q2 = Q2 - Q1       Q3 - Q2 < Q2 - Q1
mode < median < mean    mode = median = mean    mode > median > mean

When you are asked to justify skew, think about what you have found earlier. If you've been asked to find the median and mean (and maybe mode), use the mode, median, mean justifications. If you've been asked to find the quartiles, use the quartile justifications. You must use actual values for these and not just descriptive terms.
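As a rough check on the mean, standard deviation and coding rules above, here is a short Python sketch. It reuses the grouped time table from the interpolation example, with mid-points that assume class bounds of 9.5 to 19.5 and so on. The coding y = (x - 34.5) / 10 and the names mean_and_sd, A and B are chosen purely for illustration and are not from the notes.

```python
import math


def mean_and_sd(xs, fs):
    """Mean and standard deviation from a frequency table (values xs, frequencies fs)."""
    n = sum(fs)
    sum_fx = sum(f * x for x, f in zip(xs, fs))
    sum_fx2 = sum(f * x ** 2 for x, f in zip(xs, fs))  # square x FIRST, then multiply by f
    mean = sum_fx / n
    variance = sum_fx2 / n - mean ** 2
    return mean, math.sqrt(variance)


midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]  # mid-points of the time groups
freqs = [18, 53, 61, 44, 20, 4]

# Illustrative coding (an assumption for this sketch): y = (x - A) / B
A, B = 34.5, 10
coded = [(x - A) / B for x in midpoints]
mean_y, sd_y = mean_and_sd(coded, freqs)

# Undo the coding: the mean uses x = B*y + A, the standard deviation only uses the multiplier
mean_x = B * mean_y + A
sd_x = B * sd_y

# Compare with working directly on the uncoded mid-points
direct_mean, direct_sd = mean_and_sd(midpoints, freqs)
print(mean_x, direct_mean)
print(sd_x, direct_sd)
```

Both routes should agree (up to floating-point rounding), which is the point of coding: the arithmetic is done on the small coded values and corrected at the end.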
Probability

Venn Diagrams
These are particularly helpful for answering certain probability questions.

Addition rule
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

Mutually exclusive
If A and B are mutually exclusive they CANNOT happen at the same time.
If A and B are mutually exclusive then $P(A \cap B) = 0$.

Independent
If A and B are independent, the result of A does not affect the chances of B happening and vice versa.
If A and B are independent then $P(A \cap B) = P(A) \times P(B)$.

Only use the fact that $P(A \cap B) = P(A) \times P(B)$ when you are explicitly told that the two events are independent. Otherwise assume they are not and that you cannot use this fact.

If you are asked to determine whether two events are independent, you will need to have found (or will have to find) $P(A \cap B)$, and then you need to calculate $P(A) \times P(B)$, actually writing this calculation (and its result) down. If this gives the same value as $P(A \cap B)$ then they ARE independent, otherwise they are not. Make sure you finish with a statement to this effect.

Correlation and Regression

All the formulae required for the product moment correlation coefficient are given in the formula booklet you get in the exam.

$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \dfrac{(\sum x)^2}{n}$

$S_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - \dfrac{(\sum y)^2}{n}$

$S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \dfrac{\sum x \sum y}{n}$

$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$

Use the right-hand versions for $S_{xx}$, $S_{yy}$ and $S_{xy}$.

r should be a value between -1 and 1. Anything between about -0.7 and -1 is evidence of good negative linear correlation. Anything between about 0.7 and 1 is evidence of good positive linear correlation. Between -0.7 and 0.7 the linear correlation is weaker and less reliable.

The product moment correlation coefficient IS NOT altered when using coding.
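The right-hand versions of the formulae above can be sketched in Python as follows. The five data pairs are invented purely for illustration, and the names s_values and pmcc are not from the formula booklet. The last few lines also check the claim that coding does not change r, using an arbitrary coding with positive multipliers.

```python
import math


def s_values(xs, ys):
    """Return (Sxx, Syy, Sxy) using the right-hand (summary statistic) formulae."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxx, syy, sxy


def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    sxx, syy, sxy = s_values(xs, ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(s_values(xs, ys))  # (10.0, 6.0, 6.0)
r = pmcc(xs, ys)
print(round(r, 4))       # 0.7746

# Arbitrary coding of both variables (positive multipliers): r should not change
coded_xs = [(x - 3) / 2 for x in xs]
coded_ys = [10 * y + 7 for y in ys]
r_coded = pmcc(coded_xs, coded_ys)
print(round(r_coded, 4))  # 0.7746
```

Here r is about 0.77, which on the guideline above counts as evidence of good positive linear correlation.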
Evidence for a linear regression model can be found if you have drawn a scatter graph and it appears to show good linear correlation (negative or positive), or if you have found the PMCC and this shows the same result.

The linear regression model is y = a + bx, where:
x is the INDEPENDENT or EXPLANATORY VARIABLE, because its values are chosen or controlled by the person doing the experiment.
y is the DEPENDENT or RESPONSE VARIABLE, because it is dependent on x.
a and b are constants which need to be found. They are essentially equivalent to c and m respectively in y = mx + c (i.e. the y-intercept and gradient).

Again, all these formulae are given in the exam formula booklet:

$b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$

(the bars above x and y indicate the mean of x and the mean of y).

Coding DOES affect the regression model, so when correcting for the coding you will need to replace x and y with the coding formulae used, then rearrange and simplify to get the relevant regression model.

If you are asked to INTERPRET either value of a or b you MUST do this in CONTEXT.

y = a + bx
a: what y is when x is 0.
b: what y increases by (or decreases by, if b is negative) when x increases by 1.

For example, if the regression model for the length in cm of a spring (L) when a mass in g (m) is hung from it is L = 16.2 + 0.04m, the interpretation of the constants (in context) would be:
a = 16.2, so when there is no mass attached the spring has a length of 16.2 cm.
b = 0.04, so for every gram that is added to the spring, it extends by 0.04 cm.

Random variables

Key Words

Random Variable: a variable that represents the value obtained when you take a measurement from a real-world experiment.
Probability Distribution: the set of all possible values of the random variable together with their associated probabilities. This is usually presented as a table.
Probability Distribution Function (pdf): the function that decides how the probabilities are assigned to each value of the random variable.
It is denoted by a lower-case f, so look out for f(x).
Cumulative Distribution Function (cdf): similar to the pdf except the probabilities are CUMULATIVE, so the final one will be 1. It tells you the probability that X ≤ that particular value. It is denoted by a capital letter F, so look out for F(x).
Expected Value [E(X)]: like the mean. If an experiment is repeated many times, it is the value you would get on average.
If a random variable X can take the values $x_1, x_2, x_3, \ldots$ with associated probabilities $P(X = x_1), P(X = x_2), P(X = x_3), \ldots$ then:

Expected Value: $E(X) = \sum x \, P(X = x)$
i.e. multiply each value by its associated probability; E(X) is the sum of all these.

Variance: $Var(X) = E(X^2) - [E(X)]^2$ where $E(X^2) = \sum x^2 \, P(X = x)$
(i.e. the same as E(X) except you have to square each value before multiplying it by its probability).

Linear functions of a random variable

If a random variable X is transformed using a linear function to aX + b, then E(aX + b) and Var(aX + b) can be found using the original values of E(X) and Var(X).
Remember that adding or subtracting the same value to each value of the random variable affects the mean but NOT the spread (i.e. the variance), but multiplying or dividing affects BOTH.
Also remember that variance is standard deviation squared.

So, if X has expected value E(X) and variance Var(X) and it is transformed to aX + b, then:

$E(aX + b) = aE(X) + b$
$Var(aX + b) = a^2 Var(X)$
(i.e. IGNORE the +b when finding Var, but remember to square the a).

Discrete Uniform Distribution

This is a particular discrete distribution where each value of the random variable is equally likely. A simple example would be the number shown when a fair six-sided die is thrown. The values would be 1 to 6, each with an associated probability of 1/6.
A discrete uniform distribution is defined over a set of n distinct values where each outcome is equally likely.

For a discrete uniform distribution, $P(X = x) = \dfrac{1}{n}$

We can also easily find E(X) and Var(X) without resorting to the usual formulae (though these also work). However, for these formulae to work, the values have to be 1, 2, 3, ..., n. If they aren't, you will need to perform a linear transformation, aX + b, to make this happen before you can use these particular formulae.

If X = 1, 2, 3, 4, ..., n then:

$E(X) = \dfrac{n + 1}{2}$ and $Var(X) = \dfrac{(n + 1)(n - 1)}{12}$
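The formulae for E(X), Var(X) and linear functions can be checked on the fair-die example with a short Python sketch. Exact fractions are used so the answers match hand working. The function names and the transformation Y = 2X + 3 are chosen only for illustration and are not from the notes.

```python
from fractions import Fraction


def expectation(dist):
    """E(X): sum of value times probability (dist maps value -> probability)."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(X) = E(X^2) - [E(X)]^2."""
    e_x2 = sum(x * x * p for x, p in dist.items())
    return e_x2 - expectation(dist) ** 2


# Fair six-sided die: discrete uniform on 1, 2, ..., n with n = 6
n = 6
die = {x: Fraction(1, n) for x in range(1, n + 1)}

e_x = expectation(die)
var_x = variance(die)
print(e_x, var_x)  # 7/2 35/12

# Shortcut formulae for a discrete uniform distribution on 1..n
print(Fraction(n + 1, 2), Fraction((n + 1) * (n - 1), 12))  # 7/2 35/12

# Illustrative linear transformation Y = aX + b with a = 2, b = 3
a, b = 2, 3
transformed = {a * x + b: p for x, p in die.items()}
e_y = expectation(transformed)
var_y = variance(transformed)
print(e_y, var_y)  # 10 35/3
```

The transformed values agree with aE(X) + b = 2 × 7/2 + 3 = 10 and a²Var(X) = 4 × 35/12 = 35/3, with the +3 having no effect on the variance.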
Normal Distribution

Some basic points

The normal distribution is a CONTINUOUS distribution, NOT a discrete distribution.
It is centred on the mean and is symmetrical about it.
The area under the curve is equivalent to the probability, so the total area under the curve is 1.
The mean = median (= mode).
The curve is asymptotic to the x-axis.
A normal distribution is defined using its mean (μ) and variance (σ²), and we write a particular normally distributed model as X ~ N(μ, σ²).

When calculating, give Z values to 2 decimal places and probabilities to 4 decimal places.
The large table is usually used if you know the Z value and want to find the probability. The p values give you P(Z < z).
The small table is usually used if you know the probability (and it is a simple value like 1%, 5%, 10% etc.) and you need to find the z value. It is the reverse of the larger table, and its z and p values give you P(Z > z).

If you are using the big table and need P(Z > z) then it is 1 - P(Z < z); if you need P(Z < -z) then it is also 1 - P(Z < z); and if it is both, i.e. P(Z > -z), then they cancel out so it is the same as P(Z < z).

If you are ever asked for P(X = x) the answer is ALWAYS 0 (if you do S2 you will learn about something called a continuity correction that gets around this, but for S1 purposes the answer is 0).

If the data is normally distributed:
About 2/3 of the data lies within 1 standard deviation of the mean.
About 95% of the data lies within 2 standard deviations of the mean.
About 99.7% of the data lies within 3 standard deviations of the mean.

Changing the mean (but not the standard deviation) translates the curve along the x-axis but leaves the shape exactly the same.
Increasing the standard deviation (but not the mean) means the spread is larger, so the curve becomes lower but fatter (note it stays centred on the same mean value). Decreasing the standard deviation makes the curve narrower but taller.
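The table rules and the "2/3, 95%, 99.7%" figures above can be checked with Python's standard library, whose statistics.NormalDist class gives P(Z < z) through its cdf method (playing the role of the large table). This is only a checking sketch: the value z = 1.25 is an arbitrary illustration, and in the exam you would still be expected to use the tables.

```python
from statistics import NormalDist

Z = NormalDist(mu=0, sigma=1)  # standard normal distribution, Z ~ N(0, 1)

z = 1.25                       # arbitrary illustrative z value
p_below = Z.cdf(z)             # P(Z < z), what the large table gives
p_above = 1 - Z.cdf(z)         # P(Z > z)  = 1 - P(Z < z)
p_below_neg = Z.cdf(-z)        # P(Z < -z) = 1 - P(Z < z) by symmetry
p_above_neg = 1 - Z.cdf(-z)    # P(Z > -z) = P(Z < z)

print(round(p_below, 4))       # 0.8944
print(round(p_above, 4), round(p_below_neg, 4))  # both 0.1056
print(round(p_above_neg, 4))   # 0.8944

# Proportion of data within k standard deviations of the mean
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}
for k, proportion in within.items():
    print(k, round(proportion, 4))  # 0.6827, 0.9545, 0.9973
```

So "about 2/3" is really about 68%, and the 95% and 99.7% figures are roundings of 95.45% and 99.73%.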
To find probabilities associated with the normal distribution, we need to STANDARDISE the distribution. This converts a normal distribution defined as X ~ N(μ, σ²) to the standard normal distribution (Z), which has a mean of 0 and a standard deviation of 1, i.e. Z ~ N(0, 1²).

To do this we use the standardising formula:

$Z = \dfrac{X - \mu}{\sigma}$

If X ~ N(μ, σ²) and you need to find P(a < X < b), you are finding the area under the normal curve between a and b.

[Figure: normal curve with a and b marked on the horizontal axis]

First standardise a and b using the standardising formula (above).

So, P(a < X < b) becomes $P\left(\dfrac{a - \mu}{\sigma} < Z < \dfrac{b - \mu}{\sigma}\right)$

Then use your tables to work out $P\left(Z < \dfrac{b - \mu}{\sigma}\right) - P\left(Z < \dfrac{a - \mu}{\sigma}\right)$
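The standardising steps above can be sketched in Python, with statistics.NormalDist standing in for the tables. The distribution X ~ N(50, 4²) and the limits 44 and 58 are made up for this illustration, and the variable names are not from the notes.

```python
from statistics import NormalDist

# Hypothetical example: X ~ N(50, 4^2), find P(44 < X < 58)
mu, sigma = 50, 4
a, b = 44, 58

# Step 1: standardise a and b with Z = (X - mu) / sigma
z_a = (a - mu) / sigma   # -1.5
z_b = (b - mu) / sigma   #  2.0

# Step 2: P(a < X < b) = P(Z < z_b) - P(Z < z_a), using the standard normal
Z = NormalDist(0, 1)
probability = Z.cdf(z_b) - Z.cdf(z_a)
print(z_a, z_b)               # -1.5 2.0
print(round(probability, 4))  # 0.9104

# Cross-check without standardising: work with X directly
X = NormalDist(mu, sigma)
print(round(X.cdf(b) - X.cdf(a), 4))  # 0.9104
```

With the tables this would be P(Z < 2.00) - P(Z < -1.50) = 0.9772 - (1 - 0.9332) = 0.9104, which matches.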