STATISTICS 1 REVISION NOTES


Statistical Model

Representing and summarising Sample Data

Key words:

Quantitative Data: data in NUMERICAL FORM, such as shoe size or height.
Qualitative Data: data in NON-NUMERICAL FORM, such as eye colour or place of birth.
Continuous Data: data you can measure, which can take any value within a given range. Examples of continuous data would be height, weight and time.
Discrete Data: often data you can count, which can only take particular values. Examples of discrete data would be shoe size, age in years and cost in £ and p.
Population: the entire set of data that could potentially be sampled.
Sample: a part of the population.

Box and Whisker Diagrams

MUST be drawn on graph paper.
MUST be drawn using a ruler.
Don't forget to include a scale and label the scale (including units).
IF they ask for outliers, mark these with a cross; the whiskers then go out to the largest or smallest NON-OUTLIER piece of data in the distribution.
If they DON'T ask for outliers, don't try to find any.

When asked to compare distributions, focus on MEDIAN, IQR, SKEW and one other factor you haven't looked at, such as range. Start by STATING THE OBVIOUS (e.g. distribution A has the largest median), then explain what this means IN CONTEXT (e.g. this means that group A spends longer watching TV per week).

Histograms

Remember that frequency is proportional to area.

$\text{Frequency Density} = \dfrac{\text{Frequency}}{\text{Class Width}}$

If the data is presented like this:

Time (t)       Frequency
20 < t ≤ 30    4
30 < t ≤ 40    12
40 < t ≤ 50    15

The LOWER BOUND of the first group is 20.
The UPPER BOUND of the first group is 30.
The CLASS WIDTH will be 10.
The FREQUENCY DENSITY will be 0.4 (4 ÷ 10).

If the data is presented like this:

Time     Frequency
20-29    4
30-39    12
40-49    15

The LOWER BOUND of the first group is 19.5.
The UPPER BOUND of the first group is 29.5.
The CLASS WIDTH will be 10.
The FREQUENCY DENSITY will be 0.4 (4 ÷ 10).

Finding Quartiles, Deciles and Percentiles

When the data is just a set of numbers or in a stem and leaf diagram:

Median: the halfway point (remember that an even number of pieces of data implies TWO middle values, an odd number implies ONE).
Lower Quartile: the median of the FIRST HALF of the data. Use the median to split the data in half. If there is one middle number, the data either side of it (NOT including the median itself) form the two halves. If there are two middle numbers, the first is the last number of the first half of the data and the second is the first number of the second half.
Upper Quartile: the median of the SECOND HALF of the data.
Deciles: divide the number of pieces of data by 10 and multiply by the decile you want. If the result is a decimal, round UP. This gives the position in the ordered set of data.
Percentiles: follow the same procedure as for deciles, except divide by 100.

When the data is in the form of a grouped frequency table, you will be using LINEAR INTERPOLATION to find the quantile.

Procedure:
1. Find the group the particular quantile lies in (remember it is an ESTIMATE, so don't worry about whether there are one or two middle numbers for the median).
2. You can use a formula, but the easiest method is to use a ratio approach.

Let us assume we are trying to find the lower quartile (LQ) for a time measured in seconds, using this table:

Time (secs)    Frequency
10-19          18
20-29          53
30-39          61
40-49          44
50-59          20
60-69          4

There are 200 pieces of data, so the LQ is the 50th piece of data. This will be in the 20-29 group, so the lower bound is 19.5 seconds and the upper bound is 29.5 seconds.

Picture a line representing the data from the lower bound to the upper bound. Label the top of the line with the measurement scale (in this case seconds) and the bottom with the cumulative frequency:

Time (secs):             19.5    LQ    29.5
Cumulative frequency:    18      50    71

Here 18 is the sum of the frequencies up to this group and 71 is the sum of the frequencies including this group.

We can now use the fact that the proportions on the top part of the line have to be the same as on the bottom part of the line:

$\dfrac{LQ - 19.5}{29.5 - 19.5} = \dfrac{50 - 18}{71 - 18}$

You can now solve this equation to find LQ: $LQ = 19.5 + 10 \times \frac{32}{53} \approx 25.5$ seconds. You can use the same method to find ANY quantile.
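The ratio approach to linear interpolation can be sketched in Python. This is only an illustration, not part of the exam method: the function name `interpolated_quantile` and the way the table is stored as (lower bound, upper bound, frequency) triples are assumptions made for this example.

```python
def interpolated_quantile(classes, position):
    """Estimate the value at a given position in grouped data by linear interpolation.

    classes: list of (lower_bound, upper_bound, frequency) in ascending order.
    position: which piece of data is wanted (e.g. n / 4 for the lower quartile).
    """
    cumulative = 0  # sum of frequencies up to (not including) the current group
    for lower, upper, freq in classes:
        if freq > 0 and position <= cumulative + freq:
            # The proportion along the group is the same on the value scale
            # as it is on the frequency scale.
            fraction = (position - cumulative) / freq
            return lower + fraction * (upper - lower)
        cumulative += freq
    raise ValueError("position is beyond the total frequency")


# Class boundaries for the groups 10-19, 20-29, ..., 60-69 (times in seconds).
times = [
    (9.5, 19.5, 18),
    (19.5, 29.5, 53),
    (29.5, 39.5, 61),
    (39.5, 49.5, 44),
    (49.5, 59.5, 20),
    (59.5, 69.5, 4),
]

n = sum(freq for _, _, freq in times)  # 200 pieces of data
lower_quartile = interpolated_quantile(times, n / 4)
print(round(lower_quartile, 2))  # 25.54
```

Calling `interpolated_quantile(times, n / 2)` would give an estimate of the median in the same way.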

Mean and Standard Deviation

Mean $= \dfrac{\sum x}{n}$ if the data is a set of numbers, or $\dfrac{\sum fx}{\sum f}$ if the data is in a table. If the table is grouped, x is the mid-point of the group.

Standard deviation $= \sqrt{\dfrac{\sum x^2}{n} - \left(\dfrac{\sum x}{n}\right)^2}$ if the data is a set of numbers, or $\sqrt{\dfrac{\sum fx^2}{\sum f} - \left(\dfrac{\sum fx}{\sum f}\right)^2}$ if the data is in a table.

$\sum fx^2$ is NOT $\left(\sum fx\right)^2$: it means work out $x^2$ FIRST, multiply these by f and then add the results.

Standard deviation and mean have the same units as each other and as the data being measured.
Variance is the same formula as standard deviation WITHOUT the square-root sign. Standard deviation is the square root of variance.

Use of coding

Coding formulas are used to make the data easier to manage. Coding affects both mean AND standard deviation.

If the coding formula used is $y = \dfrac{x - a}{b}$, then you find the mean and standard deviation of the coded data. Now make x the subject:

$x = by + a$

This is the formula you will use to UNDO the effects of the coding on the mean. The standard deviation is NOT affected by adding or subtracting numbers (provided this is done to ALL the data), so when correcting the standard deviation just perform the multiplication/division parts.

Skew

                 POSITIVE SKEW           SYMMETRICAL             NEGATIVE SKEW
Quartiles        Q3 - Q2 > Q2 - Q1       Q3 - Q2 = Q2 - Q1       Q3 - Q2 < Q2 - Q1
Averages         mode < median < mean    mode = median = mean    mode > median > mean

When you are asked to justify skew, think about what you have found earlier. If you've been asked to find the median and mean (and maybe the mode), use the mode, median, mean justifications. If you've been asked to find the quartiles, use the quartile justifications. You must use actual values for these and not just descriptive terms.
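As a rough check on how coding is undone, the sketch below finds the mean and standard deviation of some coded data and then corrects them. The data, the function name `mean_and_sd` and the particular coding y = (x - 1000) / 10 are all hypothetical, chosen only to illustrate the idea.

```python
from math import sqrt


def mean_and_sd(values, freqs=None):
    """Mean = sum(fx) / sum(f); variance = sum(fx^2) / sum(f) - mean^2."""
    if freqs is None:
        freqs = [1] * len(values)  # a plain set of numbers: every frequency is 1
    total_f = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))  # square x FIRST, then multiply by f
    mean = sum_fx / total_f
    variance = sum_fx2 / total_f - mean ** 2
    return mean, sqrt(variance)


# Hypothetical grouped data: mid-points of the groups and their frequencies.
midpoints = [1010, 1030, 1050, 1070]
frequencies = [2, 5, 9, 4]

# Hypothetical coding: y = (x - 1000) / 10.
coded = [(x - 1000) / 10 for x in midpoints]
mean_y, sd_y = mean_and_sd(coded, frequencies)

# Undo the coding: the mean needs both steps reversed (x = 10y + 1000),
# the standard deviation only needs the multiplication.
mean_x = 10 * mean_y + 1000
sd_x = 10 * sd_y

# Working directly with the uncoded data gives the same answers.
direct_mean, direct_sd = mean_and_sd(midpoints, frequencies)
print(mean_x, direct_mean)
print(round(sd_x, 3), round(direct_sd, 3))
```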

Probability

Venn Diagrams
These are particularly helpful for answering certain probability questions.

Addition rule
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$

Mutually exclusive
If A and B are mutually exclusive they CANNOT happen at the same time.
If A and B are mutually exclusive then $P(A \cap B) = 0$.

Independent
If A and B are independent then the result of A does not affect the chances of B happening, and vice versa.
If A and B are independent then $P(A \cap B) = P(A) \times P(B)$.

Only use the fact that $P(A \cap B) = P(A) \times P(B)$ when you are explicitly told that the two events are independent. Otherwise assume they are not, and you cannot use this fact.

If you are asked to determine whether two events are independent, you will need to have found (or will have to find) $P(A \cap B)$, and then you need to calculate $P(A) \times P(B)$, actually writing this calculation (and its result) down. If this gives the same value as $P(A \cap B)$ then they ARE independent, otherwise they are not. Make sure you finish with a statement to this effect.

Correlation and Regression

All the formulae required for the product moment correlation coefficient are given in the formula booklet you get in the exam.

$S_{xx} = \sum (x - \bar{x})^2 = \sum x^2 - \dfrac{(\sum x)^2}{n}$

$S_{yy} = \sum (y - \bar{y})^2 = \sum y^2 - \dfrac{(\sum y)^2}{n}$

$S_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \dfrac{(\sum x)(\sum y)}{n}$

$r = \dfrac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$

Use the right hand versions for Sxx, Syy and Sxy.

r should be a value between -1 and 1. Anything between about -0.7 and -1 is evidence of good negative linear correlation, and between 0.7 and 1 of good positive linear correlation. Between -0.7 and 0.7 the evidence of linear correlation becomes weaker the closer r is to 0.

The product moment correlation coefficient IS NOT altered when using coding (for linear coding in which x and y are each multiplied or divided by a positive number).
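The product moment correlation coefficient calculation can be sketched in Python using the summation forms Sxx = Σx² - (Σx)²/n, Syy = Σy² - (Σy)²/n and Sxy = Σxy - (Σx)(Σy)/n, with r = Sxy / √(Sxx Syy). The function name `pmcc`, the paired data and the coding used below are hypothetical and serve only to illustrate that coding leaves r unchanged.

```python
from math import sqrt


def pmcc(xs, ys):
    """Product moment correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    syy = sum(y * y for y in ys) - sum(ys) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    return sxy / sqrt(sxx * syy)


# Hypothetical paired data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pmcc(x, y)

# Hypothetical coding of both variables (positive multipliers only).
x_coded = [(v - 3) / 2 for v in x]
y_coded = [10 * v + 7 for v in y]
r_coded = pmcc(x_coded, y_coded)

print(round(r, 4))        # 0.7746
print(round(r_coded, 4))  # 0.7746, the same value as for the uncoded data
```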

Evidence for a linear regression model can be found if you have drawn a scatter graph and it appears to show good linear correlation (negative or positive), or you have found the PMCC and this shows the same result.

The linear regression model is y = a + bx, where:

x is the INDEPENDENT or EXPLANATORY VARIABLE, because its values are set by the experimenter, i.e. the person doing the experiment controls them.
y is the DEPENDENT or RESPONSE VARIABLE, because it is dependent on x.
a and b are constants which need to be found. They are essentially equivalent to c and m respectively in y = mx + c (i.e. the y-intercept and gradient).

Again, all these formulae are given in the exam formula booklet:

$b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$

(the bars at the top of x and y indicate the mean of x and the mean of y).

Coding DOES affect the regression model, so when correcting for the coding you will need to replace x and y with the coding formulae used, then rearrange and simplify to get the relevant regression model.

If you are asked to INTERPRET the value of a or b you MUST do this in CONTEXT. In y = a + bx:

a is what y is when x is 0.
b is what y increases by (or decreases by, if b is negative) when x increases by 1.

For example, suppose the regression model for the length in cm of a spring (L) when a mass in g (m) is hung from it is L = 0.04m + 16.2. Interpretation of the constants (in context) would be:

a = 16.2, so when there is no mass attached the spring has a length of 16.2 cm.
b = 0.04, so for every gram that is added to the spring, it extends by 0.04 cm.

Random variables

Key Words

Random Variable: a variable that represents the value obtained when you take a measurement from a real world experiment.
Probability Distribution: the set of all possible values of the random variable together with their associated probabilities. This is usually presented as a table.
Probability Distribution Function (pdf): the function that decides how the probabilities for each value of the random variable are assigned. Denoted by a lower case f, so look out for f(x).
Cumulative Distribution Function (cdf): similar to the pdf except the probabilities are CUMULATIVE, so the final one will be 1. It tells you the probability that X is less than or equal to a particular value, i.e. P(X ≤ x). Denoted by a capital letter F, so look out for F(x).
Expected Value [E(X)]: like the mean. If an experiment is repeated many times, it is the value you would get on average.
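For the regression model y = a + bx described earlier in these notes, the constants can be computed as b = Sxy / Sxx and a = ȳ - b x̄. The sketch below is only an illustration: the function name `regression_line` is an assumption, and the spring data are hypothetical values constructed to lie exactly on the line L = 16.2 + 0.04m rather than real measurements.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least squares regression line y = a + bx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = sxy / sxx          # gradient
    a = y_bar - b * x_bar  # intercept: the value of y when x is 0
    return a, b


# Hypothetical data: mass in g (explanatory) and spring length in cm (response).
masses = [0, 50, 100, 150, 200]
lengths = [16.2, 18.2, 20.2, 22.2, 24.2]

a, b = regression_line(masses, lengths)
print(round(a, 2), round(b, 2))  # 16.2 0.04
```

In context, a is the length of the spring with no mass attached and b is the extension per extra gram.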

If a random variable X can take the values $x_1, x_2, x_3, \ldots$ with associated probabilities $P(X = x_1), P(X = x_2), P(X = x_3), \ldots$ then:

Expected Value, $E(X) = \sum x \, P(X = x)$
i.e. multiply each value by its associated probability, and E(X) is the sum of all these.

Variance, $Var(X) = E(X^2) - [E(X)]^2$ where $E(X^2) = \sum x^2 \, P(X = x)$
(i.e. the same as E(X) except you have to square each value before multiplying it by its probability).

Linear functions of a random variable

If a random variable X is transformed using a linear function to aX + b, then E(aX + b) and Var(aX + b) can be found using the original values of E(X) and Var(X).

Remember that adding or subtracting the same value to each value of the random variable affects the mean but NOT the spread (i.e. the variance), but multiplying or dividing affects BOTH. Also remember that variance is standard deviation squared.

So, if X has expected value E(X) and variance Var(X) and it is transformed to aX + b, then:

$E(aX + b) = aE(X) + b$
$Var(aX + b) = a^2 Var(X)$
(i.e. IGNORE the +b when finding Var, but remember to square the a).

Discrete Uniform Distribution

This is a particular discrete distribution where each value of the random variable is equally likely. A simple example would be the number shown when a fair six-sided die is thrown. The values would be 1 to 6, each with an associated probability of 1/6.

A discrete uniform distribution is defined on a set of n distinct values where each outcome is equally likely. For a discrete uniform distribution:

$P(X = x) = \dfrac{1}{n}$

We can also easily find E(X) and Var(X) without resorting to the usual formulae (though these also work). However, for these shortcut formulae to work, the values have to be 1, 2, 3, ..., n. If they aren't, you will need to perform a linear transformation, aX + b, to make this happen before you can use these particular formulae.

If X = 1, 2, 3, 4, ..., n then:

$E(X) = \dfrac{n + 1}{2}$ and $Var(X) = \dfrac{(n + 1)(n - 1)}{12}$
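The expectation and variance rules can be checked with a short Python sketch using exact fractions. It uses the fair six-sided die as the distribution; the function names and the transformation 2X + 3 are assumptions made purely for illustration.

```python
from fractions import Fraction


def expectation(dist):
    """E(X): multiply each value by its probability and add the results."""
    return sum(x * p for x, p in dist.items())


def variance(dist):
    """Var(X) = E(X^2) - [E(X)]^2."""
    e_x2 = sum(x * x * p for x, p in dist.items())
    return e_x2 - expectation(dist) ** 2


# Discrete uniform distribution on 1, 2, ..., n (a fair six-sided die).
n = 6
die = {x: Fraction(1, n) for x in range(1, n + 1)}

e_x = expectation(die)
var_x = variance(die)
print(e_x, var_x)  # 7/2 35/12

# Hypothetical linear function of the random variable: 2X + 3.
a, b = 2, 3
transformed = {a * x + b: p for x, p in die.items()}
print(expectation(transformed), variance(transformed))  # 10 35/3
```

The printed values agree with (n + 1)/2 and (n + 1)(n - 1)/12 for n = 6, and with aE(X) + b and a²Var(X) for the transformed variable.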

Normal Distribution

Some basic points

The normal distribution is a CONTINUOUS distribution, NOT a discrete distribution.
It is centred on the mean and is symmetrical about it.
The area under the curve is equivalent to the probability, so the total area under the curve is 1.
The mean = median.
The curve is asymptotic to the x-axis.
A normal distribution is defined using its mean (μ) and variance (σ²), and we write a particular normally distributed model as $X \sim N(\mu, \sigma^2)$.
When calculating, give Z values to 2 decimal places and probabilities to 4 decimal places.

Using the tables

The large table is usually used if you know the z value and want to find the probability. The p values give you P(Z < z).
The small table is usually used if you know the probability (and it is a simple value like 1%, 5%, 10% etc.) and you need to find the z value. It is the reverse of the larger table, and the p and z values give you P(Z > z) = p.
If you are using the big table and need P(Z > z), then it is 1 - P(Z < z). If you need P(Z < -z), then it is also 1 - P(Z < z). If it is both, i.e. P(Z > -z), then they cancel out, so it is the same as P(Z < z).
If you are ever asked for P(X = x) the answer is ALWAYS 0 (if you do S2 you will learn about something called a continuity correction that gets around this, but for S1 purposes the answer is 0).

Spread of the data

About 2/3 of the data lies within 1 standard deviation of the mean if the data is normally distributed.
About 95% of the data lies within 2 standard deviations of the mean if the data is normally distributed.
About 99.7% of the data lies within 3 standard deviations of the mean if the data is normally distributed.

Changing the mean or standard deviation

Changing the mean (but not the standard deviation) translates the curve along the x-axis but leaves the shape exactly the same.
Increasing the standard deviation (but not the mean) means the spread is larger, so the curve becomes lower but fatter (note it stays centred on the same mean value).
Decreasing the standard deviation makes the curve narrower but taller.
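The table relationships and the "2/3, 95%, 99.7%" facts can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module in place of the printed tables, which is an assumption of this example (in the exam you would use the tables); the value z = 1.25 is hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution: mean 0, standard deviation 1

z = 1.25  # hypothetical z value

p_less = Z.cdf(z)                 # P(Z < z), what the large table gives
p_greater = 1 - p_less            # P(Z > z)  = 1 - P(Z < z)
p_less_neg = Z.cdf(-z)            # P(Z < -z) = 1 - P(Z < z), by symmetry
p_greater_neg = 1 - Z.cdf(-z)     # P(Z > -z) = P(Z < z)

print(round(p_less, 4))           # 0.8944
print(round(p_greater, 4), round(p_less_neg, 4))  # 0.1056 0.1056

# Small-table style lookup: the z value with P(Z > z) = 0.05.
z_5_percent = Z.inv_cdf(1 - 0.05)
print(round(z_5_percent, 4))      # 1.6449

# Proportion of data within 1, 2 and 3 standard deviations of the mean.
within = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}
for k, p in within.items():
    print(k, round(p, 4))         # 1 0.6827, 2 0.9545, 3 0.9973
```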

To find probabilities associated with the normal distribution, we need to STANDARDISE the distribution. This converts a normal distribution defined as $X \sim N(\mu, \sigma^2)$ to the standard normal distribution (Z), which has a mean of 0 and a standard deviation of 1, i.e. $Z \sim N(0, 1^2)$.

To do this we use the standardising formula:

$Z = \dfrac{X - \mu}{\sigma}$

If $X \sim N(\mu, \sigma^2)$ and you need to find P(a < X < b), you are finding the area under the normal curve between a and b.

[Figure: normal curve with a and b marked on the horizontal axis]

First standardise a and b using the standardising formula (above). So P(a < X < b) becomes

$P\left(\dfrac{a - \mu}{\sigma} < Z < \dfrac{b - \mu}{\sigma}\right)$

Then use your tables to work out

$P\left(Z < \dfrac{b - \mu}{\sigma}\right) - P\left(Z < \dfrac{a - \mu}{\sigma}\right)$
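The standardising steps can be sketched in Python as follows. The function name `prob_between` and the example distribution X ~ N(50, 4²) are hypothetical, and `NormalDist().cdf` from the standard `statistics` module stands in for looking up P(Z < z) in the large table.

```python
from statistics import NormalDist


def prob_between(a, b, mu, sigma):
    """P(a < X < b) for X ~ N(mu, sigma^2), found by standardising a and b."""
    z_a = (a - mu) / sigma  # standardised lower limit
    z_b = (b - mu) / sigma  # standardised upper limit
    standard_normal = NormalDist()  # Z ~ N(0, 1^2)
    # Area between the limits: P(Z < z_b) - P(Z < z_a).
    return standard_normal.cdf(z_b) - standard_normal.cdf(z_a)


# Hypothetical example: X ~ N(50, 4^2), find P(46 < X < 58).
# Standardising gives z values of -1 and 2.
p = prob_between(46, 58, mu=50, sigma=4)
print(round(p, 4))  # 0.8186
```

Working from 4 d.p. table values (0.9772 and 0.8413) gives 0.9772 - (1 - 0.8413) = 0.8185, which differs in the last digit only because of rounding in the table entries.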