F78SC2 Notes 2 RJRC


Jack Stevens
5 years ago

Algebra

It is useful to use letters to represent numbers. We can use the rules of arithmetic to manipulate the formula and just substitute in the numbers at the end.

Example: 100 invested for 2 years. We use $x$ to represent the interest rate.
After one year, the value will be $100(1 + x)$.
After 2 years, the value will be $100(1 + x)^2$.
If the interest rate is 5%, we substitute $x = 0.05$ in the formula. This gives $100(1.05)^2 = 110.25$.

Example: 100 invested. What interest rate would give a value of 115 after 2 years?
Require $100(1 + x)^2 = 115$
So $(1 + x)^2 = 1.15$
So $1 + x = \sqrt{1.15} = 1.0724$
So $x = 7.24\%$

Revision: Rules of Arithmetic

Order of calculations:
1. Evaluate expressions within brackets.
2. Evaluate functions (e.g. square root, power, log, exp).
3. Evaluate multiplications and divisions.
4. Evaluate additions and subtractions.

Note that multiplying two negative numbers gives a positive answer.

Examples (using the positive value of the square root):
$(1 + 3)\sqrt{9} = 12$
$(-1)^2 + (-3)^2 = 10$
$(1 + 3)^2 = 16$

Minus Signs

Note: $-1^2 = -1$ is not the same as $(-1)^2 = +1$.

What is meant by expressions such as $-1 - 2$? Answer: Take the negative number $-1$ and then subtract 2 from it.

Note that the minus sign is used in two different ways:

As a unary operator, to change 1 to $-1$.
As a binary operator, to subtract 2 from $-1$.

The above shows how to interpret expressions like $1 - -2$. The second minus sign is a unary minus, so $-2$ is subtracted from 1, giving the answer 3.
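The algebra and minus-sign rules above can be checked numerically. The following is a minimal sketch in Python, which happens to follow the same order of calculations; the function names `value_after` and `rate_for_target` are illustrative and do not come from the notes.

```python
def value_after(principal, rate, years):
    """Value of a sum that grows by the factor (1 + rate) once per year."""
    return principal * (1 + rate) ** years


def rate_for_target(principal, target, years):
    """Annual rate x that solves principal * (1 + x)**years == target."""
    return (target / principal) ** (1 / years) - 1


# 100 invested for 2 years at 5%.
print(round(value_after(100, 0.05, 2), 2))

# Rate needed to turn 100 into 115 in 2 years, as a percentage.
print(round(100 * rate_for_target(100, 115, 2), 2))

# Powers are evaluated before the unary minus is applied.
print(-1 ** 2)     # the power is taken first, then the sign is changed
print((-1) ** 2)   # brackets first: the negative number is squared
print(1 - -2)      # binary minus followed by unary minus
```

The last three lines show that brackets are needed whenever a negative number itself is to be squared.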

Histograms

Suitable for continuous data. Data values are grouped into bins and the numbers in each bin are counted. Bins need not all be of the same width.

Examples: Student heights, House prices.

Construction:
1. Choose the bin width so that there are between 6 and 20 bins. Use more bins for larger sample sizes. The first bin should contain the minimum value and the last should contain the maximum.
2. Construct the frequency table.
3. Draw the histogram. The area of each rectangle should be proportional to the number in the bin.

Note: Minitab does not allow unequal bin widths.

Drawing Histograms

Rule for unequal bin widths:
$\text{Area} \propto \text{Frequency}$
$\text{Height} \times \text{Width} \propto \text{Frequency}$
$\text{Height} \propto \dfrac{\text{Frequency}}{\text{Width}}$

Example: The Edinburgh house price data is a case where different bin widths should be used.

Band | Width (W) | Freq. (F) | F/W | 200F/W

The height of each rectangle must be proportional to the value in the penultimate column of the table (F/W). It may be more convenient when plotting by hand to multiply these values by a suitable constant, such as 200.

Figure 1: Histogram of Student heights, equal bin sizes

Figure 2: Histogram of House Prices, equal bin sizes
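The rule for unequal bin widths (height proportional to frequency divided by width) can be sketched in a few lines of Python. The bin edges and frequencies below are hypothetical, since the house price table is not reproduced here, and `bar_heights` is an illustrative name rather than anything from Minitab.

```python
def bar_heights(edges, freqs, scale=1.0):
    """Heights for histogram bars with possibly unequal widths.

    edges: bin boundaries in ascending order (one more than the number of bins)
    freqs: number of observations in each bin
    scale: optional constant to make the heights easier to plot by hand
    """
    if len(edges) != len(freqs) + 1:
        raise ValueError("need exactly one more edge than frequencies")
    return [scale * f / (hi - lo)
            for lo, hi, f in zip(edges, edges[1:], freqs)]


# Hypothetical price bands (in thousands) and counts, for illustration only.
edges = [0, 50, 100, 200, 400]
freqs = [10, 30, 40, 20]

heights = bar_heights(edges, freqs, scale=200)
for lo, hi, f, h in zip(edges, edges[1:], freqs, heights):
    print(f"{lo:>3}-{hi:<3}  width={hi - lo:<3}  freq={f:<2}  height={h}")
```

With these heights, each bar's area (height times width) is the same multiple of its frequency, which is what the area rule requires.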

Figure 3: Histogram of House Prices, unequal bin sizes (Density plotted against Price)

Stem-and-Leaf Plots

Similar to a histogram, but contains much more information. Bin widths are taken as a power of 10 times either 2 or 5 or 10. Suitable for continuous or discrete data.

Example: Heights (in cm) of a random sample of 60 students:

Find the minimum and maximum (155, 191).
Decide on the bin size: 2s will give 19 bins and 5s will give 8, so use the latter.
Write the stems: 15, 16, 16, 17, 17, 18, 18, 19.

Go through the data set writing the next digit (the leaf) against the appropriate stem value. Where a stem value is repeated, leaves 0 to 4 go against the first and 5 to 9 against the second.
Rewrite the table with the leaf values in ascending order.
Optional extra: add cumulative frequencies to the plot.

Note that the numbers in a stem-and-leaf plot may be truncated. Thus any summary statistics obtained by using them are sometimes rather too small.

Minitab output: Stem-and-leaf of heights, N = 60 (the central row is marked (20))

Notes: The first column above gives cumulative frequencies; these are calculated from below and from above. The (20) indicates that there are 20 students in the category where these meet.

There are only a limited number of different leaf values because the original values were recorded in inches and then converted to centimetres. The following is a plot of the same data in inches; here each bin on the stem has two leaf values associated with it.

Minitab output: Stem-and-leaf of C2, N = 60 (the central row is marked (19))
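The construction steps for a stem-and-leaf plot with bins of width 5 can be sketched as follows. This is a plain Python illustration, not Minitab's algorithm; the heights are hypothetical (the sample of 60 is not reproduced here) and the function name `stem_and_leaf` is an assumption.

```python
def stem_and_leaf(values):
    """Rows of (stem, leaves) for positive data, using bins of width 5.

    Each stem appears twice: leaves 0 to 4 go against the first row and
    leaves 5 to 9 against the second. Values are truncated to whole numbers.
    """
    rows = {}
    for v in sorted(int(v) for v in values):   # int() truncates; sorting orders the leaves
        stem, leaf = divmod(v, 10)
        rows.setdefault((stem, leaf >= 5), []).append(leaf)

    first, last = min(rows), max(rows)
    out = []
    stem, upper = first
    while (stem, upper) <= last:
        leaves = rows.get((stem, upper), [])    # keep empty rows so the outline is honest
        out.append((stem, "".join(str(d) for d in leaves)))
        stem, upper = (stem, True) if not upper else (stem + 1, False)
    return out


# Hypothetical heights in cm with the same minimum and maximum as the example.
heights = [155, 191, 160, 163, 167, 172, 172, 178, 181, 185]
for stem, leaves in stem_and_leaf(heights):
    print(f"{stem:>3} | {leaves}")
```

Because the rows run from the bin holding the minimum to the bin holding the maximum, this sample produces the stems 15, 16, 16, 17, 17, 18, 18, 19.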

Note that changes in the starting value and in the bin width mean that the outline has a slightly different shape from the previous plot.

Graphical methods are often a useful first step in looking at sample data and are also useful for presenting conclusions to others. If we wish to draw conclusions about the data, we usually need to use numerical methods.

Averages (Rees)

There are many summary statistics that can be calculated for quantitative variables. Those that give the location are the most important. There are several ways of describing where the centre of the data is: the sample mean and the median are the commonest.

Sample Mean

The following symbols will be used to describe the calculations:
Sample size: $n$ (number of values in sample)
Data values: $x_1, x_2, \ldots, x_n$
Sample mean: $\bar{x}$

$\bar{x} = \dfrac{\text{sum of data values}}{\text{sample size}} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{1}{n}\sum_{i=1}^{n} x_i$

The suffix notation is used to make it clear which values are being added. This can often be omitted:

$\bar{x} = \dfrac{\sum x}{n}$

Examples:
Data: 1, 2, 3   Total = 6   Mean = 2.0
Data: 4, 6   Total = 10   Mean = 5.0
Data: 1, 2, 3, 4, 6   Total = 16   Mean = 3.2

Note that the overall mean is not the average of the two separate means.

Example: Failure times. Fourteen electrical components were tested to destruction. Failure times in hours were:

$n = 14$
$\sum x_i = 1138$
$\bar{x} = \dfrac{1138}{14} = 81.3$
Mean failure time = 81.3 hours

To calculate an overall mean from group means, it is important to allow for any differences in sample sizes. One way to do this is by calculating the individual totals. These are then added to get the overall total and hence the overall mean.

Example:
Mean age of 50 males = 22.6 years
Mean age of 30 females = 19.4 years
Mean age of combined group of 80 people?
Male total = $50 \times 22.6 = 1130$
Female total = $30 \times 19.4 = 582$
Mean age = $\dfrac{1130 + 582}{80} = 21.4$ years

Example: Given the mean salary of 50 people and the mean salary of the 30 males among them, the female mean is found by working back from the totals:
Overall total = $50 \times$ overall mean
Male total = $30 \times$ male mean
Female total = Overall total $-$ Male total
Female mean = Female total $/\, 20$

Sometimes data is available as a frequency table.

Example: Articles produced in a manufacturing process were examined by taking regular samples of 20 articles. The number of defective articles in each sample was noted. The following data were obtained:

Number of defectives | Frequency

Note that the 11 in the frequency table means that the value 0 occurred 11 times. So:
$n = \sum f_i = 40$
$\sum x_i = \sum f_i x_i = 65$
$\bar{x} = \dfrac{65}{40} = 1.625$

The general rule for frequency data is:

$\bar{x} = \dfrac{\sum f_i x_i}{\sum f_i}$

where $f_i$ is the frequency of outcome $x_i$.

Averages: Median

The median is the middle value when the data have been sorted. The median is used when the data set contains extreme values which would distort the mean. For example, in data set B (failures), $\bar{x} = 81.3$. This value is rather larger than what might be considered a typical value from this data set.

The data must be sorted before the median can be calculated. If we have $n$ data points then:
if $n$ is odd, it is the $\left(\frac{n+1}{2}\right)$th sorted value;
if $n$ is even, it is the average of the $\left(\frac{n}{2}\right)$th and the $\left(\frac{n}{2} + 1\right)$th sorted values.

Example: Failure times
Sorted data: 4, 5, 10, 28, 37, 45, 55, 75, 76, 82, 102, 139, 197, 283
$n = 14$, so we require the average of the 7th and 8th sorted values.
So median = $\dfrac{55 + 75}{2} = 65$ hours.

The Median, Robustness and Skewness

The mean is the best measure of location if the shape of the plotted data is roughly symmetric. The median is a more robust measure of what is typical than the mean, i.e. the

median is not really affected by a few extreme values.

Example: Consider the two artificial data sets {1, 2, 3, 4, 5} and {1, 2, 3, 4, 90}. Both sets have a median of 3, but the mean of the first is 3 and the mean of the second is 20.

Example: Failure data. The extreme value of 283 distorted the mean. This is an example of data that are skew. A plot shows that the data are not symmetric, but have a tail on the right.

Other measures of location are sometimes used. Minitab uses the trimmed mean: the largest and smallest 5% of values are removed and the mean of the remaining 90% is calculated. For frequency tables, the mode is sometimes used; this is the most frequent category (or group).
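The averages described in this section can be reproduced with Python's standard `statistics` module. The sketch below uses the failure times and the age example from the notes; the helper names `combined_mean` and `frequency_mean` and the small frequency table are hypothetical, introduced only for illustration.

```python
from statistics import mean, median

# Failure times in hours, as listed (sorted) in the notes.
failure_times = [4, 5, 10, 28, 37, 45, 55, 75, 76, 82, 102, 139, 197, 283]


def combined_mean(groups):
    """Overall mean from (sample size, group mean) pairs, via the group totals."""
    overall_total = sum(n * group_mean for n, group_mean in groups)
    overall_size = sum(n for n, _ in groups)
    return overall_total / overall_size


def frequency_mean(table):
    """Mean of a frequency table given as {value: frequency}."""
    return sum(f * x for x, f in table.items()) / sum(table.values())


print(round(mean(failure_times), 1))   # sample mean
print(median(failure_times))           # average of the 7th and 8th sorted values

# The median is robust to one extreme value; the mean is not.
print(mean([1, 2, 3, 4, 5]), median([1, 2, 3, 4, 5]))
print(mean([1, 2, 3, 4, 90]), median([1, 2, 3, 4, 90]))

# 50 males with mean age 22.6 and 30 females with mean age 19.4.
print(round(combined_mean([(50, 22.6), (30, 19.4)]), 1))

# Hypothetical frequency table: value -> number of times it occurred.
print(frequency_mean({0: 3, 1: 4, 2: 3}))
```

Working from totals rather than averaging the two group means is what keeps the combined mean correct when the group sizes differ.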

Measures of Spread (Rees)

It is important to compute a measure of spread as well as an average, in other words a measure of whether the data values are spread out or are bunched together. The simplest measure is:

Range = Maximum $-$ Minimum

This is not very satisfactory because it tends to increase as the sample size increases.

Another possible measure of spread is to take the (positive) distance $d_i$ of each point $x_i$ from the mean, i.e. $d_i = |x_i - \bar{x}|$. The mean of these distances could be used, but this turns out to be awkward to work with, so a different measure is used.

Standard deviation (Rees 4.8)

1. Find the squared distance between $x_i$ and $\bar{x}$: $d_i^2 = (x_i - \bar{x})^2$
2. Add these up and divide the total by $n - 1$ (instead of $n$) to find a mean.
3. Undo the squaring: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

Notes:
1. Alternative formula (easier to compute): $s = \sqrt{\dfrac{1}{n - 1}\left(\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}\right)}$
2. The divisor $n - 1$ is used because estimating $\bar{x}$ uses up one of the pieces of information that we have. On the very rare occasions that the population mean is known exactly, the divisor $n$ is used. This is called the population standard deviation to distinguish it from the sample standard deviation, which is the usual form.
3. Many calculators provide a quick way to find a sample mean and standard deviation. There will often be a button labelled $\sigma_{n-1}$.
4. The standard deviation is in the same units as the data. E.g. if $x_i$ is measured in cm, then $s$ is measured in cm.
5. The larger the value of $s$, the more the data is spread out about the mean. It is analogous to the moment of inertia in physics.
6. The square of the standard deviation is called the Variance.
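The two formulas for the sample standard deviation can be compared directly. The sketch below is a minimal Python illustration; the function names `sd_definition` and `sd_shortcut` are assumptions, and `statistics.stdev` (which also uses the divisor n - 1) is included only as a cross-check.

```python
import math
from statistics import stdev


def sd_definition(data):
    """Sample s.d. from the definition: squared distances from the mean, divisor n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))


def sd_shortcut(data):
    """Sample s.d. from the alternative formula using the sum and the sum of squares."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))


data = [2, 3, 7]
print(sd_definition(data))
print(sd_shortcut(data))
print(stdev(data))   # standard-library sample standard deviation
```

The shortcut form needs only running totals of x and x squared, which is why it suits hand calculation, but it can lose accuracy in floating point when the mean is large relative to the spread.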

Example: 3 data values 2, 3, 7
$n = 3$
$\bar{x} = \dfrac{12}{3} = 4$
$(x_i - \bar{x}) = -2, -1, 3$
$\sum_{i=1}^{n} (x_i - \bar{x})^2 = 4 + 1 + 9 = 14$
Sample variance = $\dfrac{14}{3 - 1} = 7$

Alternatively
$\sum_{i=1}^{n} x_i = 2 + 3 + 7 = 12$
$\sum_{i=1}^{n} x_i^2 = 4 + 9 + 49 = 62$
Variance = $\dfrac{1}{3 - 1}\left(62 - \dfrac{12^2}{3}\right) = \dfrac{1}{2}(62 - 48) = 7$

So sample Standard Deviation = $\sqrt{7} = 2.65$

The Greek letter $\sigma$ (sigma) is often used to denote standard deviation. Calculators often have keys labelled $\sigma_n$ and $\sigma_{n-1}$. The latter is the one to use for the sample S.D.

Example: Failure times
$n = 14$
$\sum x_i = 1138$
$\sum x_i^2 = 174092$
$s = \sqrt{\dfrac{1}{13}\left(174092 - \dfrac{1138^2}{14}\right)} = \sqrt{\dfrac{81588.9}{13}} = 79.2$ hours

More Standard Deviation Examples: In each case, the mean is 5.0
Sample data: 5, 5, 5   Standard Deviation = 0.0
Sample data: 4, 5, 6   Standard Deviation = 1.0
Sample data: 1, 5, 9   Standard Deviation = 4.0
Sample data: 1, 1, 5, 5, 9, 9   Standard Deviation = 3.58

Sample data: 1, 2, 3, 4, 5, 6, 7, 8, 9   Standard Deviation = 2.74

An earlier example showed that when groups are combined, the overall mean is obtained by adding totals rather than averaging means. To obtain an overall standard deviation, it is necessary to add up the sums of squared values.

Example:
Data: 1, 2, 3   Total SS = 14   S.D. = 1.00
Data: 4, 6   Total SS = 52   S.D. = 1.41
Data: 1, 2, 3, 4, 6   Total SS = 66   S.D. = 1.92

The overall s.d. = $\sqrt{(66 - 16^2/5)/4} = 1.92$

Note that the abbreviations SS and S.D. are often used for the Sums of Squares and for Standard Deviation respectively.

It is possible to calculate the total and total sums of squares from the group sample size, mean and standard deviation. This is done by reversing the calculations above. For example: $\sum x = n\bar{x}$. Beware of rounding errors if you do this.

Standard deviation from a frequency table

It is possible to enter all the data directly, but it is usually best to use the table by adding extra columns.

Example: Number of defectives

Number (x) | Freq. (f) | fx | fx²

Thus: $n = 40$, $\sum x = 65$, $\sum x^2 = 197$

So:
$\bar{x} = \dfrac{65}{40} = 1.625$
s.d. = $\sqrt{\dfrac{1}{40 - 1}\left(197 - \dfrac{65^2}{40}\right)} = 1.53$

Accuracy

It is rarely useful to report the value of a standard deviation to more than 3 significant figures. So: Report as 1.23 or possibly and as 98.8

A mean should usually be reported to the same number of decimal places as the corresponding standard deviation, or to one less decimal place. So for the two standard deviations above: A mean of might be given as 5.43 and a mean of might be given as 543.

Notes
1. It is important not to round off numbers too early, especially when finding standard deviations.
2. If reported values are used in other calculations, the accurate values should be used, rather than the rounded values.

Rounding

If a number is exactly 2.345, there is no universally agreed way to round to 2 d.p. A good rule is to round to the nearest even number. So 2.345 rounds to 2.34.

Inter-quartile range (Rees 4.9)

If the data is skew, the inter-quartile range may be a better measure of spread than the standard deviation.

The lower quartile Q1 is the value such that a quarter of the sample takes values less than Q1. How do we calculate it? If we have $n$ data points arranged in ascending order then Q1 is the $\left(\frac{n+1}{4}\right)$th observation.

The upper quartile Q3 is the value such that a quarter of the sample takes values greater than Q3. How do we calculate it? If we have $n$ data points arranged in ascending order then Q3 is the $\left(\frac{3(n+1)}{4}\right)$th observation. Equivalently, it is the $\left(\frac{n+1}{4}\right)$th observation when counting down

from the largest value.

The inter-quartile range (IQR) is defined to be IQR = Q3 $-$ Q1.

Note: This is the method of calculation used by Minitab. Some books use slightly different ways of estimating the quartiles. You may also encounter Deciles and Percentiles; these divide data into tenths and hundredths.

Example: Failure times, $n = 14$
So Q1 is the $\frac{n+1}{4} = \frac{14+1}{4} = 3.75$th observation.
Q3 is the $\frac{3(n+1)}{4} = \frac{3(14+1)}{4} = 11.25$th observation.
Data: 4, 5, 10, 28, 37, 45, 55, 75, 76, 82, 102, 139, 197, 283
3rd observation = 10 and 4th observation = 28
So: Q1 = $10 + 0.75(28 - 10) = 10 + 13.5 = 23.5$ hours
11th observation = 102 and 12th observation = 139
So: Q3 = $102 + 0.25(139 - 102) = 102 + 9.25 = 111.25$ hours
The inter-quartile range IQR = $111.25 - 23.5 = 87.75$ hours.

Note: A stem-and-leaf plot presents data in a sorted form, so it can be used to find the median and quartiles. However, the resulting values may be rather too small, because the plot sometimes truncates numbers.

Box Plot

The quartiles can be used to create a display of the data called a box-and-whisker plot or box plot. The box is formed from the quartiles and the whiskers connect the box to the maximum and the minimum.
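The quartile rule above, with linear interpolation between neighbouring observations, can be sketched as follows. The function names are illustrative, and clamping the position to the range 1 to n for very small samples is an assumption of this sketch rather than something stated in the notes.

```python
def value_at_position(sorted_data, position):
    """Value at a 1-based, possibly fractional position in sorted data.

    A position such as 3.75 means three quarters of the way from the
    3rd observation to the 4th.
    """
    n = len(sorted_data)
    position = min(max(position, 1), n)   # assumption: clamp positions outside 1..n
    lower = int(position)                 # whole part of the position
    fraction = position - lower
    if lower == n:
        return sorted_data[-1]
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)


def quartiles(data):
    """Lower and upper quartiles using positions (n + 1)/4 and 3(n + 1)/4."""
    s = sorted(data)
    n = len(s)
    return value_at_position(s, (n + 1) / 4), value_at_position(s, 3 * (n + 1) / 4)


failure_times = [4, 5, 10, 28, 37, 45, 55, 75, 76, 82, 102, 139, 197, 283]
q1, q3 = quartiles(failure_times)
print(q1, q3, q3 - q1)
```

Other packages and textbooks place the quartiles at slightly different positions, so small differences from this result are to be expected elsewhere.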

[Box plot, labelled from left to right: Min, LQ, Median, UQ, Max]

If the data are skew, the median will not be near the middle of the box, and one whisker will be much longer than the other.

The values used in drawing a boxplot are called a five number summary.

Example: Failure times
The five number summary is {4, 23.5, 65, 111.25, 283}
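A five number summary can also be obtained from Python's standard library. This sketch assumes Python 3.8 or later, where `statistics.quantiles` with `method="exclusive"` places the quartiles at positions based on n + 1; `five_number_summary` is an illustrative name.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (min, lower quartile, median, upper quartile, max)."""
    q1, med, q3 = quantiles(data, n=4, method="exclusive")
    return min(data), q1, med, q3, max(data)


failure_times = [4, 5, 10, 28, 37, 45, 55, 75, 76, 82, 102, 139, 197, 283]
low, q1, med, q3, high = five_number_summary(failure_times)
print(low, q1, med, q3, high)

# Whisker lengths: a much longer upper whisker points to a tail on the right.
print(q1 - low, high - q3)
```

For the failure times the upper whisker is far longer than the lower one, which matches the earlier remark that these data are skew.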

Properties of Mean and Standard Deviation

If data are roughly symmetrical about the mean, then:
Approximately 2/3 will be within 1 s.d. of the mean.
Approximately 95% will be within 2 s.d. of the mean.
Usually all will be within 3 s.d. of the mean.
The Inter-Quartile Range will be approximately 1.35 standard deviations.

Standard Scores

An individual data point could be considered to be extreme if it is several standard deviations away from the mean. The standard score (z-score, standardised value) for $x$ is:

$z = \dfrac{x - \bar{x}}{s}$

This measures how many standard deviations $x$ is above or below the mean.

Example: Failure times
Recall that $\bar{x} = 81.3$ and $s = 79.2$.
$z = 0.33$: small $z$ (typical)
$z = \dfrac{283 - 81.3}{79.2} = 2.55$: large $z$ (extreme)
This suggests that 283 might be unusual.

Change of Scale

If a constant is added to all of the data values, the Mean is increased by the same constant; the S.D. is unchanged.
If all the data values are multiplied by a constant, the Mean and S.D. are both multiplied by the same constant.

Example: Temperature conversion from °C to °F. Need to multiply by 1.8 and add 32.
Celsius Mean = 15 °C and S.D. = 5.5 °C
Fahrenheit Mean = $1.8 \times 15 + 32 = 59$ °F
Fahrenheit S.D. = $1.8 \times 5.5 = 9.9$ °F
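The standard score and the change-of-scale rules can be checked with a short Python sketch. The Celsius readings below are hypothetical, chosen only so that the rules can be verified on actual data; `z_score` is an illustrative name.

```python
from statistics import mean, stdev


def z_score(x, xbar, s):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - xbar) / s


# Largest failure time, using the mean and s.d. quoted in the notes.
print(round(z_score(283, 81.3, 79.2), 2))

# Hypothetical temperatures in Celsius, converted to Fahrenheit.
celsius = [8.0, 12.5, 15.0, 17.5, 22.0]
fahrenheit = [1.8 * c + 32 for c in celsius]

# The mean is multiplied by 1.8 and shifted by 32; the s.d. is only multiplied by 1.8.
print(mean(fahrenheit), 1.8 * mean(celsius) + 32)
print(stdev(fahrenheit), 1.8 * stdev(celsius))
```

The added constant 32 moves every value by the same amount, so it changes the mean but leaves the distances from the mean, and hence the standard deviation, unchanged.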
