Quantitative Tools for Research
KASHIF QADRI
Descriptive Analysis
Lecture Week 4
Overview
Measurement of Central Tendency / Location
    Mean, Median & Mode
    Quantiles (Quartiles, Deciles, Percentiles)
Measurement of Dispersion
    Range & Quartile Deviation
    Variance and Standard Deviation
The concept of outliers

Descriptive Analysis
Describing the characteristics of the data
Revealing the distribution
Leading towards further analysis
Concluding on the basis of the descriptive analysis
Measures of Central Tendency

Central Tendency
All the values in a data set tend towards its center.
Central Tendency
The values in a data set tend towards a central value; this behaviour is called central tendency.
A measure of central tendency summarizes a data set in a single value.
The methods to measure central tendency are called Measures of Central Tendency, Location, Position or simply Averages.

The Arithmetic Mean
The arithmetic mean, or simply the mean, is the most widely used average.
It is defined as the sum of the observations divided by the number of observations.
It is denoted by AM, \( \mu \) or \( \bar{x} \).
The Arithmetic Mean
Mean = (Sum of observations) / (No. of observations)
Let the observations be \( x_1, x_2, x_3, \dots, x_n \). Then the arithmetic mean (\( \mu \) or \( \bar{x} \)) is:
\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum x_i}{n} \]
Here \( \mu \) denotes the population mean and \( \bar{x} \) the sample mean.

Example
Find the AM of: 4, 7, -2, 0 and 8.
Mean = (4 + 7 - 2 + 0 + 8) / 5 = 17 / 5 = 3.4
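The definition above translates directly into a few lines of Python. This is only a minimal sketch; the function name `arithmetic_mean` is chosen here for illustration and is not part of the lecture.

```python
def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Data from the worked example above.
data = [4, 7, -2, 0, 8]
mean = arithmetic_mean(data)
print(mean)  # 3.4
```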
Weighted Mean
Weights are assigned to the numbers according to their relative importance. The weighted mean is then obtained as:
\[ WM = \frac{w_1 x_1 + w_2 x_2 + \dots + w_k x_k}{w_1 + w_2 + \dots + w_k} = \frac{\sum w x}{\sum w} \]

Example
A student scored 45, 80 and 60 in three quizzes. The weights of these quizzes are 1, 2 and 5 respectively. Find the weighted score of this student.
Solution
\[ WM = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3}{w_1 + w_2 + w_3} = \frac{1 \times 45 + 2 \times 80 + 5 \times 60}{1 + 2 + 5} = \frac{505}{8} \]
Weighted Mean = 63.125
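The quiz example can be checked with a short Python sketch. The helper name `weighted_mean` is an illustrative choice, not something defined in the lecture.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


# Quiz scores and their weights from the example above.
scores = [45, 80, 60]
weights = [1, 2, 5]
wm = weighted_mean(scores, weights)
print(wm)  # 63.125
```

With all weights equal, the function reduces to the ordinary arithmetic mean.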
Combined Mean
The combined mean of k groups, with sizes \( n_1, \dots, n_k \) and means \( m_1, \dots, m_k \), can be obtained by:
\[ CM = \frac{n_1 m_1 + n_2 m_2 + \dots + n_k m_k}{n_1 + n_2 + \dots + n_k} = \frac{\sum n m}{\sum n} \]

Example
The means of three samples are given below. Find the combined mean.

Sample #    No. of values    Mean
1           32               1158
2           17               1897
3           26               1453
Solution
\[ CM = \frac{n_1 m_1 + n_2 m_2 + n_3 m_3}{n_1 + n_2 + n_3} = \frac{32 \times 1158 + 17 \times 1897 + 26 \times 1453}{32 + 17 + 26} = \frac{37056 + 32249 + 37778}{75} = \frac{107083}{75} \]
Combined mean = 1427.77

Properties of Arithmetic Mean
1. The sum of the deviations of all observations from the AM is always zero.
2. The sum of the squared deviations of the observations from the AM is a minimum, i.e. smaller than the sum of squared deviations from any other value.
3. Linear Transformation
4. Change of Origin & Scale
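The combined-mean solution and the first property in the list can both be verified numerically. The sketch below is illustrative only: `combined_mean` is a name introduced here, and the data set used for the property check is borrowed from the earlier arithmetic-mean example.

```python
def combined_mean(sizes, means):
    """Return sum(n * m) / sum(n) for k groups."""
    if len(sizes) != len(means):
        raise ValueError("sizes and means must have the same length")
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)


# Three samples from the example above.
sizes = [32, 17, 26]
means = [1158, 1897, 1453]
cm = combined_mean(sizes, means)
print(round(cm, 2))  # 1427.77

# Property 1: deviations from the mean sum to zero
# (up to floating-point rounding).
data = [4, 7, -2, 0, 8]
mean = sum(data) / len(data)
deviation_sum = sum(x - mean for x in data)
print(abs(deviation_sum) < 1e-9)  # True
```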
Linear Transformation
If there is a linear relationship between two variables X and Y, i.e. \( Y = a + bX \), where a and b are constants and \( b \neq 0 \), then
\[ \bar{y} = a + b\bar{x} \]

Example
The average salary of workers in a factory is $580. If their salary is raised by 2% and a further bonus of $50 is given to each, find the new average salary.
Mean of X = 580, a = 50, b = 1.02; Mean of Y = ?
Formula: Mean of Y = a + b × Mean of X = 50 + 1.02 × 580 = $641.60
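A quick numerical check of the linear-transformation property is sketched below. The lecture gives only the average salary, so the individual salaries in `salaries` are hypothetical values invented here so that their mean is 580.

```python
import math

# Hypothetical individual salaries (not from the lecture); mean is 580.
salaries = [500, 560, 600, 660]
a, b = 50, 1.02  # bonus of $50 (a) and a 2% raise (b)

old_mean = sum(salaries) / len(salaries)

# Route 1: apply the property directly, mean(Y) = a + b * mean(X).
new_mean_direct = a + b * old_mean

# Route 2: transform every salary first, then average.
new_salaries = [a + b * x for x in salaries]
new_mean_from_data = sum(new_salaries) / len(new_salaries)

print(round(new_mean_direct, 2))  # 641.6
print(math.isclose(new_mean_direct, new_mean_from_data))  # True
```

Both routes agree, which is exactly what the property states: the transformation can be applied to the mean instead of to every observation.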
The Median
The median is the value which divides the ordered data set into two equal parts (i.e. it is the middle-most observation in the ordered data).
[Figure: data line from 0% to 100% with the median marked at 50%.]

Median
The median is the middle-most observation in the arranged data.
The median divides the ordered data into two halves.
50% of the values lie below the median and 50% lie above it.
For an odd number of observations
Formula: Median = \( \left( \frac{n + 1}{2} \right) \)th observation

Find the median of: 2, -6, 0, 11, 7, 5 and -1
Median = (n + 1)/2 = (7 + 1)/2 = 4th observation
Arranged data: -6, -1, 0, 2, 5, 7, 11
Median = 2

Question
Find the median of the following data: 17, -13, 21, 9, 0, -8, 13, 7, 2
Solution: n = 9, so the median is the (9 + 1)/2 = 5th observation.
Arranged data: -13, -8, 0, 2, 7, 9, 13, 17, 21
Median = 7
For an even number of observations
Formula: Median = AM of the \( \frac{n}{2} \)th and \( \left( \frac{n}{2} + 1 \right) \)th observations

Find the median of: 4, -6, 0, 7, 4, 2, -9, 10
Here n = 8, so the median is the AM of the 4th and 5th observations.
Ordered data: -9, -6, 0, 2, 4, 4, 7, 10
Median = (2 + 4)/2 = 3

Question
Find the median of the following data: 45, 10, 36, 28, 17, 32, 11, 37, 22, 41
Here n = 10, so the median is the AM of the 5th and 6th observations.
Ordered data: 10, 11, 17, 22, 28, 32, 36, 37, 41, 45
Median = (28 + 32)/2 = 30
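The odd and even rules can be combined in one small function. This is a sketch for illustration; the name `median` is chosen here, and Python's standard `statistics.median` would give the same results.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n + 1)/2-th observation, index mid in 0-based terms.
        return ordered[mid]
    # Even n: mean of the (n/2)-th and (n/2 + 1)-th observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


median_odd = median([2, -6, 0, 11, 7, 5, -1])
median_even = median([4, -6, 0, 7, 4, 2, -9, 10])
median_question = median([45, 10, 36, 28, 17, 32, 11, 37, 22, 41])
print(median_odd)       # 2
print(median_even)      # 3.0
print(median_question)  # 30.0
```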
Quantiles
1. The values that partition the ordered data set are called quantiles.
2. There are three kinds of quantiles:
   a) Quartiles (divide the data into 4 parts)
   b) Deciles (divide the data into 10 parts)
   c) Percentiles (divide the data into 100 parts)

Quartiles
[Figure: ordered data line divided by Q1, Q2 and Q3.]
Q1 = Lower Quartile
Q2 = Median
Q3 = Upper Quartile
Quartiles
There are three quartiles: Q1, Q2 and Q3.
Q1 is the value below which 25% of the data lies.
Q2 is the value below which 50% of the data lies.
Q3 is the value below which 75% of the data lies.
Q1 = \( \frac{n + 1}{4} \)th obs. and Q3 = \( \frac{3(n + 1)}{4} \)th obs.

Example
Find the lower and upper quartiles of the data given below.
4, 3, 9, 0, 1, 6, 8, 4, 3, 0, 2, 10, 13
Arranging in ascending order (n = 13):
0 0 1 2 3 3 4 4 6 8 9 10 13
Solution
Lower Quartile (Q1)
Position of Q1 = (13 + 1)/4 = 3.5, rounded off to the 4th obs.
Q1 = 2
Upper Quartile (Q3)
Position of Q3 = 3(13 + 1)/4 = 10.5, rounded off to the 11th obs.
Q3 = 9

Deciles
Deciles divide the ordered data set into TEN parts.
[Figure: data line with D1, D2, D5 and D9 marked at 10%, 20%, 50% and 90%.]
General Formula: Di = \( \frac{i(n + 1)}{10} \)th obs.
Example
Find the 4th decile of the following series: 3, 6, 9, ..., 894, 897, 900
4th Decile (D4): here n = 300 (i.e. 300 observations)
Position of D4 = 4(n + 1)/10 = 4(301)/10 = 120.4, rounded off to 120
Hence D4 is the 120th obs., i.e. D4 = 360

Percentiles
Percentiles partition the ordered data set into 100 parts.
[Figure: data line with percentiles such as P1, P50 and P99 marked.]
General Formula: Pi = \( \frac{i(n + 1)}{100} \)th obs.
Example
Find the 78th percentile of the following series: 3, 6, 9, ..., 894, 897, 900
Position of P78 = 78(n + 1)/100 = 78(301)/100 = 234.78, rounded off to 235
P78 = 235th observation
P78 = 705

The Mode
The mode is the most frequent observation in the data, i.e. the value which is repeated the largest number of times.
If two values are repeated the same (largest) number of times, both of them are modes.
If all the values are repeated an equal number of times, then there is no mode.
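The quartile, decile and percentile examples earlier in this section all follow one positional rule: compute i(n + 1)/k and round to the nearest observation. The sketch below implements that rule as used in these slides. Two assumptions are made here: halves are rounded up (3.5 becomes 4, 10.5 becomes 11, as in the quartile solution), and the series 3, 6, 9, ... is taken to be the multiples of 3 up to 900, which is what n = 300 implies. The function name `quantile_by_position` is illustrative. Statistical libraries usually interpolate between observations instead of rounding, so their answers can differ slightly.

```python
import math


def quantile_by_position(values, i, parts):
    """i-th quantile when the ordered data are cut into `parts` parts.

    Uses position = i * (n + 1) / parts, rounded to the nearest
    observation with halves rounded up (an assumption matching the
    worked examples).
    """
    ordered = sorted(values)
    n = len(ordered)
    position = i * (n + 1) / parts
    rank = math.floor(position + 0.5)  # round half up; round() would not
    rank = min(max(rank, 1), n)        # keep the rank inside 1..n
    return ordered[rank - 1]           # ranks are 1-based, lists 0-based


data = [4, 3, 9, 0, 1, 6, 8, 4, 3, 0, 2, 10, 13]
q1 = quantile_by_position(data, 1, 4)
q3 = quantile_by_position(data, 3, 4)
print(q1, q3)  # 2 9

series = list(range(3, 901, 3))  # 3, 6, 9, ..., 900 (300 observations)
d4 = quantile_by_position(series, 4, 10)
p78 = quantile_by_position(series, 78, 100)
print(d4, p78)  # 360 705
```

Note that Python's built-in `round` rounds halves to the nearest even number (so `round(10.5)` is 10), which is why the sketch uses `math.floor(position + 0.5)` instead.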
Example
Find the mode of:
a) 1, 2, 1, 3, 2, 3, 0, 1, 4, 5, 2, 3, 3
   Mode = 3
b) 2, 2, 1, 3, 5, 0, 5, 0, 0, 4, 1, 6, 1
   Modes = 0 and 1. Why? (Both are repeated an equal number of times, three each.)
c) 1, 2, 3, 5, 4, 7, 0, 6, 8, 9, 15
   Mode: none. Why? (Every value occurs only once in the data.)

When to Use?
Mean
    The most generally used measure of central tendency
    Popular because of its mathematical properties
    Fails if the observations are widely scattered or extreme values are present
Median
    A strong measure for ordered data
    Good for qualitative ordered data
    Not affected by extreme values
    Lacks mathematical properties
Mode
    Helpful if the values are close together and repeated
    Lacks mathematical treatment
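The three mode examples can be reproduced by counting frequencies. The sketch below follows the lecture's convention that a data set has no mode when every value occurs equally often; the function name `modes` is an illustrative choice.

```python
from collections import Counter


def modes(values):
    """Return the list of modes, or an empty list if there is no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # Lecture convention: if several distinct values all occur equally
    # often, the data set has no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


modes_a = modes([1, 2, 1, 3, 2, 3, 0, 1, 4, 5, 2, 3, 3])
modes_b = modes([2, 2, 1, 3, 5, 0, 5, 0, 0, 4, 1, 6, 1])
modes_c = modes([1, 2, 3, 5, 4, 7, 0, 6, 8, 9, 15])
print(modes_a)  # [3]
print(modes_b)  # [0, 1]
print(modes_c)  # []
```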
Measurement of Dispersion

Definition
If the values in a data set are widely scattered, then central tendency alone will not describe the data adequately. Hence a measure of spread is also applied to the data.
The additional information that measures the scattered nature of a data set is called dispersion.
It is especially useful when two data sets are to be compared.
There are two types of dispersion: absolute and relative.
Absolute dispersion is expressed in the same units as the data; relative dispersion is a unit-free ratio.

The Range (R)
A measure of variation
The difference between the largest and smallest observations:
\[ \text{Range} = x_m - x_0 \]
The range ignores how the data are distributed.
[Figure: two differently distributed data sets on axes from 7 to 12, both with Range = 12 - 7 = 5.]
Formula
Range = \( x_m - x_0 \)
Coefficient of Range = \( \frac{x_m - x_0}{x_m + x_0} \)
where \( x_m \) is the largest and \( x_0 \) the smallest observation.

Example
Find the Range and Coefficient of Range of: 2, 5, 6, 10, -4, -3, 0, 5, 11
Here: \( x_m = 11 \), \( x_0 = -4 \)
R = 11 - (-4) = 15
Coeff. of R = (11 - (-4)) / (11 + (-4)) = 15/7 = 2.14
Let us take two sets of observations. Set A contains the marks of five students in Mathematics out of 25 marks, and Set B contains the marks of the same students in English out of 100 marks.
Set A: 10, 15, 18, 20, 20
Set B: 30, 35, 40, 45, 50
The values of range and coefficient of range are calculated as:

                        Range           Coefficient of Range
Set A (Mathematics)     20 - 10 = 10    (20 - 10)/(20 + 10) = 0.33
Set B (English)         50 - 30 = 20    (50 - 30)/(50 + 30) = 0.25

Set B has the larger range in absolute terms, but relative to the size of its values Set A is the more dispersed of the two.
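The range and coefficient-of-range calculations above can be sketched in Python as follows. The function names are illustrative; note that the coefficient is undefined when the largest and smallest values sum to zero.

```python
def value_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


def coefficient_of_range(values):
    """(x_m - x_0) / (x_m + x_0); undefined if x_m + x_0 == 0."""
    largest, smallest = max(values), min(values)
    if largest + smallest == 0:
        raise ValueError("coefficient of range is undefined when x_m + x_0 = 0")
    return (largest - smallest) / (largest + smallest)


data = [2, 5, 6, 10, -4, -3, 0, 5, 11]
set_a = [10, 15, 18, 20, 20]  # Mathematics, out of 25
set_b = [30, 35, 40, 45, 50]  # English, out of 100

print(value_range(data), round(coefficient_of_range(data), 2))    # 15 2.14
print(value_range(set_a), round(coefficient_of_range(set_a), 2))  # 10 0.33
print(value_range(set_b), round(coefficient_of_range(set_b), 2))  # 20 0.25
```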
Quartile Deviation
Quartile Deviation (QD) is half of the difference between the upper and lower quartiles.
Quartile deviation is also called the semi-interquartile range.
Formula
\[ QD = \frac{Q_3 - Q_1}{2} \]
Example
The students of a class scored the following marks in a certain quiz. Find the quartile deviation and coefficient of quartile deviation of the marks.
40, 12, 27, 11, 5, 33, 45, 21, 37 and 43
Ordered data: 5, 11, 12, 21, 27, 33, 37, 40, 43, 45

Solution
No. of observations = n = 10
Position of Q1 = (n + 1)/4 = 11/4 = 2.75, rounded off to the 3rd obs.
Position of Q3 = 3(n + 1)/4 = 33/4 = 8.25, rounded off to the 8th obs.
So Q1 = 12 and Q3 = 40
QD = (Q3 - Q1)/2 = (40 - 12)/2 = 14
Coefficient of QD = (Q3 - Q1)/(Q3 + Q1) = (40 - 12)/(40 + 12) = 28/52 = 0.54
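The quartile-deviation example can be reproduced with the positional rule used in these slides (position i(n + 1)/4, rounded to the nearest observation). This is a sketch under two assumptions: halves are rounded up, and the coefficient of quartile deviation is taken to be (Q3 - Q1)/(Q3 + Q1), the usual textbook definition. The helper name `quartile` is illustrative.

```python
import math


def quartile(values, i):
    """i-th quartile by the positional rule i * (n + 1) / 4, rounded half up."""
    ordered = sorted(values)
    n = len(ordered)
    rank = math.floor(i * (n + 1) / 4 + 0.5)
    rank = min(max(rank, 1), n)  # keep the rank inside 1..n
    return ordered[rank - 1]


marks = [40, 12, 27, 11, 5, 33, 45, 21, 37, 43]
q1 = quartile(marks, 1)
q3 = quartile(marks, 3)
qd = (q3 - q1) / 2                # quartile deviation
coeff_qd = (q3 - q1) / (q3 + q1)  # assumed definition of the coefficient

print(q1, q3)              # 12 40
print(qd)                  # 14.0
print(round(coeff_qd, 2))  # 0.54
```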
Variance
It is an important measure of variation.
It shows variation about the arithmetic mean.
For the Population: \( \sigma^2 = \frac{\sum (X_i - \mu)^2}{N} \)
For the Sample: \( s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} \)
For the population, use N in the denominator. For the sample, use n - 1 in the denominator.

Standard Deviation
The most important measure of variation
Shows variation about the mean:
For the Population: \( \sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} \)
For the Sample: \( s = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}} \)
For the population, use N in the denominator. For the sample, use n - 1 in the denominator.
Example
Find the variance and standard deviation of the following data:
-4, 0, 6, 10 and 23
a) Use the actual formula
b) Use the short-cut formula
Both calculations below divide by n, i.e. they treat the five values as a complete population.

Solution:
X        (X - m)    (X - m)^2    X^2
-4       -11        121          16
0        -7         49           0
6        -1         1            36
10       3          9            100
23       16         256          529
Total:
35       0          436          681
m = mean = 7

Calculations
a) Actual
m = ΣX/n = 35/5 = 7
\[ S^2 = \frac{\sum (X - m)^2}{n} = \frac{436}{5} = 87.2 \]
\[ S = \sqrt{87.2} = 9.3381 \]

b) Short-cut
\[ S^2 = \frac{\sum X^2}{n} - \left( \frac{\sum X}{n} \right)^2 = \frac{681}{5} - \left( \frac{35}{5} \right)^2 = 136.2 - (7)^2 = 87.2 \]
\[ S = \sqrt{87.2} = 9.3381 \]

Comparing Standard Deviations
[Figure: comparison of three data sets plotted on a common axis from 11 to 21. Data A: Mean = 15.5, s = 3.338. Data B: Mean = 15.5, s = 0.9258. Data C: Mean = 15.5, s = 4.57.]
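The actual and short-cut formulas should always agree, which is easy to confirm numerically. The sketch below divides by n, as the worked example does; the function names are illustrative. The last lines show how the answer changes if the same five values are instead treated as a sample and divided by n - 1.

```python
import math


def variance_actual(values):
    """Mean of the squared deviations from the mean (divides by n)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / n


def variance_shortcut(values):
    """Mean of the squares minus the square of the mean (divides by n)."""
    n = len(values)
    return sum(x * x for x in values) / n - (sum(values) / n) ** 2


data = [-4, 0, 6, 10, 23]
var_a = variance_actual(data)
var_b = variance_shortcut(data)
sd = math.sqrt(var_a)

print(round(var_a, 4), round(var_b, 4))  # 87.2 87.2
print(round(sd, 4))                      # 9.3381

# Treating the data as a sample instead: divide by n - 1.
n = len(data)
sample_var = var_a * n / (n - 1)
print(round(sample_var, 4))  # 109.0
```

The short-cut form is convenient by hand, but in floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the actual formula is the safer default in code.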
Coefficient of Variation
A measure of relative variation
Always expressed as a percentage (%)
Shows variation relative to the mean
Used to compare 2 or more groups
Formula (for sample):
\[ CV = \frac{SD}{\text{Mean}} \times 100 \]
Example
For data with SD = 9.3381 and Mean = 7:
CV = (9.3381 / 7) × 100 = 133.4%
In comparing two data sets, the data set with the smaller CV is considered more consistent.
Comparing CVs
Stock A: Average price last year = $50, Standard deviation = $5
Stock B: Average price last year = $100, Standard deviation = $5
Coefficient of Variation:
Stock A: CV = (5/50) × 100 = 10%
Stock B: CV = (5/100) × 100 = 5%

Example: The units produced by two workers last week are given below. Which worker is more consistent in terms of production?

Day         Mon    Tue    Wed    Thu    Fri
Worker A    26     55     0      112    51
Worker B    45     36     68     22     57
Solution
          Worker A            Worker B
          X       X^2         X       X^2
          26      676         45      2025
          55      3025        36      1296
          0       0           68      4624
          112     12544       22      484
          51      2601        57      3249
Total     244     18846       228     11678

Calculations
Mean
Worker A: Mean = ΣX/n = 244/5 = 48.8
Worker B: Mean = ΣX/n = 228/5 = 45.6
Variance (short-cut formula)
\[ S^2 = \frac{\sum X^2}{n} - \left( \frac{\sum X}{n} \right)^2 \]
Worker A: S^2 = 18846/5 - (244/5)^2 = 3769.2 - 2381.44 = 1387.76
Worker B: S^2 = 11678/5 - (228/5)^2 = 2335.6 - 2079.36 = 256.24

Standard Deviation
Worker A: S = \( \sqrt{1387.76} \) = 37.25
Worker B: S = \( \sqrt{256.24} \) = 16.01

Coefficient of Variation
\[ CV = \frac{S}{\bar{X}} \times 100 \]
Worker A: CV = (37.25 / 48.8) × 100 = 76.33%
Worker B: CV = (16.01 / 45.6) × 100 = 35.11%
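The worker comparison can be reproduced end to end in Python. This sketch uses the short-cut variance with n in the denominator, as in the calculation above, and keeps unrounded intermediate values, so the last digit may differ slightly from a hand calculation that rounds the standard deviation first. The function name is illustrative.

```python
import math


def coefficient_of_variation(values):
    """CV in percent: standard deviation (dividing by n) over the mean, times 100."""
    n = len(values)
    mean = sum(values) / n
    variance = sum(x * x for x in values) / n - mean ** 2  # short-cut formula
    return math.sqrt(variance) / mean * 100


worker_a = [26, 55, 0, 112, 51]
worker_b = [45, 36, 68, 22, 57]

cv_a = coefficient_of_variation(worker_a)
cv_b = coefficient_of_variation(worker_b)
print(round(cv_a, 1), round(cv_b, 1))  # 76.3 35.1

# The worker with the smaller CV is the more consistent one.
more_consistent = "Worker B" if cv_b < cv_a else "Worker A"
print(more_consistent)  # Worker B
```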
Summary of Results
                  Worker A    Worker B
Mean              48.8        45.6
Variance (S^2)    1387.76     256.24
SD (S)            37.25       16.01
CV                76.33%      35.11%
Result: Here CV of B < CV of A, hence Worker B is more consistent in his performance.

Inter Quartile Range
IQR = Q3 - Q1
Example: The quartiles of a data set are Q1 = 25 and Q3 = 38. Find the IQR.
IQR = 38 - 25 = 13
Concept of Outlier
Extreme values in a data set are called outliers.
It is sometimes difficult to identify outliers, especially when the data set is large.
Our results are affected by outliers.
Hence we detect outliers, remove them, and then perform our analysis.

How to find outliers?
The lower limit and upper limit of a data set are:
Lower limit = Q1 - 1.5 × IQR
Upper limit = Q3 + 1.5 × IQR
Any value below the lower limit or above the upper limit is treated as an outlier.
Example
Find the outliers, if any, in the following data: the weekly TV viewing time (in hours) of 20 people.
25 41 27 32 43 66 35 31 15 5
34 26 32 38 16 30 38 30 20 21
Q1 = 23, Q3 = 36.5
IQR = 36.5 - 23 = 13.5
Lower limit = Q1 - 1.5 × IQR = 23 - 1.5 × 13.5 = 2.75 hours
Upper limit = Q3 + 1.5 × IQR = 36.5 + 1.5 × 13.5 = 56.75 hours
There is only one outlier: 66.
A weekly viewing time of 66 hours is outside the usual pattern of the data.
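The outlier check in this example can be sketched as follows. The quartiles Q1 = 23 and Q3 = 36.5 are taken as given from the example rather than recomputed, because different quartile rules give slightly different values. The function name `iqr_limits` and the parameter `k` are illustrative.

```python
def iqr_limits(q1, q3, k=1.5):
    """Return (lower limit, upper limit) = (Q1 - k * IQR, Q3 + k * IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Weekly TV viewing time (hours) of 20 people, from the example above.
hours = [25, 41, 27, 32, 43, 66, 35, 31, 15, 5,
         34, 26, 32, 38, 16, 30, 38, 30, 20, 21]

# Quartiles as stated in the example (not recomputed here).
lower, upper = iqr_limits(23, 36.5)
outliers = [x for x in hours if x < lower or x > upper]

print(lower, upper)  # 2.75 56.75
print(outliers)      # [66]
```

The smallest value, 5 hours, lies above the lower limit of 2.75, so it is not flagged.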