Lecture 1: Description of Data

Readings: Sections 1.2, 2.1-2.3

1 Variable

Example 1
a. Write two complete and grammatically correct sentences, explaining your primary reason for taking this course and then describing what the term statistics means to you.
b. For each word in your response to part a, record the number of letters in the word:
c. Did every word in your two sentences contain the same number of letters?

Definition
- A variable is any characteristic of a person or thing that can be assigned a number or a category.
- The person or thing to which the number or category is assigned, such as a student in your class, is called the observational unit.
- Data consist of the numbers or categories recorded for the observational units in a study.
- Variability refers to the phenomenon of a variable taking on different values or categories from observational unit to observational unit.
- A quantitative variable measures a numerical characteristic such as height, whereas a categorical variable records a group designation such as gender.

Example 2
Now consider the students in your class as observational units. Classify each of the following variables as categorical or quantitative.
- How many hours you have slept in the past 24 hours
- Whether you have slept for at least 7 hours in the past 24 hours
- How many states you have visited
- Handedness (which hand you write with)
- Day of the week on which you were born
- Gender
- Average study time per week
- Score on the first exam in this course
Still considering yourself and your classmates as observational units, can the average height of students in the class legitimately be considered a variable? What about the percentage of students in the class who have used a cell phone today? Explain.

Example 3
Suppose that the observational units of interest are the fifty states. Identify which of the following are variables and which are not. Also classify the variables as categorical or quantitative.
- Gender of the state's current governor
- Number of states that have a female governor
- Percentage of the state's residents older than 65 years of age
- Highest speed limit in the state
- Whether the state's name contains one word
- Average income of the adult residents of the state
- How many states were settled before 1865

Example 4
For each of the following questions, identify the observational units and variables. Also classify each variable as quantitative or categorical.
a. An economist suspects that chief executive officers (CEOs) of American companies tend to be taller than the national average height of 69 inches, so she takes a random sample of 100 CEOs and records their heights.
   Observational units:
   Variable (Type):
b. A conservationist recorded the weather (clear, partly cloudy, cloudy, rainy) and the number of cars parked at noon at a trailhead on each of 18 days.
   Observational units:
   Variable (Type):
c. A psychologist shows a videotaped interview of a married couple to a sample of 150 marriage counsellors. Each counsellor is asked to predict whether the couple will still be married five years later. The psychologist wants to test whether marriage counsellors make the correct prediction more than half the time.
   Observational units:
   Variable (Type):
d. A psychologist gives an SAT-like exam to 200 African-American college students. Half of the students are randomly assigned to use a version of the exam that asks them to indicate their race, and the other half are randomly assigned to use a version of the exam that does not ask them to indicate their race. The psychologist suspects that those students who are not asked to indicate their race will score significantly higher on the exam than those who are asked to indicate their race.
   Observational units:
   Variable (Type):
e. An economist randomly assigns four actors to go to ten different car dealerships each and negotiate the best price they can for a particular model of car. The four people are all the same age, are dressed similarly, and tell the car salespeople that they have the same occupation and neighbourhood of residence. One of the actors is a white male, one is a black male, one is a white female, and one is a black female. The economist wants to test whether the average prices differ significantly among these four types of customers.
   Observational units:
   Variable (Type):

Wrap up
You encountered the most fundamental concept of statistics: variability. This concept will be central throughout the course. Some useful definitions to remember and habits to develop from this topic include:
- Always consider data in context and anticipate reasonable values for the data collected and analyzed.
- A variable is a characteristic that varies from person to person or from thing to thing.
- The person or thing is called an observational unit.
- Variables can be classified as categorical or quantitative, depending on whether the characteristic is a categorical designation (such as gender) or a numerical value (such as height).
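As a small illustration of Example 1, the following Python sketch treats each word of a sentence as an observational unit and records two variables for it: the number of letters (quantitative) and whether the word is "long" or "short" (categorical). The sentence, the three-letter cutoff, and all names are assumptions made for illustration only; they are not part of the course material.

```python
# Hypothetical response sentence; any text would do.
sentence = "Statistics is the study of variability in data"

# Each word is an observational unit.
words = sentence.split()

# Quantitative variable: number of letters in the word.
letters = [len(word) for word in words]

# Categorical variable (illustrative cutoff of three letters).
size = ["long" if n > 3 else "short" for n in letters]

# Variability: the variable takes different values across observational units.
varies = len(set(letters)) > 1

for word, n, category in zip(words, letters, size):
    print(f"{word:12s} {n:2d} {category}")
print("Letter counts vary across words:", varies)
```

Note that a summary such as the average word length would describe the whole sentence, not an individual word, so it is not a variable when the words are the observational units.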
2 Visualizing Data

2.1 Frequency Table and Histogram

Example 5. (Binge Drinking in College)
Binge drinkers: five or more drinks in a row for males, four or more drinks in a row for females.
Population: undergraduate students
Sample: a sample of students in (a sample of) 30 colleges
Variable: percentage of undergraduate students who are binge drinkers in a college
Data (in increasing order):
13 18 21 26 26 27 29 33 35 36
41 41 42 46 46 46 48 51 52 53
55 57 58 58 59 60 62 64 66 67

Frequency distribution: a way to summarize data by displaying the number of times (frequency) or proportion of times (relative frequency) each value occurs in the data set.

Class Index   Class Interval   Frequency   Relative Frequency
1             [10, 20)         2           0.067
2             [20, 30)         5           0.167
3             [30, 40)         3           0.100
4             [40, 50)         7           0.233
5             [50, 60)         8           0.267
6             [60, 70)         5           0.167

[Figure: Frequency Histogram (left) and Relative Frequency Histogram (right) of the data, with class boundaries 10, 20, ..., 70 on the horizontal axis.]

Three steps to create a histogram:
1. Group observations into classes and create the frequency table (classes are also called bins).
2. Mark the class boundaries on a horizontal measurement axis.
3. Above each class interval, draw a rectangle whose height is the frequency or relative frequency.

How many classes?
Not too many, not too few.

[Figure: two frequency histograms of the same data, titled "Too many classes" (left) and "Too few classes" (right).]

- Use 5 to 15 classes for a moderate sample size (n = 50); more classes may be used if the sample size is larger.
- A reasonable rule of thumb is: number of classes $\approx \sqrt{\text{sample size}}$.

Histogram with unequal class widths:

rectangle height = relative frequency / class width

[Figure: two histograms of data with unequal class widths; vertical axes: Frequency (left) and Density (right).]

Bar chart for categorical data: an analogue of the histogram.

Example 6. Motorcycle Monthly was interested in the types of motorcycles its readers ride. 120 subscribers were randomly selected to be surveyed. Here are their responses:

Manufacturer       Frequency   Relative Frequency
Honda              41          0.34
Yamaha             27          0.23
Kawasaki           20          0.17
Harley-Davidson    18          0.15
BMW                3           0.03
Other              11          0.09

Pareto diagram: categories appear in order of decreasing frequency, except for the last miscellaneous class.
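The tabulations in this section can be sketched in Python as a rough check: a frequency table with left-closed classes, the rectangle height (density) for possibly unequal class widths, the square-root rule of thumb, and the Pareto ordering for Example 6. The function names are hypothetical, and both data sets below are assumed readings of Examples 5 and 6, chosen to agree with the frequency tables in the notes.

```python
import math

# Example 5 percentages in increasing order (assumed reading of the data,
# chosen to agree with the frequency table).
binge = [13, 18, 21, 26, 26, 27, 29, 33, 35, 36,
         41, 41, 42, 46, 46, 46, 48, 51, 52, 53,
         55, 57, 58, 58, 59, 60, 62, 64, 66, 67]


def frequency_table(data, edges):
    """Return (interval, frequency, relative frequency, density) rows.

    Classes are left-closed, right-open: [edges[k], edges[k + 1]).
    Density is relative frequency divided by class width, which is the
    rectangle height to use when class widths are unequal.
    """
    n = len(data)
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        freq = sum(1 for x in data if lo <= x < hi)
        rel = freq / n
        rows.append(((lo, hi), freq, rel, rel / (hi - lo)))
    return rows


def pareto_order(counts, last="Other"):
    """Sort categories by decreasing frequency, keeping `last` at the end."""
    main = sorted((c for c in counts if c != last),
                  key=lambda c: counts[c], reverse=True)
    return main + ([last] if last in counts else [])


table = frequency_table(binge, edges=[10, 20, 30, 40, 50, 60, 70])
for (lo, hi), freq, rel, dens in table:
    print(f"[{lo}, {hi})  {freq:2d}  {rel:.3f}  {dens:.4f}")

# Rule of thumb: number of classes is roughly sqrt(sample size).
print("suggested number of classes:", round(math.sqrt(len(binge))))

# Example 6 counts (assumed reading of the table; they total 120).
bikes = {"Kawasaki": 20, "Other": 11, "Honda": 41,
         "BMW": 3, "Harley-Davidson": 18, "Yamaha": 27}
total = sum(bikes.values())
for name in pareto_order(bikes):
    print(f"{name:16s} {bikes[name]:3d}  {bikes[name] / total:.2f}")
```

With equal class widths, as in Example 5, the density column is just the relative frequency divided by 10, so the frequency, relative frequency, and density histograms all have the same shape.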
[Figure: bar chart of the frequencies for Honda, Yamaha, Kawasaki, Harley-Davidson, BMW, and Other.]

2.2 Shapes of Distributions
- Unimodal, bimodal, or multimodal?
- Symmetric or skewed?
- Positively/right skewed, or negatively/left skewed?

[Figure: four example histograms labeled Symmetric, Bimodal, Positively Skewed, and Negatively Skewed.]

3 Numerical Summary of Data

3.1 Measures of center

Sample mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Example: observations 6, 5, 7, 7, 6 (the sample mean is $\bar{x} = 31/5 = 6.2$).

Sample median
- If n is odd, the sample median is the middle ordered value: $\tilde{x} = \left(\frac{n+1}{2}\right)^{\text{th}}$ ordered value.
- If n is even, the sample median is the average of the two middle ordered values: $\tilde{x}$ = average of the $\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n}{2}+1\right)^{\text{th}}$ ordered values.

Example: observations 7, 9, 10, 12, 14 (the sample median is 10).
Example: observations 3, 4, 9, 12, 14, 19 (the sample median is 10.5).

- If the histogram is fairly symmetric, the sample mean and sample median will be similar.
- The sample mean is more sensitive to outliers (extreme values) than the sample median is.

Data              \bar{x}   \tilde{x}
1, 2, 3, 4, 5     3         3
1, 2, 3, 4, 90    20        3

Trimmed mean: a compromise between the mean and the median (semi-sensitive to extreme values).

Example 7. n = 20 observations of lifetime (in hours) of an incandescent lamp:
612 623 666 744 883 898 964 970 983 1003
1016 1022 1029 1058 1085 1088 1122 1135 1197 1201

- 10% trimmed mean: drop the smallest 10% and largest 10% of the observations and average the rest (the 10% trimmed mean is 979.125).
- 20% trimmed mean: drop the smallest 20% and largest 20% of the observations and average the rest (the 20% trimmed mean is 999.9167).

3.2 Measures of variability

Motivation: means and medians do not give a full picture.

Example: Midterm scores of students from two sections of a STAT course
[Figure: two histograms of midterm scores, one per section, with horizontal axes running from 50 to 100.]

Example 8: Three groups of data with 9 observations each

Group    1   2   3   4   5   6   7   8   9
A       30  35  40  45  50  55  60  65  70
B       30  44  46  48  50  52  54  56  70
C       46  47  48  49  50  51  52  53  54

The three groups have the same mean and median (both equal 50), but there is clearly a difference. Which group appears to be more variable? Which is less variable?

Sample range: the difference between the largest and smallest observations.

Group   Sample range
A       40
B       40
C       8

Sample variance and sample standard deviation:

1. Deviations from the mean: the difference between an observation $x_i$ and the mean $\bar{x}$.

Group   Deviations from the mean
A       -20  -15  -10   -5    0    5   10   15   20
B       -20   -6   -4   -2    0    2    4    6   20
C        -4   -3   -2   -1    0    1    2    3    4

2. Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{S_{xx}}{n-1}$

3. Sample standard deviation: $s = \sqrt{s^2}$

Example 8 (cont'd):

Group   Squared deviations from the mean            S_xx    s^2     s
A       400  225  100   25    0   25  100  225  400   1500    187.5   13.693
B       400   36   16    4    0    4   16   36  400    912    114     10.677
C        16    9    4    1    0    1    4    9   16     60      7.5    2.739

4. An alternative formula for sample variance
Sum of squares: $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$

Sample variance: $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}$

Example 8 (cont'd), Group C:

i        1     2     3     4     5     6     7     8     9
x_i      46    47    48    49    50    51    52    53    54
x_i^2    2116  2209  2304  2401  2500  2601  2704  2809  2916

$\sum_{i=1}^{n} x_i = 450$, $\sum_{i=1}^{n} x_i^2 = 22560$

$s^2 = \frac{22560 - 450^2/9}{9-1} = \frac{22560 - 22500}{8} = 7.5$

Interquartile Range

Quartiles:
- Lower quartile (LQ or Q1): the median of the lower half of the data values; 25% of observations are smaller than this value.
- Upper quartile (UQ or Q3): the median of the upper half of the data values; 75% of observations are smaller than this value.
- If the sample size n is an odd number, the median is included in both halves.

Quartiles are defined differently in different books and software. You are expected to use the method given above!

Example: 1, 2, 3, 4, 5
Median = 3, Lower quartile = 2, Upper quartile = 4

Example: 1, 2, 3, 4, 5, 6
Median = 3.5, Lower quartile = 2, Upper quartile = 5

Interquartile range (IQR): the difference between the upper and lower quartiles (UQ - LQ).

Outliers: observations farther than 1.5 IQR from the closest quartile.
Extreme outliers: observations farther than 3 IQR from the closest quartile.

Example: 1, 2, 3, 4, 5, 6, 11
Median = 4, LQ = 2.5, UQ = 5.5, IQR = 3, [LQ - 1.5 IQR, UQ + 1.5 IQR] = [-2, 10], so 11 is an outlier.
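The two computations on this page, the shortcut formula for the sample variance and the quartile rule with its outlier fences, can be checked with a short Python sketch. The function names are hypothetical; the quartile function follows the halves method stated in these notes (median included in both halves when n is odd), which may differ from the defaults of statistical software.

```python
from statistics import median


def sxx_definition(xs):
    """Sum of squared deviations from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def sxx_shortcut(xs):
    """Same quantity computed as sum(x^2) - (sum x)^2 / n."""
    return sum(x * x for x in xs) - sum(xs) ** 2 / len(xs)


def sample_variance(xs):
    """Sample variance s^2 = S_xx / (n - 1)."""
    return sxx_shortcut(xs) / (len(xs) - 1)


def quartiles(xs):
    """Lower and upper quartiles by the halves method of these notes."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    lower = s[:half + n % 2]   # includes the median when n is odd
    upper = s[half:]           # includes the median when n is odd
    return median(lower), median(upper)


def outliers(xs, k=1.5):
    """Observations farther than k * IQR from the closest quartile."""
    lq, uq = quartiles(xs)
    iqr = uq - lq
    return [x for x in xs if x < lq - k * iqr or x > uq + k * iqr]


group_c = [46, 47, 48, 49, 50, 51, 52, 53, 54]
print(sxx_definition(group_c), sxx_shortcut(group_c), sample_variance(group_c))

data = [1, 2, 3, 4, 5, 6, 11]
print(quartiles(data), outliers(data), outliers(data, k=3))
```

Calling `outliers` with `k=3` applies the extreme-outlier rule; for the last example it returns an empty list, so 11 is an outlier but not an extreme one.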
4 Five-number summary and boxplot

Five-number summary: Min, Lower quartile, Median, Upper quartile, Max

Boxplot:
[Figure: boxplot with labels, from top to bottom: Max, Upper quartile, Median, Lower quartile, Min.]

Boxplot that shows the outliers:
[Figure: boxplot with labels, from top to bottom: Max, Max non-outlier, Upper quartile, Median, Lower quartile, Min non-outlier, Min.]

Example 7 (cont'd). n = 20 observations of lifetime (in hours) of an incandescent lamp:
612 623 666 744 883 898 964 970 983 1003
1016 1022 1029 1058 1085 1088 1122 1135 1197 1201

Min = 612, Max = 1201
Median = 1009.5
Lower quartile = 890.5
Upper quartile = 1086.5
IQR = 196
Outliers? [LQ - 1.5 IQR, UQ + 1.5 IQR] = [596.5, 1380.5] (no outliers)

[Figure: two boxplots, titled "Lamp lifetime data" (left) and "Lamp lifetime data with one added observation" (right).]

Side-by-side boxplot: helpful for comparing the distributions of data from multiple groups.

[Figure: side-by-side boxplots for Group 1 and Group 2.]
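The Example 7 figures (the trimmed means from Section 3.1 and the five-number summary above) can be reproduced with the following Python sketch. The function names are hypothetical, and the data list is an assumed reading of the Example 7 sample, chosen to agree with the summary values quoted in the notes. The quartiles use the halves method of these notes, not a software default.

```python
from statistics import median

# Lamp lifetimes in hours (assumed reading of the Example 7 data).
lamp = [612, 623, 666, 744, 883, 898, 964, 970, 983, 1003,
        1016, 1022, 1029, 1058, 1085, 1088, 1122, 1135, 1197, 1201]


def trimmed_mean(xs, proportion):
    """Drop the smallest and largest `proportion` of the observations and
    average the rest. Assumes proportion * n is a whole number."""
    s = sorted(xs)
    k = round(proportion * len(s))
    kept = s[k:len(s) - k]
    return sum(kept) / len(kept)


def five_number_summary(xs):
    """Return (min, lower quartile, median, upper quartile, max)."""
    s = sorted(xs)
    n = len(s)
    lower = s[:n // 2 + n % 2]   # median included in both halves if n is odd
    upper = s[n // 2:]
    return s[0], median(lower), median(s), median(upper), s[-1]


print("10% trimmed mean:", trimmed_mean(lamp, 0.10))
print("20% trimmed mean:", round(trimmed_mean(lamp, 0.20), 4))

low, lq, med, uq, high = five_number_summary(lamp)
iqr = uq - lq
fences = (lq - 1.5 * iqr, uq + 1.5 * iqr)
print("five-number summary:", (low, lq, med, uq, high))
print("IQR:", iqr, "fences:", fences)
print("outliers:", [x for x in lamp if x < fences[0] or x > fences[1]])
```

A trimmed mean with proportion 0 is the ordinary sample mean, and trimming more observations from each end moves the result toward the median, which is the sense in which it is a compromise between the two.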