Exploring Data. How to Explore Data


Deirdre Parker
5 years ago

Exploring Data

Statistics is the art and science of learning from data. This may include:
- Designing appropriate tools to collect data.
- Organizing data in a meaningful way.
  o Displaying data with appropriate graphs.
  o Summarizing data with numbers.
- Using data to draw conclusions and make predictions.

Data are information in context.

Individuals are the objects described by a set of data. They may be people, animals, or things.

A variable is any attribute that can take different values for different individuals.
- A categorical (or qualitative) variable assigns labels that place each individual into a particular group or category.
- A quantitative variable takes number values that are quantities (counts or measurements) for which it makes sense to find an average.
- Not every variable with a number value is quantitative! Examples: zip codes, ID numbers, grade levels (sometimes).

The distribution of a variable shows what values the variable takes and how often it takes each value. Distributions are summarized in tables and displayed in graphs.

How to Explore Data
- Begin by examining each variable by itself. Then move on to study relationships among the variables.
- Start with a graph or graphs. Then add numerical summaries.

Example: The following table shows information about several popular cell phone models.

Phone                      Operating System   Expandable Storage
Apple iPhone 6S Plus       iOS                No
Apple iPhone 6s            iOS                No
Apple iPhone 6             iOS                No
BlackBerry DTEK 50         Android            Yes
BlackBerry Priv            Android            Yes
BlackBerry Leap            BlackBerry         Yes
LG X Skin                  Android            Yes
LG G5 SE                   Android            Yes
LG G5                      Android            Yes
Microsoft Lumia 650        Windows            Yes
Microsoft Lumia 950        Windows            Yes
Microsoft Lumia 950 XL     Windows            Yes
Samsung Galaxy Note 7      Android            Yes
Samsung Galaxy On 7 Pro    Android            Yes
Samsung Galaxy S7 Edge     Android            Yes

[The numeric columns of the table, Screen Size (inches), Internal Storage (GB), Rear Camera (megapixels), and Battery Life (Talk Time) (hours), are not legible in this copy.]

a) Who/what are the individuals in this data set?
   Cell phone models.
b) What variables are measured? Identify each as categorical or quantitative. In what units were the quantitative variables measured?
   Operating system (categorical), screen size (quantitative, inches), amount of internal storage (quantitative, GB), whether or not it has expandable memory (categorical), rear camera resolution (quantitative, megapixels), battery life (quantitative, hours).
c) Give the distributions of the following for the data set: screen size, internal storage, and presence of expandable memory.
   [Tables: counts for each screen size (inches) and each internal storage size (GB); the values are not legible in this copy.]

   Expandable Memory?   Percent
   No                   20%
   Yes                  80%

Analyzing Categorical Data

The values of a categorical variable are labels for the categories, such as male or female. The distribution of a categorical variable gives the categories and either the count or proportion of individuals who fall into each category.

Proportion: The fraction of the total that possesses a certain attribute. Proportions can be expressed as fractions, decimals, or percentages.
Frequency: The number (count) of individuals in each category.
Relative Frequency: The proportion of individuals in each category.

Often, we organize categorical data into either a frequency table or a relative frequency table. (These are sometimes called frequency distributions and relative frequency distributions.)

Example: The following is a frequency table showing the distribution of responses to the question, "How do you eat corn on the cob?" Find the relative frequency distribution.

How do you eat corn on the cob?   Frequency   Relative Frequency
In rows                           28          28/41 = 0.683
In circles                        4           4/41 = 0.098
Bite wherever                     5           5/41 = 0.122
I don't eat corn on the cob       2           2/41 = 0.049
Cut the corn off the cob          2           2/41 = 0.049
Total                             41
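The relative frequency calculation for the corn example can be sketched in a few lines of Python. This is only an illustration: the dictionary name `frequency` is made up here, and the counts (28, 4, 5, 2, 2) are assumed from the frequency table, where they add up to the stated total of 41.

```python
from fractions import Fraction

# Counts assumed from the corn-on-the-cob frequency table (41 responses).
frequency = {
    "In rows": 28,
    "In circles": 4,
    "Bite wherever": 5,
    "I don't eat corn on the cob": 2,
    "Cut the corn off the cob": 2,
}

total = sum(frequency.values())

# Relative frequency = count in the category / total count.
relative_frequency = {
    category: Fraction(count, total) for category, count in frequency.items()
}

for category, share in relative_frequency.items():
    print(f"{category:<30}{frequency[category]:>3}   {float(share):.3f}")
```

Using `Fraction` keeps the proportions exact, so they add up to exactly 1 before being rounded for display.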

Categorical data is often displayed using bar graphs, pie charts, and segmented bar graphs.
- A bar graph shows each category as a bar. The heights of the bars correspond to the frequencies or relative frequencies of the categories.
- A pie chart shows each category as a sector or slice of a circle or pie. The areas of the slices are proportional to the category frequencies or relative frequencies.
- A segmented bar graph displays the distribution of a categorical variable as a single bar divided into segments. The height of each segment corresponds to the proportion of individuals in the category it represents. Segmented bar graphs use relative frequencies on the vertical axis.

Bar Graph Procedure:
1. Draw and label the axes. Put the name of the categorical variable under the horizontal axis. To the left of the vertical axis, indicate whether the graph shows the frequency (count) or relative frequency (proportion) of individuals in each category.
2. Scale the axes. Write the names of the categories at equally spaced intervals under the horizontal axis. On the vertical axis, start at 0, and place tick marks at equal intervals until you exceed the highest frequency or relative frequency of any category.
3. Draw bars above the category names. Make sure the bars are equal in width and leave gaps between them. The height of each bar should correspond to the frequency or relative frequency of the individuals in that category.

Pie Chart Procedure:
1. Draw a circle to represent the entire data set.
2. Calculate the size of the central angle for each slice: slice size = 360° × relative frequency of category.
3. Divide the circle into slices with the appropriate central angles. Use a protractor (or computer) to do this.
4. Label the slices appropriately!

Example: Draw a well-labeled bar graph and a well-labeled segmented bar graph of the corn data from the previous example.
[Figures: a bar graph of relative frequency by method of corn eating (Rows, Circles, Bite Wherever, Don't Eat, Cut Off), and a segmented bar graph of method of corn eating with a vertical axis running from 0% to 100%.]
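Step 2 of the pie chart procedure (central angle = 360° × relative frequency) can be sketched as follows. The function name `central_angles` is hypothetical, and the corn counts are assumed from the earlier frequency table (28, 4, 5, 2, 2, total 41).

```python
def central_angles(frequencies):
    """Return the central angle in degrees of each pie slice."""
    total = sum(frequencies.values())
    # Relative frequency is count / total; the slice angle is 360 times that.
    return {category: 360 * count / total
            for category, count in frequencies.items()}

# Counts assumed from the corn-on-the-cob example.
corn = {"Rows": 28, "Circles": 4, "Bite Wherever": 5, "Don't Eat": 2, "Cut Off": 2}
angles = central_angles(corn)

for category, angle in angles.items():
    print(f"{category:<15}{angle:6.1f} degrees")
```

Because the relative frequencies of a single whole add to 1, the angles add to 360°, which is a quick check that no category was left out.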


Bar graphs can be used in more situations than pie charts and segmented bar graphs! Pie charts and segmented bar graphs can only be used in situations when the data includes all parts of a single whole!
o Bar graphs can compare proportions of different groups who share some trait. For example, what proportions of sophomores, juniors, and seniors approve of Bingham's parking policy? A pie chart or segmented bar graph couldn't show this, because these proportions are not parts of the same whole.
o Bar graphs can compare proportions in cases where individuals might fall into multiple categories. For example, what percent of students like pizza, what percent like spaghetti, and what percent like pancakes? Students could easily fall into multiple categories, so the percentages would add up to more than 100%. This data couldn't be displayed on a pie chart or segmented bar graph, but could still be displayed on a bar graph.
o Bar graphs can be used in cases where information is missing. For example, we might know what category some of the individuals fall into, but not others. To display this kind of data in a pie chart or segmented bar graph, it would be necessary to add an "other" category.

Deceptive Graphs:
- Watch out for graphs in which the width changes in addition to the height. The eye responds to area, so this makes the graph misleading. This happens a lot in pictographs.
- Watch out for graphs where the axes don't start at zero (and/or are missing).

- Watch out for unequally spaced intervals.
- Watch out for pie charts or segmented bar graphs where the percentages don't add to 100%. This is a tip-off that they don't represent all the parts of a single whole.
- Watch out for 3D graphs or graphs set at an angle. This distorts the data.

[Figure: a 3D pie chart titled "Perception of 3D Pie Charts" with slices labeled Cool, Confusing, Misleading, and Unreadable.]

A two-way table (or contingency table) summarizes the relationship between two categorical variables for some group of individuals. The rows represent values of one variable and the columns represent values of the other variable.

A marginal relative frequency gives the percent or proportion of individuals that have a specific value for one categorical variable (ignoring the information about the other variable). It is calculated by taking a total from a margin of the table and dividing by the overall total number of individuals.

A marginal distribution gives the marginal relative frequencies for each of the values of a categorical variable.

Example: AP Statistics students were categorized according to their gender and how they like their bacon cooked. The results are given below. Calculate the marginal distribution of bacon preferences. Draw a graph of the results. Describe what you see.

                     Gender
Bacon Preference     Female   Male   Total
A Little Limp        6        4      10
Crispy               8        8      16
Extra Crispy         4        3      7
Don't Eat Bacon      7        1      8
Total                25       16     41

Bacon Preference     Marginal Relative Frequency
A Little Limp        10/41 = 0.244
Crispy               16/41 = 0.390
Extra Crispy         7/41 = 0.171
Don't Eat Bacon      8/41 = 0.195
Total                41/41 = 1

[Figure: bar graph of relative frequency by bacon preference (A Little Limp, Crispy, Extra Crispy, Don't Eat Bacon).]

The most popular way to eat bacon is crispy. About 39% of the students in the sample like their bacon this way. About 24% of students like their bacon a little limp, making this the second-most popular way to eat bacon. About 20% of students don't eat bacon. The least popular way to eat bacon is extra crispy. Only 17.1% of students like their bacon extra crispy.

We can also answer questions involving both categorical variables.

A joint relative frequency is an "and" relative frequency. It gives the proportion of individuals that fall in a specific category of one variable and a specific category of another variable. Joint relative frequencies are proportions of the overall total.

Example: What proportion of the students in the sample are males and like their bacon extra crispy?
3/41 = 7.3%

Example: What percent of students in the sample are females who don't eat bacon?
7/41 = 17.1%

To examine the relationships between variables, we need to calculate some well-chosen proportions from the counts in the table.

A conditional relative frequency gives the proportion of individuals with a specific value of one categorical variable among individuals who share a specific value of another categorical variable (the condition).

Example: What percent of the females in the sample like their bacon a little limp?
6/25 = 24%

Example: What proportion of the people who like their bacon crispy are female?
8/16 = 50%
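The three kinds of relative frequency differ only in which total goes in the denominator. The sketch below makes that explicit; the function names are invented for illustration, and the counts are assumed from the bacon example (25 females and 16 males, 41 students in all).

```python
# Two-way table of counts, assumed from the bacon example.
table = {
    "A Little Limp":   {"Female": 6, "Male": 4},
    "Crispy":          {"Female": 8, "Male": 8},
    "Extra Crispy":    {"Female": 4, "Male": 3},
    "Don't Eat Bacon": {"Female": 7, "Male": 1},
}

grand_total = sum(sum(row.values()) for row in table.values())

def marginal(preference):
    """Row total divided by the overall total."""
    return sum(table[preference].values()) / grand_total

def joint(preference, gender):
    """One cell divided by the overall total (an 'and' proportion)."""
    return table[preference][gender] / grand_total

def conditional_on_gender(preference, gender):
    """One cell divided by that gender's column total."""
    column_total = sum(row[gender] for row in table.values())
    return table[preference][gender] / column_total

def conditional_on_preference(gender, preference):
    """One cell divided by that preference's row total."""
    return table[preference][gender] / sum(table[preference].values())

print(f"{marginal('Crispy'):.3f}")                             # marginal
print(f"{joint('Extra Crispy', 'Male'):.3f}")                  # joint
print(f"{conditional_on_gender('A Little Limp', 'Female'):.2f}")
print(f"{conditional_on_preference('Female', 'Crispy'):.2f}")
```

The last two calls use the same kind of cell count but divide by different totals, which is why "percent of females who like crispy bacon" and "percent of crispy-bacon fans who are female" generally differ.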

Question: Is either of the above conditional relative frequencies misleading? Why?
Hearing that 50% of the people who like their bacon crispy are female and 50% are male makes you think that males and females are equally likely to like bacon crispy. However, this is not true because the number of females in the sample is much higher than the number of males. In reality, only 8/25 = 32% of the females like their bacon crispy, while 8/16 = 50% of the males like their bacon crispy.

A conditional distribution gives the conditional relative frequencies for each of the values of a categorical variable among individuals with a specific value of another categorical variable.

Example: Using the data above, calculate the conditional distribution of bacon preference for each gender. (This means figure out what proportion of girls like their bacon each way and what proportion of boys like their bacon each way.)

Bacon Preference     Female         Male
A Little Limp        6/25 = 24%     4/16 = 25%
Crispy               8/25 = 32%     8/16 = 50%
Extra Crispy         4/25 = 16%     3/16 = 18.75%
Don't Eat Bacon      7/25 = 28%     1/16 = 6.25%
Total                100%           100%

To compare the conditional distributions of a categorical variable, we use side-by-side bar graphs (or comparative bar graphs). These display the distribution of a categorical variable for each value of another categorical variable. The bars are grouped together based on the values of one of the categorical variables and multiple distributions are placed side by side. Color-coding or keys are often used.

There is an association (or relationship) between two variables if knowing the value of one variable helps us predict the value of the other. If knowing the value of one variable does not help us predict the value of the other, then there is no association between the variables.
- If the values of one variable are really different for different values of the other variable, then there is an association between the variables.
- If the values of one variable are really similar for different values of the other variable, then there isn't an association between the variables.
- Do not use the word "correlation" when you mean association. Correlation has a very specific meaning in statistics, which we will talk about later in the year.

Example: Draw a side-by-side bar graph comparing the bacon preferences of males and females. Use relative frequencies for the vertical axis. Then draw a segmented bar graph for each gender. Describe what you see. Does there appear to be an association between gender and bacon preference? Explain.

[Figures: a side-by-side bar graph of relative frequency by bacon preference for females and males, and a segmented bar graph (0% to 100%) of bacon preference for each gender.]

There is a definite association between gender and bacon preference. Specifically, females are much more likely than males to not eat bacon (28% of females don't eat bacon compared to only 6.25% of males). Also, males are more likely than females to prefer their bacon crispy (50% of males prefer their bacon crispy compared to 32% of females). Similar proportions of males and females prefer their bacon a little limp and extra crispy.
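One way to look for an association numerically is to compute the conditional distribution for each group and compare them category by category. The following is a rough sketch using counts assumed from the bacon example; the helper name `conditional_distribution` is hypothetical.

```python
# Counts by gender, assumed from the bacon example.
counts = {
    "Female": {"A Little Limp": 6, "Crispy": 8, "Extra Crispy": 4, "Don't Eat Bacon": 7},
    "Male":   {"A Little Limp": 4, "Crispy": 8, "Extra Crispy": 3, "Don't Eat Bacon": 1},
}

def conditional_distribution(group_counts):
    """Proportion of the group falling in each category."""
    total = sum(group_counts.values())
    return {category: n / total for category, n in group_counts.items()}

female = conditional_distribution(counts["Female"])
male = conditional_distribution(counts["Male"])

# Difference in proportions (female minus male) for each preference.
differences = {pref: female[pref] - male[pref] for pref in female}

for pref, diff in differences.items():
    print(f"{pref:<16} female {female[pref]:6.2%}   male {male[pref]:7.2%}   diff {diff:+.2%}")
```

Large gaps in some rows (here "Don't Eat Bacon" and "Crispy") are what the side-by-side bar graph shows visually; this sketch only describes the sample and is not a formal test of association.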


Displaying Quantitative Data with Graphs

One of the most common parts of a statistical problem is finding an appropriate way to display data. Quantitative data can't be displayed the same way as categorical data (bar graphs and pie charts don't work). The most common ways to display quantitative data are dotplots, stemplots, histograms, and boxplots.

How to Examine the Distribution of a Quantitative Variable
- Describe the overall pattern of a distribution by describing its shape, center, and variation.
- Point out any outliers (unusually small or unusually large data values).
- Always put your descriptions in context!

Describing Shape:
- How many peaks does the distribution have? Don't count minor ups and downs, only major peaks. Ask yourself if there are distinct groups of individuals visible in the graph.
  o Unimodal: One peak (group).
  o Bimodal: Two peaks (groups).
  o Multimodal: Three or more peaks (groups).
- If there are any major gaps between groups, describe their locations.
- Is the distribution approximately symmetric or skewed?
  o If the right and left sides of the graph are close to mirror images of each other, describe the distribution as approximately symmetric. Always use the words "approximately" or "roughly," because in real life, distributions of data are almost never perfectly symmetric.
  o If the right side of the graph is much longer than the left side (tail to the right), describe the distribution as skewed to the right, skewed to positive values, or positively skewed.
  o If the left side of the graph is much longer than the right side (tail to the left), describe the distribution as skewed to the left, skewed to negative values, or negatively skewed.

Describing Center: Use the median (middle value) or the mean (average).

Describing Variation: Use the range, interquartile range, or standard deviation, or say something like, "The [values in context] vary from a low of ___ to a high of ___."

Dotplots:
1. Draw a horizontal line, label it with the name of the quantitative variable and the units of measurement, and place tick marks at equal intervals.
2. Locate each value in the data set along the measurement scale and represent it by a dot above the line. If there are two or more observations with the same value, stack the dots vertically. Try to make all the dots the same size and space them out equally as you stack them.

To compare two distributions, stack the dotplots on top of each other, using the same scales. Make sure to label the two groups being compared.

Example: Below is a dotplot of the hair lengths of 41 AP Statistics students. Describe the distribution of hair length.

Shape: The distribution of hair lengths has multiple peaks. There is one group of students with shorter hair (peak at 7 cm) and another group with longer hair (peak at 41 cm). There are no students with hair between 18 and 26 cm long.
Center: The median hair length is about 40 cm. (Half of the students have hair shorter than this and half have hair longer.)
Variability: The hair lengths vary from 1 cm to 68 cm (range = 67 cm).
Outliers: There don't appear to be any outliers.

Here are parallel dotplots showing the hair lengths of the students sorted by gender. Compare the distributions of hair length for the male and female students.

Shape: The distribution of hair lengths for the females is approximately symmetric, while the distribution of hair length for males is slightly skewed to the right, meaning that shorter hair is more common than longer hair for the males. Both distributions have single peaks. There is a peak around 7 cm for the males and a peak around 41 cm for the females.
Center: Females in the sample typically have much longer hair (median 47 cm) than the males (median 7 cm).
Variability: There is much more variability in hair length for the females than for the males. The hair lengths for the females vary from 9 cm to 68 cm (range 59 cm), while the hair lengths for the males vary from 1 cm to 17 cm (range 16 cm).
Outliers: There do not appear to be any outliers for the males, but the female with hair that is 9 cm long is an outlier. Her hair is unusually short compared to the rest of the females in the sample.

Stemplots (or Stem-and-Leaf Plots):

Each number in the data set is broken into two pieces: a stem and a leaf. The stem is the first part of the number and consists of the beginning digits. The leaf is the last part of the number and consists of the final digit(s).
1. Choose stems (one or more of the leading digits) that divide the data into a reasonable number of groups (at least 5, but not too many). List possible stem values (not just those that actually appear in the data set; don't skip stems) in a vertical column. Draw a vertical line to the right of the stems.
2. The next digit(s) after the stem become(s) the leaf. List the leaf for every observation to the right of the corresponding stem.
3. Include a key explaining what the stems and leaves represent, e.g., "2 | 5 represents 2.5 seconds."

It is common to round and/or truncate (leave off) the remaining digits. For example, in a stemplot of annual salary, we might represent $35,236 as 352 | 3, 352 | 4, or as 3 | 5, depending on our data set.

If necessary, consider using split stems. Write each stem more than once, and assign the lower group of leaves to the first stem and the higher group of leaves to the next. For example, put the leaves 0-4 with the first stem and the leaves 5-9 with the second. If you do this, be sure that each stem is assigned an equal number of possible leaf digits (two stems, with five possible leaves each; or five stems, with two possible leaves each).

To compare two groups, make a back-to-back stemplot. Use the same set of stems and write the leaves for one group to the right and for the other group to the left. Be sure to label each side to indicate which group is being represented.

Example: The data below shows the number of pairs of shoes owned for male and female AP Statistics students. Make a back-to-back stemplot of the data using split stems. Comment on the main differences between the two data sets.

[Data table and back-to-back stemplot of Number of Pairs of Shoes, with Female on one side and Male on the other; the values are not legible in this copy. Key: 1 | 5 represents 15 pairs of shoes.]

Shape: Both distributions of the number of pairs of shoes owned are unimodal. The distribution for males is very slightly skewed to higher numbers, while the distribution for females is strongly skewed to higher numbers. This means that for both genders, it is more common to own a small number of shoes than a large number.
Center: Females tend to own a larger number of shoes, on average, than males (median = 15 pairs for females vs. 6 pairs for males).
Variability: There is more variability in the number of pairs of shoes owned for females than for males (range = 81 for females vs. 9 for males).
Outliers: There do not appear to be any outliers for the males, but the females who own 64 and 87 pairs of shoes both own many more pairs of shoes than the rest of the females in the sample.
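The split-stem procedure can be sketched in code. This is a minimal illustration for non-negative whole numbers, with the tens digit as the stem and the ones digit as the leaf; the function name `split_stemplot` and the data values are hypothetical, not the class's shoe data.

```python
from collections import defaultdict

def split_stemplot(values):
    """Return the lines of a stemplot with each stem split in two:
    leaves 0-4 on the first line, leaves 5-9 on the second."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[(stem, leaf >= 5)].append(leaf)

    first, last = min(rows), max(rows)
    lines = []
    stem, upper = first
    # Walk through every split stem between the first and last one used,
    # so that empty stems are printed rather than skipped.
    while (stem, upper) <= last:
        leaves = "".join(str(leaf) for leaf in rows.get((stem, upper), []))
        lines.append(f"{stem:>2} | {leaves}")
        stem, upper = (stem, True) if not upper else (stem + 1, False)
    return lines

# Hypothetical numbers of pairs of shoes.
pairs_of_shoes = [4, 5, 7, 10, 12, 15, 15, 22, 38]
lines = split_stemplot(pairs_of_shoes)
print("\n".join(lines))
print("Key: 1 | 5 represents 15 pairs of shoes")
```

Printing the empty stems (here the second "2" stem and the first "3" stem) preserves gaps in the distribution, which is the reason the procedure says not to skip stems.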

Histograms:
1. Divide the range of the data into intervals of equal width. The intervals are called "bins."
   - The low value in each bin is included in the bin, but the high value is not. For example, the bins might be 0 to < 3, 3 to < 6, 6 to < 9, etc.
   - If the data are discrete (the observations take only whole number values) and are tightly packed, the bins are usually centered at the integer values with a width of one unit, so the rectangle for 1 is centered at 1 (0.5 to < 1.5), the rectangle for 2 is centered at 2 (1.5 to < 2.5), etc.
   - There are no set-in-stone rules for how many bins to use (5 to 10 is a common number), but it may be a good idea to see what the graph looks like with different width bins. It can change quite a bit!
2. Find the frequency (count) or relative frequency (proportion) of individuals in each interval. Put values that fall on a boundary in the interval containing larger values.
3. Label and scale your axes. Place equally spaced tick marks at the boundaries of each interval along the horizontal axis (or in the middle of each interval if the data are discrete). Use either frequency (count) or relative frequency (proportion) on the vertical axis.
4. Draw a rectangle for each interval. Make the bars equal width and leave no gaps between them. The height should correspond to the frequency or relative frequency of individuals in that interval.

Histograms and bar graphs are different!
o Bar graphs are used for categorical data. Histograms are used for quantitative data.
o The bars in bar graphs can be rearranged because the order of the categories shouldn't matter. The bars in histograms can't be rearranged because intervals must be in numerical order.
o The bars in bar graphs are generally unconnected. The bars in histograms are connected.

Example: The following data gives the average points scored per game (PTSG) for the 30 NBA teams in the regular season. Draw two relative frequency histograms using different bin widths. Describe the distribution.

[Data table and two histograms of Points per Game (vertical axis: Frequency), drawn with two different bin widths; the data values and bin counts are not legible in this copy.]

Shape: The distribution of points scored per game is single-peaked and skewed to the right. It is more common for teams to score a smaller number of points than a larger number of points.
Center: The median number of points scored per game last season was the mean of the two middle values, ( ___ + ___ )/2 = ___.
Variability: The number of points scored per game varied from 98.8 to 113.5 (range = 14.7 points).
Outliers: There do not appear to be any outliers.
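The binning rule in steps 1 and 2 (low endpoint included, high endpoint excluded, boundary values go into the higher bin) can be sketched as below. The function name `bin_counts` is made up, the sketch assumes whole-number data with a whole-number bin width, and the values are hypothetical, not the NBA data.

```python
def bin_counts(values, start, width):
    """Count the values in bins [start, start + width), [start + width, ...), ...

    Assumes integer values, an integer width, and start <= min(values).
    """
    n_bins = (max(values) - start) // width + 1
    counts = {(start + i * width, start + (i + 1) * width): 0
              for i in range(n_bins)}
    for v in values:
        i = (v - start) // width      # a boundary value lands in the higher bin
        counts[(start + i * width, start + (i + 1) * width)] += 1
    return counts

# Hypothetical whole-number scores.
scores = [98, 99, 100, 100, 101, 103, 104, 104, 107, 113]

narrow = bin_counts(scores, start=98, width=2)
wide = bin_counts(scores, start=98, width=4)

for (low, high), count in narrow.items():
    print(f"{low} to < {high}: {'*' * count}")
```

Running the same data through both bin widths shows how much the apparent shape can change: the narrow bins reveal an empty stretch from 108 to < 112 that the wide bins smooth over.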


Describing Quantitative Data with Numbers

Population: The entire collection of individuals or objects that you want to learn about.
Sample: A part of the population that is selected for study.
Resistant Measure: A measure that is not influenced very much by strong skewness or extreme values.

Measures of Center: The most common measures of center are the mean and the median.

Mean: The sum of the values divided by the number of observations.
- If the n observations in a sample are $x_1, x_2, \dots, x_n$, the mean is $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{\sum x_i}{n}$.
- The mean can be thought of as the average value, the "fair share" value, or the balance point of a distribution.
- The mean is not a resistant measure. It is very sensitive to outliers and skewness.
- The mean of a sample is abbreviated $\bar{x}$ (pronounced "x-bar") and the mean of a population is abbreviated μ (the Greek letter mu, pronounced "myoo"). They are both calculated the same way. The distinction will be important later in the year. If the problem doesn't specify whether the data represent a population or a sample, assume you are dealing with a sample and use $\bar{x}$.

Median (M): The midpoint of a distribution. Half of the observations are smaller than the median and half of the values are larger than the median. To find the median:
1. Put the n observations in order from smallest to largest.
2. If the number of observations, n, is odd, the median is the middle observation of the ordered list.
3. If the number of observations, n, is even, the median is the average (mean) of the two middle observations in the ordered list.
- The median can be thought of as the "typical" value of a variable.
- The median is a resistant measure. It is not changed greatly by strong skewness or outliers.

Comparing the Mean and the Median: The mean and median of a roughly symmetric distribution are close together. If the distribution is exactly symmetric, they are equal. However, outliers and other extreme values drag the mean toward them without having much effect on the median. As a result, in skewed distributions, the mean will be further out in the long tail than is the median.

Example: Here are the amounts of fat (in grams) in McDonald's beef sandwiches. Make a stemplot of the distribution and comment on its shape. Then calculate the mean and the median amount of fat.

Sandwich                              Fat (g)
Hamburger                             9
Cheeseburger                          12
Double Cheeseburger                   23
McDouble                              19
Quarter Pounder                       19
Quarter Pounder with Cheese           26
Double Quarter Pounder with Cheese    42
Big Mac                               29
Big N' Tasty                          24
Big N' Tasty with Cheese              28
McRib                                 26
Mac Snack Wrap                        19
Angus Bacon & Cheese                  39
Angus Deluxe                          39
Angus Mushroom & Swiss                40

Grams of Fat
0 | 9
1 | 2 9 9 9
2 | 3 4 6 6 8 9
3 | 9 9
4 | 0 2
Key: 1 | 2 represents 12 fat grams

The distribution of fat content is unimodal and approximately symmetric, so we would expect the median to be close to the mean.
Median = 26 grams
Mean = (9 + 12 + 23 + ... + 40)/15 = 394/15 = 26.3 grams

Example: Forty students were enrolled in a statistical reasoning course at a California college. The instructor made course materials, grades, and lecture notes available to students on a class web site, and course management software kept track of how often each student accessed any of these web pages. One month after the course began, the instructor requested a report of how many times each student had accessed a class web page. The 40 observations are below. Wasn't it nice of me to put them in order?

Here is a dotplot of the data. Describe the distribution. Based on the graph, do you expect the mean or the median to be higher? Calculate the mean and the median to see if you were right. Which measure would be the best choice to describe center in this situation?

[Data list and dotplot of Number of Visits to Class Website; the 40 data values and the worked median and mean calculations are not legible in this copy.]

The distribution is unimodal and extremely skewed to the right. Most students accessed the website between ___ and ___ times. The students who accessed the website 84 and 331 times are possible outliers. Since the distribution is so skewed and has high outliers, the mean will be pulled towards the high values, and will be much higher than the median. The median will be more representative of the class as a whole.
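The mean and median for the sandwich example can be checked with Python's standard `statistics` module. The fat values below are assumed from the sandwich table (they add up to the 394 used in the mean calculation); the last part changes one value to a made-up extreme purely to illustrate resistance.

```python
from statistics import mean, median

# Fat (g) for the 15 beef sandwiches, assumed from the table.
fat_grams = [9, 12, 23, 19, 19, 26, 42, 29, 24, 28, 26, 19, 39, 39, 40]

total_fat = sum(fat_grams)
mean_fat = mean(fat_grams)
median_fat = median(fat_grams)
print(f"sum = {total_fat}, mean = {mean_fat:.1f} g, median = {median_fat} g")

# Hypothetical: replace the 42 g sandwich with an extreme 142 g one.
with_outlier = [142 if x == 42 else x for x in fat_grams]
mean_out = mean(with_outlier)
median_out = median(with_outlier)
print(f"with outlier: mean = {mean_out:.1f} g, median = {median_out} g")
```

In the altered list the mean rises by several grams while the median does not move, which is the sense in which the median is resistant and the mean is not.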

Measures of Variability: Numbers that describe how spread out the data are. The most common are the range, the interquartile range, and the standard deviation.

Range: The difference between the maximum and minimum values.

Standard Deviation: The most common measure of spread is the standard deviation. It measures the typical or average distance of the observations from the mean.

Example: Each of these distributions has a mean of 5. Rank the standard deviations from lowest to highest. Explain your answer.

[Figure: three distributions, labeled in the answer key as follows.]
- Highest standard deviation: The typical distance to the mean is the highest.
- Lowest standard deviation: The typical distance to the mean is the lowest.
- Middle.

The formula for standard deviation is slightly different depending on whether you have all the data for the entire population or are dealing with a sample from the population.

For a Sample: If the n observations in a sample are $x_1, x_2, \dots, x_n$, and the mean is $\bar{x}$, the standard deviation is given by:

$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

The sample standard deviation is abbreviated $s$.

Variance: The square of the standard deviation is called the variance, abbreviated $s^2$.

For the Population: The standard deviation of a population of size N with mean μ and observations $x_1, x_2, \dots, x_N$ is given by:

$$\sigma = \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \dots + (x_N - \mu)^2}{N}} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

The population standard deviation is abbreviated σ (the Greek letter sigma). The population variance is abbreviated $\sigma^2$.

The reason that we divide by n - 1 in a sample is complicated. We'll discuss it later in the year. Always use $s$ rather than σ unless you know that the data represent the entire population, which is rare!

Calculating the standard deviation by hand:
1. Calculate the mean, $\bar{x}$.
2. Find the distance of each observation from the mean (the deviations).
3. Square each of these distances to eliminate negative numbers.
4. "Average" the squared distances by adding them together and dividing by n - 1. This gives the variance, $s^2$.
5. Take the square root of the variance to get the standard deviation, $s$.
6. Interpret your result. The standard deviation is the average or typical distance of the observations from the mean.

Example: The table below shows the sugar content in several types of candy bar. Find the mean and standard deviation of the data. Interpret your result in context.

Candy Bar                      Sugar (grams) x_i   Deviation (x_i - x̄)   Squared Deviation (x_i - x̄)^2
Hershey's Milk Chocolate       31                  2                      4
Kit Kat                        22                  -7                     49
York Peppermint Pattie         __                  __                     __
Reese's Peanut Butter Cups     __                  __                     __
Snickers                       __                  __                     __
Milky Way                      __                  __                     __
Twix                           __                  __                     __
3 Musketeers                   __                  __                     __
Mr. Goodbar                    22                  -7                     49
Baby Ruth                      __                  __                     __
Total                          290                 0                      312

Mean: $\bar{x} = \frac{290}{10} = 29$ grams
Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{312}{10 - 1} = 34.67$ (in squared grams)
Standard Deviation: $s = \sqrt{s^2} = \sqrt{34.67} = 5.89$ grams

The sugar contents of the individual candy bars typically differ from the mean sugar content by about 5.9 grams.

Properties of the Standard Deviation
- The standard deviation measures variation around the mean. It should only be used when the mean is chosen as the measure of center.
- The standard deviation is always greater than or equal to zero. If there is no variability (all observations have the same value), the standard deviation is zero. Larger standard deviations indicate greater variation from the mean.
- The standard deviation has the same units of measurement as the original observations. This is one reason we usually interpret the standard deviation and not the variance.
- The standard deviation is not a resistant measure. A few outliers can change its value dramatically.
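The by-hand steps translate directly into code. The sketch below follows steps 1 through 5 on a small hypothetical data set (not the candy bar data, since most of those values are unavailable) and compares the result with `statistics.stdev`, which also divides by n - 1. The last line only redoes the final arithmetic of the candy bar example from its totals.

```python
import math
from statistics import stdev

def sample_standard_deviation(values):
    n = len(values)
    x_bar = sum(values) / n                       # step 1: mean
    deviations = [x - x_bar for x in values]      # step 2: deviations
    squared = [d ** 2 for d in deviations]        # step 3: squared deviations
    variance = sum(squared) / (n - 1)             # step 4: divide by n - 1
    return math.sqrt(variance)                    # step 5: square root

# Hypothetical observations, chosen so the mean is a whole number (5).
example = [2, 4, 4, 4, 5, 5, 7, 9]

by_hand = sample_standard_deviation(example)
library = stdev(example)
print(f"by hand: {by_hand:.3f}, statistics.stdev: {library:.3f}")

# Candy bar totals from the example: sum of squared deviations 312, n = 10.
candy_s = math.sqrt(312 / (10 - 1))
print(f"candy bar s = {candy_s:.2f} grams")
```

Step 6, the interpretation in context, is the part no code can do: the number only becomes useful once it is read as a typical distance from the mean in the units of the data.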

Interquartile Range (IQR): First, calculate the quartiles:
1. Arrange the data in increasing order and locate the median, M. (The median is sometimes called the second quartile, or Q2.)
2. The first quartile (Q1) is the median of all the observations lower than the median.
3. The third quartile (Q3) is the median of all the observations higher than the median.

The interquartile range is calculated as follows: IQR = Q3 - Q1
- The IQR is the range of the middle 50% of the data.
- The range and interquartile range are numbers! Don't say "The range is 5 to 30." In that case, the range would be 25.
- The IQR is not a location! It doesn't make sense to say an observation is "in the IQR."

1.5 × IQR Rule for Outliers: Any observation that falls more than 1.5 × IQR above the third quartile or below the first quartile is considered an outlier.

Always check for outliers and examine them closely! They may be errors, or they may tell you something important about your data that you need to pay attention to. Don't ignore them.

Boxplots (or Box and Whisker Plots):
1. Find the Five-Number Summary: Minimum, Q1, M, Q3, Maximum.
2. Check for outliers. You must always show this step.
   - Calculate the IQR.
   - Find Q1 - (1.5 × IQR) and Q3 + (1.5 × IQR).
   - If you have any data points outside these thresholds, they are outliers.
3. Draw the boxplot:
   - Draw a central box from Q1 to Q3.
   - Draw a vertical line in the box to mark the median.
   - Draw the "whiskers": lines extending from the box out to the smallest and largest observations that are not outliers.
   - Mark outliers with dots in the appropriate locations.

Each section of a boxplot contains 25% of the data.
- The lower quartile is higher than 25% of the data.
- The median (or second quartile) is higher than 50% of the data.
- The upper quartile is higher than 75% of the data.

Boxplots are useful for comparing the center and spread of distributions, but you have to be careful with them. They can mask important information about the shape of a distribution. For instance, you can't tell from a boxplot if a distribution has multiple peaks or gaps.
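The quartile procedure and the outlier check can be sketched as follows. The function names and data are hypothetical. The sketch takes "observations lower than the median" to mean the lower half of the ordered list by position, leaving out the middle observation when n is odd; other quartile conventions exist and can give slightly different answers.

```python
from statistics import median

def five_number_summary(values):
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                 # observations below the median
    upper = data[half + 1:] if n % 2 else data[half:]   # observations above the median
    return data[0], median(lower), median(data), median(upper), data[-1]

def find_outliers(values):
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Hypothetical data set with one unusually large value.
data = [1, 3, 4, 6, 7, 8, 9, 10, 12, 40]

summary = five_number_summary(data)
flagged = find_outliers(data)
print("Min, Q1, M, Q3, Max:", summary)
print("Outliers:", flagged)
```

When drawing the boxplot from this output, the upper whisker would stop at 12, the largest observation that is not flagged, and 40 would be marked with a dot.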

Example: The data below shows the number of text messages sent by a random sample of students in a day. Draw parallel boxplots of the number of texts sent for male and female students. You must show how you determined whether there are outliers. Compare the distributions. What conclusions can you draw about the texting habits of males and females?

[Data lists for Male and Female students; the individual values are not legible in this copy.]

Male
Min = 3, Q1 = 8, Med = 17, Q3 = 42.5, Max = 111
IQR = 42.5 - 8 = 34.5
Q1 - 1.5(IQR) = 8 - 1.5(34.5) = -43.75
Q3 + 1.5(IQR) = 42.5 + 1.5(34.5) = 94.25
No low outliers because there are no numbers less than -43.75.
111 is an outlier because it is higher than 94.25.

Female
Min = 7, Q1 = 20, Med = 45, Q3 = 79, Max = 156
IQR = 79 - 20 = 59
Q1 - 1.5(IQR) = 20 - 1.5(59) = -68.5
Q3 + 1.5(IQR) = 79 + 1.5(59) = 167.5
No outliers because there are no numbers less than -68.5 or higher than 167.5.

[Figure: parallel boxplots for Males and Females; horizontal axis: # of Texts Sent in Past 24 Hours.]

The females in the sample text much more, on average, than the males (median = 45 for females and 17 for males). Since the median number of texts for females is higher than the third quartile for males, we can see that the top 50% of females text more than the bottom 75% of the males. Both distributions are skewed to the right, meaning that it is more common to send smaller numbers of texts than larger ones. There is more variability in the number of texts sent for females than for males (IQR = 59 for females and 34.5 for males). There is one outlier for the males. He sent 111 texts, which is unusually high. There were no outliers for the females.

Choosing Measures of Center and Spread:
- Use the median and IQR for describing a skewed distribution or a distribution with strong outliers.
- Use the mean and standard deviation for describing reasonably symmetric distributions without outliers.
- ALWAYS GRAPH YOUR DATA! Numerical measures of center and spread report specific facts about a distribution, but don't give information about its entire shape. You may miss something important if you don't graph the data.
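The outlier thresholds in the texting example can be re-derived from the five-number summaries alone. The helper name `fences` is hypothetical, and the quartile values (Q1 = 8 and Q3 = 42.5 for males, Q1 = 20 and Q3 = 79 for females) are assumed from the worked solution.

```python
def fences(q1, q3):
    """Return (lower, upper) outlier thresholds under the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Quartiles and extremes assumed from the worked texting example.
male_low, male_high = fences(q1=8, q3=42.5)
female_low, female_high = fences(q1=20, q3=79)

male_min, male_max = 3, 111
female_min, female_max = 7, 156

male_has_high_outlier = male_max > male_high
female_has_outlier = female_min < female_low or female_max > female_high

print(f"Male fences:   {male_low} to {male_high}")
print(f"Female fences: {female_low} to {female_high}")
print("Male maximum is an outlier:", male_has_high_outlier)
print("Female extremes include an outlier:", female_has_outlier)
```

Checking only the minimum and maximum is enough to decide whether any outliers exist, but identifying every outlier still requires comparing each observation with the fences.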
