Introduction to Statistical Data Analysis Lecture 1: Working with Data Sets

James V. Lambers
Department of Mathematics, The University of Southern Mississippi

Introduction
This course is an introduction to statistical data analysis. Its purpose is to acquaint students with fundamental techniques for gathering data, describing data sets, and, most importantly, drawing conclusions from data. Topics covered include probability, probability distributions, sampling, confidence intervals, hypothesis testing, correlation, and regression.

The R Project
To illustrate and work with the concepts and techniques presented in this course, we will use a software tool known as R, which provides a programming environment for statistical computing and graphics. It is freely available for download from the R Project's website. Throughout this course, as concepts are presented, relevant R functions and sample code will be given.

Topics: Descriptive Statistics, Inferential Statistics, Ethics in Statistics
Descriptive Statistics
The purpose of descriptive statistics is to summarize and display data in such a way that it can readily be interpreted. Examples of descriptive statistics are as follows:
- The average, or mean, is a convenient way of describing a set of many numbers with just a single number.
- A chart is useful for organizing and summarizing data in meaningful ways.

Example
Consider a list of test scores in a class with many students. The average of all of these test scores is approximately 72.5, which suggests that the overall performance of the class on the test was a C.

Example, cont'd
We can also gauge the overall performance of the class with a chart in which the scores are categorized according to their letter grade (assuming straight-scale letter grading). The chart lists each score range alongside the number of scores in that range, and it shows that the majority of the students earned C's or D's.

Inferential Statistics
The other, much more sophisticated branch of statistics is inferential statistics, which is used to make claims about an entire (large) population based on a (relatively small) sample of data. Related topics:
- Confidence intervals
- Hypothesis testing
- Goodness-of-fit tests
- Correlation and regression

Example
Suppose that a pollster wants to determine the percentage of all registered voters in California who would support a certain ballot measure. It would not be practical to question the entire population of these voters, as there are millions of them. Instead, the pollster would question a sample consisting of a reasonable number of these voters (for example, 200), and then use inferential statistics to draw a conclusion about the voting preference of the entire population from the data obtained from the sample.

The Distinction
The essential difference between descriptive and inferential statistics lies in the size of the population about which conclusions are being made.
- In descriptive statistics, conclusions are made about a relatively small population based on direct observations of every member of that population.
- In inferential statistics, conclusions are made about a relatively large population based on descriptive statistics applied to a small sample from that population.

Ethics in Statistics
The example of inferential statistics given above, concerning a pollster, can be expanded to illustrate important aspects of ethics in statistics. In order to draw sound conclusions about a large population, it is essential that a sample of that population be representative of that population; otherwise, the sample is said to be biased.

1936 Presidential Election
A well-known case of a biased sample occurred during the presidential election of 1936, in which a poll of a sample of voters was conducted in order to determine whether the majority would vote for Franklin D. Roosevelt, the Democratic candidate, or Alf Landon, the Republican candidate. The conclusion drawn from the poll was that Landon would win the election, when in fact Roosevelt won.

Where Did They Go Wrong?
The poll yielded an incorrect conclusion because telephone directories were used to obtain voter names, and in 1936, telephones existed primarily in more affluent households, which tended to vote Republican. That is, the method of polling led to an unintentional bias. In some cases, unfortunately, a sample can be biased intentionally, in order to reach a false conclusion that supports one's agenda.

Internet Polling
Just as telephone polling was problematic decades ago, internet polling is problematic today. It is very difficult to ensure that voters in an internet poll vote only once, and it is impossible to ensure that those who vote are actually representative of any given population. For this reason, such polls are generally labeled as unscientific, although this disclaimer is not always noted by those who read the results of such polls.

Worst Practices
Another example of a questionable or unethical use of statistics is the tactic of emphasizing differences through display. Suppose that over a period of three years, the average price of a home in a certain city has increased from $380,000 to $390,000 to $400,000. This data can be displayed in different ways to either emphasize or de-emphasize the increase.

Figure: Different approaches to displaying the same increase in home prices over a three-year period

Manipulation of Axes
Note that both charts display exactly the same data, but whereas the chart on the left uses a vertical scale that makes the yearly increase seem negligible, the chart on the right uses a vertical scale that makes this same increase seem much more dramatic. People who report statistics can, unfortunately, use tactics like this to subtly influence consumers of the information that they provide.
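The effect of the vertical scale can be reproduced with a short R sketch. The prices are those from the example; the year labels, axis limits, and variable names are illustrative assumptions, not the settings used for the original figure.

```r
# Average home prices from the example (the year labels are placeholders)
prices <- c(380000, 390000, 400000)
names(prices) <- c("Year 1", "Year 2", "Year 3")

# Total increase over the period, as a percentage of the starting price
pct_increase <- 100 * (prices[3] - prices[1]) / prices[1]

op <- par(mfrow = c(1, 2))   # draw two charts side by side
# Vertical axis starting at zero: the increase looks small
barplot(prices, ylim = c(0, 450000), main = "Axis from 0")
# Truncated vertical axis: the same increase looks dramatic
# (xpd = FALSE clips the bars to the plotting region)
barplot(prices, ylim = c(375000, 405000), xpd = FALSE,
        main = "Truncated axis")
par(op)                      # restore the previous layout
```

Both calls plot the same three values; only ylim differs. The overall increase is roughly 5.3%, which is worth stating alongside either chart.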

Topics: Data Sources, Levels of Measurement Scales
In this section, we discuss various approaches to data collection, and the ramifications of each. It is important to consider both the source of the data and the method of measurement used during its collection. First, we give some definitions.
- Data (singular: datum) are values assigned to observations that are made about a population.
- A parameter is a type of data that describes a characteristic of a population, such as the income level of every member of the labor force within a city.
- By contrast, a statistic is data that describes a characteristic of a sample, such as the favorite candy bar of every member of a focus group.
- Information is data transformed into useful facts, typically through inferential statistics.

Example
Suppose that a large corporation with hundreds of stores throughout the United States wants to determine the trend of its sales from year to year. The average revenue of all of its stores would be considered a parameter, where the population consists of all stores. However, the corporation could consider just a sample of its stores and compute the average revenue for this subset, which would be a statistic. Suppose that this average is found to be dropping from year to year. From this data, the corporation could glean the essential information that it is in danger of going bankrupt if this trend continues, and must act before it is too late.

Data Sources
We now examine various sources of data. Regardless of the type of source, data can be categorized as either primary data, which is data collected by an individual or organization for their own use, or secondary data, which is data collected by others (such as a government agency). Whether one collects one's own data or obtains it from elsewhere, it is essential to ensure that the data is collected from a sample that is representative of the population being studied.

Direct Observation
Direct observation is an approach to data collection in which subjects of the observation are in their natural environment. That is, there is little or no interaction between the subjects and the observer. Some examples are observing animals in the wild or people in public places. An advantage of this approach is that the subjects are not influenced by the data collection process, which helps ensure more reliable data. A disadvantage is lack of control over the sample, which makes it difficult to ensure that it is representative of the population of interest.

Experiments
A clinical trial for a new medication is an example of an experiment, which is another type of data source. In an experiment, unlike with direct observation, a statistician has more control over the makeup of the sample, to ensure that it is representative of the population of interest. On the other hand, because the participants are aware that data is being collected from them, they might (even unintentionally) be biased, thus influencing this data.

Surveys
In surveys, subjects are asked direct questions in order to produce the desired data. In this approach, it is essential to avoid two kinds of bias: bias due to the subjects not being a representative sample of the population, and bias due to the form of the questions being asked, which can substantially influence the data.

Levels of Measurement Scales
Now that we know of some sources from which data can be gathered, we also need to know about the ways in which it can be measured, and the ramifications of each.

Nominal
Nominal measurement is a purely qualitative form of measurement, in which observations are assigned to categories, such as one's gender, occupation, or state of residence. It does not make sense to perform mathematical operations or comparisons of any kind on such measurements, even if the categories are labeled numerically (for example, zip codes).

Ordinal
The next step up from nominal measurement, on the spectrum from qualitative to quantitative, is ordinal measurement. Such measurements can be either qualitative or quantitative, and they can be ranked; examples would be the order of finish in a race, or the number of stars given to a movie by a critic as a rating. However, other mathematical operations do not make sense; for instance, one cannot claim that a movie that earns four stars is twice as good as a movie that earns two stars, or that the difference in quality between any 2-star movie and any 4-star movie is the same.

Interval
Interval measurements are purely quantitative, and can be added or subtracted. An example would be temperature in degrees Fahrenheit or Celsius, since differences in temperature measurements are meaningful. However, interval measurements cannot be meaningfully multiplied or divided; that is, one hundred degrees is not considered twice as warm as fifty degrees.

Ratio
The most versatile form of measurement is ratio measurement. For such measurements, addition, subtraction, multiplication, division and comparison are valid. Examples of ratio measurement are age, weight, or salary. What distinguishes ratio measurements from interval measurements is that there is a true zero point, which gives ratios meaning. A useful rule of thumb is the "twice as much" rule: if doubling a measurement has a consistent meaning, then the measurement is a ratio measurement rather than an interval measurement.
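The four levels of measurement map loosely onto data types in R. The sketch below is an illustration under stated assumptions: the sample values and variable names are made up, and the conversion to Kelvin is added here only to show what a true zero point changes.

```r
# Nominal: unordered categories; only counting and equality are meaningful
state <- factor(c("MS", "CA", "MS"))
table(state)

# Ordinal: ordered categories; ranking comparisons are meaningful
stars <- factor(c(2, 4, 3), levels = 1:5, ordered = TRUE)
stars[1] < stars[2]              # TRUE: 2 stars ranks below 4 stars

# Interval: differences are meaningful, ratios are not
temp_f <- c(50, 100)             # degrees Fahrenheit
diff(temp_f)                     # a 50-degree difference

# Ratio: Kelvin has a true zero, so a ratio of temperatures makes sense
temp_k <- (temp_f - 32) * 5 / 9 + 273.15
ratio <- temp_k[2] / temp_k[1]   # about 1.1, not 2
```

On the Kelvin scale, 100 degrees Fahrenheit is only about 10% warmer than 50 degrees Fahrenheit, which is why the "twice as much" rule fails for interval scales.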

Topics: Frequency Distributions, Stem-and-Leaf Displays, Charts
Frequency Distributions
A frequency distribution is a table that lists specific intervals, called classes, along with the number of data observations that fall into each class. The number of observations belonging to a particular class is called a frequency.

Example
Suppose that a survey of 100 voters is taken, in which the age of each respondent is recorded.

Example, cont'd
Since voters must be at least 18 years of age, classes could be chosen as follows: 18-27, 28-37, and so on, up to 78-87, since the maximum age among all respondents is 86. The frequency distribution then lists each age range alongside the number of respondents in that range.
Table: Frequency distribution of ages of 100 voters surveyed (columns: Age Range, Number of Respondents)

Frequency Distributions in R
Suppose that the 100 ages from the preceding example are stored in a text file, called ages.txt, as a simple list of numbers separated by spaces. To create this frequency distribution in R, the following commands can be used:
> ages = scan("ages.txt")
> breaks = seq(min(ages), max(ages)+10, by=10)
> freq = table(cut(ages, breaks, right=FALSE))
> freq
[18,28) [28,38) [38,48) [48,58) [58,68) [68,78) [78,88)

Dissection of R Code
- In Windows, by default, R assumes that files are stored in your My Documents folder; otherwise, a full pathname should be specified as the argument to scan.
- The min and max functions return the minimum and maximum values, respectively, of their argument.
- The seq function returns a sequence of numbers with specified starting value, ending value, and spacing. In this case, 10 is added to the maximum value to ensure that it is included in a class.

Dissection, cont'd
- The cut function determines which class each element of its first argument belongs to, where the classes are specified by the second argument. The third argument, right=FALSE, specifies that the right endpoint of each class is not included in the class.
- Finally, the table function generates the frequency distribution from the output of cut, and the result is stored in freq.
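To see what the right argument of cut changes, the following sketch uses a small made-up vector and only two classes; the values and variable names are assumptions chosen for illustration.

```r
breaks <- c(18, 28, 38)          # two classes, for illustration only
x <- c(18, 27, 28, 37)           # hypothetical ages

# right = FALSE: the classes are [18,28) and [28,38)
left_closed <- cut(x, breaks, right = FALSE)
table(left_closed)

# Default (right = TRUE): the classes are (18,28] and (28,38],
# so the minimum value 18 falls outside every class and becomes NA
right_closed <- cut(x, breaks)
table(right_closed, useNA = "ifany")
```

With the default setting, the youngest respondent would be silently dropped from the frequency distribution, which is one reason to state right=FALSE explicitly when the first break is the minimum of the data.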

Class Selection
In determining the classes for a frequency distribution, the following guidelines should be observed:
- All classes should be of equal size, so that the number of observations in each class can be compared in a meaningful way.
- There should be between 5 and 15 classes. Using too few classes fails to give a sense of the distribution of observations, and having too many classes makes comparing classes less useful.
- Classes should not be open-ended, if possible. For example, if observations are ages, there should not be a class of "over age 50".
- Classes should be exhaustive, so that all data observations can be included.
Note that the frequency distribution in the preceding example follows these guidelines; had the classes spanned 20 years instead of 10, there would have been too few.

Variations
Some variations on a frequency distribution are:
- In a relative frequency distribution, all frequencies are divided by the total number of observations, in order to obtain the percentage of observations in each class. As before, classes should be exhaustive, so that the total of all relative frequencies is 100%.
- A cumulative frequency distribution lists, for each class, the percentage of observations that are less than or equal to the values in the class.
- A histogram is a bar graph in which the height of each bar is the number of observations in a class.
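The relative and cumulative variations described above can be computed from a frequency table with prop.table and cumsum. The sketch below is self-contained, so it generates a hypothetical stand-in for the age data (the original ages.txt values are not available here); everything else follows the commands given earlier.

```r
set.seed(1)                       # reproducible stand-in data
# Hypothetical ages: 100 values between 18 and 86, including both extremes
ages <- c(18, 86, sample(18:86, 98, replace = TRUE))

breaks <- seq(min(ages), max(ages) + 10, by = 10)
freq <- table(cut(ages, breaks, right = FALSE))

rel_freq <- prop.table(freq)      # each count divided by the total
cum_freq <- cumsum(rel_freq)      # running total of relative frequencies

round(100 * rel_freq, 1)          # percentage of observations per class
round(100 * cum_freq, 1)          # cumulative percentages; the last is 100
```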

Histograms
A histogram can easily be created in R, using the hist command. For example, from the age data used in previous examples, the command
> hist(ages)
produces the histogram shown on the next slide. With this simple usage of hist, the classes are chosen automatically; a second argument, breaks, can be used to specify the classes manually. For example,
> hist(ages, breaks=c(18,27.5,37.5,47.5,57.5,67.5,77.5,87))
produces a histogram that conforms to the frequency distribution given in the preceding example.

Figure: Histogram of age data produced in R

Stem-and-Leaf Display
A stem-and-leaf display is a table for displaying integer-valued observations in which each observation is decomposed into a leaf, which is the ones digit, and a stem, which consists of the rest of the digits. The display consists of two columns; the left column lists stems and the right column lists all leaves with their corresponding stems. An advantage of using a stem-and-leaf display is that all of the original observations are actually visible in the display, as opposed to a frequency distribution, which only lists the number of observations that fall within each class.

Figure: Stem-and-leaf display of age data
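A display of this kind can be produced in R with the built-in stem function. The sketch below uses hypothetical stand-in ages rather than the original data, so its output will not match the display from the slides.

```r
set.seed(1)                       # reproducible stand-in data
# Hypothetical ages: 100 values between 18 and 86
ages <- c(18, 86, sample(18:86, 98, replace = TRUE))

# Print a stem-and-leaf display; stems are to the left of the bar
stem(ages)
```

The optional scale argument of stem controls how finely the stems are split, which can help when many leaves share one stem.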

Pie Charts
A pie chart is a circle divided into sectors that are associated with classes. The central angle of each sector is equal to the relative frequency of the corresponding class, multiplied by 360 degrees. As a result, the size of each sector is indicative of the relative frequency of each class. It is best to also use colors to distinguish the classes. A pie chart for the age data used in previous examples is shown on the next slide. It is generated using the R command
> pie(freq)
where freq is the frequency distribution generated earlier.

Figure: Pie chart generated from frequency distribution of age data

Bar Charts
A bar chart is like a histogram, except that the height of each bar is determined by a specific data value, rather than the frequency of a class. Thus, a bar chart is used to highlight the actual values in the data set, as opposed to a pie chart, which highlights the relative sizes of classes. The bar chart shown on the next slide is generated in R from the age data using the command
> barplot(sort(ages))

Figure: Bar chart generated from sorted age data

Line Charts
A line chart is useful for illustrating a relationship between two sets of data, particularly when there is a large number of observations. Observations are plotted as points on the chart, and the x- and y-coordinates of the points are obtained from the observations of each data set. The points are then connected to help depict the relationship between the sets.

Topics: Mean, Median, Mode, Choosing a Measure
It is highly desirable to be able to characterize a data set using a single value. Suppose that a data set consists of numerical values, and that the observations are plotted as points on the real number line. Then, a number that is at the center of these points can serve as such a characterizing value. This value is called a measure of central tendency.

Mean
Given a set of n numerical observations {x_1, x_2, ..., x_n} of a population, the mean of the set is
$$\mu = \frac{x_1 + x_2 + \cdots + x_n}{n}.$$
When the observations are drawn from a sample, rather than an entire population, the mean is denoted by $\bar{x}$:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}.$$
The mean can be defined more concisely using sigma notation:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
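As a quick check of the definition, the sum-and-divide formula can be compared with R's built-in mean function. The scores below are made-up values used only for illustration.

```r
x <- c(72, 85, 64, 90, 58)       # hypothetical test scores
n <- length(x)

by_hand <- sum(x) / n            # (1/n) times the sum of the x_i
built_in <- mean(x)

# The two results agree (all.equal allows for floating-point rounding)
all.equal(by_hand, built_in)
```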

The Mean in R
To compute the mean of a data set in R, the mean function can be used. For example, with the age data used in previous examples, we have:
> mean(ages)

Weighted Mean
In some instances, a measure of central tendency needs to be computed from the values in a data set in which some values should be assigned more weight than others. This leads to the notion of a weighted mean:
$$\mu = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}.$$
The weights must all be positive.

Example
Suppose that an overall course grade is computed by weighting a homework average h by 10%, two test grades t_1 and t_2 by 25% each, and a final exam f by 40%. Then the overall grade is
$$\frac{10h + 25t_1 + 25t_2 + 40f}{10 + 25 + 25 + 40} = \frac{10h + 25t_1 + 25t_2 + 40f}{100}.$$

Weighted Mean in R
To compute a weighted mean in R, the weighted.mean function can be used. The first argument is a vector of observations, and the second argument is a vector of weights. For example, suppose the homework average is 80, the test scores are 75 and 85, and the final exam score is 90. Then, with the weights of 10%, 25%, 25%, and 40% from the preceding example, the weighted mean is
> grades <- c(80,75,85,90)
> weighted.mean(grades, c(10,25,25,40))
[1] 84
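For comparison, the same grade can be worked out by hand from the weighted mean formula, using the stated weights of 10%, 25%, 25%, and 40% and the scores 80, 75, 85, and 90:

```latex
\frac{10(80) + 25(75) + 25(85) + 40(90)}{10 + 25 + 25 + 40}
  = \frac{800 + 1875 + 2125 + 3600}{100}
  = \frac{8400}{100}
  = 84.
```

Because the weights sum to 100, dividing by their total simply converts the percentages into fractions.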

Mean of Grouped Data
When data observations are summarized in a frequency distribution, an approximation of their mean can readily be obtained. Suppose that the frequency distribution has n classes, with frequencies f_1, f_2, ..., f_n. Furthermore, suppose that the ith class has a representative value c_i; for example, it could be the average of the lower and upper bounds of the class.

Approximating the Mean
Then an approximation of the mean is
$$\mu \approx \frac{\sum_{i=1}^{n} c_i f_i}{\sum_{i=1}^{n} f_i}.$$
It follows that if each class contains only a single value, then this approximate mean is given by a weighted mean of these values, in which the frequencies are the weights.

Example
Consider the frequency distribution of age data given earlier. The classes are age ranges 18-27, 28-37, and so on. If we average the upper and lower bounds of each class, we obtain representative values of the classes. In R, this can be accomplished using the following statements, and the breaks variable that was defined earlier.
> breaks
[1] 18 28 38 48 58 68 78 88
> class_midpoints = (breaks[1:7]+(breaks[2:8]-1))/2
> class_midpoints
[1] 22.5 32.5 42.5 52.5 62.5 72.5 82.5

Vectors in R
Note that components of a vector are accessed using indices enclosed in square brackets, and that the first component of each vector has the index 1. Also, a contiguous portion of a vector can be extracted by specifying a range of indices with a colon. For example, breaks[1:7] is a vector consisting of the first 7 elements, numbered 1 through 7, of breaks.

Example, cont'd
Now, an approximate mean can be computed using the grouped-data formula for the approximate mean:
> sum(class_midpoints*freq)/sum(freq)
[1] 52.5
Note that this approximation is very close to the actual mean computed earlier with mean(ages). Also, note that vectors of the same length can be multiplied; the result is a vector of products of corresponding components of the vectors. Then, sum can be used to compute the sum of all of the components of a vector.

Median
The median of a data set is, informally, the value such that half of the values in the set are less than the median, and half are greater than the median. Specifically, if the number n of observations in the set is odd, then the median is the middle value of the set, at position (n + 1)/2, when the values are sorted. If n is even, then the median is defined to be the average of the values at positions n/2 and n/2 + 1. The median function in R can be used to compute the median of a vector of observations. For example, using the age data, we have
> median(ages)
[1] 52.5
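The odd and even cases of the definition can be written out directly. The function name median_by_hand and the sample vectors below are hypothetical, introduced only to mirror the rule just stated; in practice the built-in median function does the same job.

```r
median_by_hand <- function(x) {
  s <- sort(x)                        # the positions refer to sorted values
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]                    # odd n: the middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2     # even n: average of the two middle values
  }
}

odd  <- c(7, 1, 5)                    # sorted: 1 5 7
even <- c(7, 1, 5, 3)                 # sorted: 1 3 5 7

median_by_hand(odd)                   # 5, the value at position 2
median_by_hand(even)                  # 4, the average of 3 and 5
```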

Mode
The mode of a data set is the value that occurs most often within the set. It is possible for a data set to have more than one mode. There is no built-in function in R for computing the statistical mode (the function named mode reports how an object is stored, not its most frequent value), but if v is a vector containing all of the values of a data set, the following statements can be used to find its modes.
> vtable = table(v)
> where <- vtable==max(vtable)
> names(vtable)[where]

Code Dissection
- The first statement, vtable=table(v), creates a one-row table from v, in which the data values of v are the header names of the columns in vtable, and the values in the one row of vtable are the counts of those values in v.
- The second statement, where <- vtable==max(vtable), identifies the positions within the table at which the counts are equal to the maximum. The variable where is a logical vector, with the same number of elements as there are distinct values in v. Each element of where is TRUE if the count of the corresponding value is equal to the maximum, and FALSE otherwise.

Code Dissection, cont'd
- The third statement, names(vtable)[where], uses the names function to extract the column names from vtable, which are also the distinct values in the original data set in v. Then, the subscript [where] extracts only those column names for which the corresponding counts are equal to the maximum, which are the modes.
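The three statements can be wrapped in a small helper and tried on a vector with two modes. The function name find_modes and the sample data are assumptions made for this sketch.

```r
find_modes <- function(v) {
  vtable <- table(v)                       # counts of each distinct value
  names(vtable)[vtable == max(vtable)]     # names whose count is the maximum
}

v <- c(3, 1, 3, 2, 1)       # hypothetical data: 1 and 3 each occur twice
modes <- find_modes(v)      # character vector: "1" "3"

# names() returns character strings, so convert back if numbers are needed
as.numeric(modes)
```

Note that the modes come back as character strings, because they are taken from the table's names.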

Choosing a Measure
Given these three measures of central tendency, it is natural to ask which one should be used.
- The mean can be skewed if the data set contains outliers, thus making it an unreliable measure.
- The median, on the other hand, is not susceptible to such bias.
- Finally, the mode is not often used, except with nominal data, which cannot be compared or added anyway.
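The sensitivity of the mean to outliers is easy to demonstrate. In the sketch below, the values are made up, and a single extreme value is appended to show how differently the two measures respond.

```r
x <- c(40, 42, 45, 47, 50)          # hypothetical observations
with_outlier <- c(x, 500)           # the same data plus one extreme value

mean(x)                             # 44.8
median(x)                           # 45

mean(with_outlier)                  # about 120.7: pulled far from the bulk of the data
median(with_outlier)                # 46: barely changed
```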

Topics: Range, Variance, Standard Deviation, Quartiles
A measure of central tendency is quite limited in its ability to describe a data set. For example, the values may be clustered closely around the mean or median, or they may be widely spread out. As such, we can use a measure of dispersion that describes how far individual data values deviate from a measure of central tendency.

Range
The range of a set of data observations is simply the difference between the largest and smallest values. This measure of dispersion has the advantage that it is very easy to compute. However, it uses very little of the data, and is unduly influenced by outliers. The range function in R returns the smallest and largest values of a set of observations; the range itself is their difference, which can be obtained with diff(range(ages)).
> range(ages)
[1] 18 86

Population Variance
The variance of a population, denoted by $\sigma^2$, is obtained from the deviation of each observation from the mean:
$$\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2.$$
An equivalent formula, which is less tedious for larger populations, is
$$\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} x_j^2 - \mu^2.$$

Sample Variance
The formula for the variance of a sample, denoted by $s^2$, is slightly different:
$$s^2 = \frac{1}{N-1} \sum_{j=1}^{N} (x_j - \bar{x})^2.$$
The division by N - 1 instead of N is intended to compensate for the tendency of the sample variance, when dividing by N, to underestimate the population variance. The var function in R computes the sample variance of a vector of observations that is given as an argument.

Standard Deviation
For both a population and a sample, the standard deviation is the square root of the variance. That is, the standard deviation of a population is
$$\sigma = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2},$$
whereas for a sample, we have
$$s = \sqrt{\frac{1}{N-1} \sum_{j=1}^{N} (x_j - \bar{x})^2}.$$
An advantage of the standard deviation over the variance, as a measure of dispersion, is that the standard deviation is measured using the same units as the original data.

Standard Deviation in R
The sd function in R computes the sample standard deviation of a given vector of observations. For example, from the age data, the sample variance and sample standard deviation are obtained with
> var(ages)
> sd(ages)
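Because var and sd compute the sample versions, it can help to evaluate the population and sample formulas side by side. The data below are made up so that the arithmetic is easy to follow; the variable names are assumptions.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)            # hypothetical observations
N <- length(x)
mu <- mean(x)                             # 5

pop_var  <- sum((x - mu)^2) / N           # population variance: 32/8 = 4
shortcut <- sum(x^2) / N - mu^2           # equivalent formula: 29 - 25 = 4
samp_var <- sum((x - mu)^2) / (N - 1)     # sample variance: 32/7

pop_sd  <- sqrt(pop_var)                  # population standard deviation: 2
samp_sd <- sqrt(samp_var)                 # sample standard deviation

# R's built-ins match the sample formulas, not the population ones
all.equal(samp_var, var(x))
all.equal(samp_sd, sd(x))
```

If the population variance is needed, one option is to rescale the built-in result: var(x) * (N - 1) / N.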

Standard Deviation of Grouped Data
For grouped data in a relative frequency distribution, with n classes, class values c_j (for example, the midpoint of the values in the class), and relative frequencies f_j, j = 1, 2, ..., n, the population standard deviation can be computed as follows:
$$\sigma = \sqrt{\sum_{j=1}^{n} c_j^2 f_j - \mu^2}.$$

68 Empirical Rule

The empirical rule states that if the distribution of a set of observations is bell-shaped, meaning that the distribution is symmetric around the mean and decreases toward zero away from the mean, then approximately 68%, 95%, and 99.7% of the observations fall within 1, 2, and 3 standard deviations of the mean, respectively.
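The percentages in the empirical rule can be compared with the exact values for a normal distribution, which is one particular bell-shaped distribution. The sketch below uses R's pnorm function (the standard normal cumulative distribution function); the name within_k is an assumption for this example.

```r
# Number of standard deviations from the mean
k <- 1:3

# Probability that a normal observation lies within k standard deviations
within_k <- pnorm(k) - pnorm(-k)

print(round(100 * within_k, 1))  # 68.3 95.4 99.7
```

For real data that are only roughly bell-shaped, the observed percentages will usually differ somewhat from these values.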

69 Chebyshev's Theorem

Another rule of thumb, which applies even to distributions that are not bell-shaped or symmetric, is Chebyshev's Theorem, which states that if $k > 1$, then at least

\[ \left( 1 - \frac{1}{k^2} \right) \times 100\% \]

of the observations fall within $k$ standard deviations of the mean.
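To get a feel for the bound, the following sketch evaluates it for a few values of k. The helper chebyshev_bound is a hypothetical function written for this example, not a built-in R function.

```r
# Hypothetical helper: Chebyshev's lower bound, as a percentage, for k > 1
chebyshev_bound <- function(k) {
  stopifnot(all(k > 1))
  (1 - 1 / k^2) * 100
}

bounds <- chebyshev_bound(c(1.5, 2, 3))
print(round(bounds, 1))  # 55.6 75.0 88.9
```

These bounds are much weaker than the empirical rule's 95% and 99.7% for k = 2 and k = 3, which is the price of making no assumption about the shape of the distribution.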

70 Quartiles

Quartiles provide another measure of dispersion. They are obtained by dividing a data set into four segments that, as much as possible, contain an equal number of observations. Just as the median is the middle value of the data set, the first quartile, denoted by $Q_1$, is the median of the lower half of the data, and the third quartile, denoted by $Q_3$, is the median of the upper half of the data. There are various ways of determining what constitutes the lower and upper halves; some statisticians include the median in these halves if it is an actual observation, but some do not.
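The effect of including or excluding the median can be seen in a small example. The sketch below uses a hypothetical data set y with an odd number of observations, so that the median is an actual observation; the index arithmetic assumes odd n, and all variable names are chosen for illustration.

```r
# Hypothetical data with an odd number of observations (median = 7)
y <- sort(c(1, 3, 5, 7, 9, 11, 13))
n <- length(y)
half <- n %/% 2   # number of observations strictly below the median

# Convention 1: exclude the median from both halves
q1_excl <- median(y[1:half])
q3_excl <- median(y[(n - half + 1):n])

# Convention 2: include the median in both halves
q1_incl <- median(y[1:(half + 1)])
q3_incl <- median(y[(n - half):n])

print(c(q1_excl, q3_excl))  # 3 11
print(c(q1_incl, q3_incl))  # 4 10

# R's quantile function with its default settings, for comparison
print(quantile(y, c(0.25, 0.75)))
```

For this data set the default quantile result agrees with the convention that includes the median; the quantile function also accepts a type argument that selects among several other definitions.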

71 Interquartile Range and Outliers

Once the first and third quartiles are computed, the interquartile range, denoted by IQR, is defined by

\[ \mathrm{IQR} = Q_3 - Q_1. \]

This value is used to measure the spread of the center half of the data, and to identify outliers. A rule of thumb is to classify any values less than $Q_1 - 1.5\,\mathrm{IQR}$, or greater than $Q_3 + 1.5\,\mathrm{IQR}$, as outliers.
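The outlier rule can be sketched in a few lines of R. The vector z below is hypothetical, with one deliberately extreme value, and the quartiles are taken from R's quantile function with its default settings, which is only one of the possible quartile conventions.

```r
# Hypothetical data with one deliberately extreme value
z <- c(2, 4, 4, 4, 5, 5, 7, 30)

q1 <- unname(quantile(z, 0.25))
q3 <- unname(quantile(z, 0.75))
iqr <- q3 - q1   # same value as IQR(z)

# Fences from the 1.5 * IQR rule of thumb
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Observations outside the fences are classified as outliers
outliers <- z[z < lower | z > upper]

print(c(lower, upper))  # 1.75 7.75
print(outliers)         # 30
```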

72 Quartiles in R

The following R statements illustrate the computation of $Q_1$, $Q_3$ and the IQR, in order:

    > quantile(ages,0.25)
    25%
    > quantile(ages,0.75)
    75%
     66
    > IQR(ages)
    [1]

73 Five-point Summary

The five-point summary of a data set consists of the minimum value, $Q_1$, the median (also denoted by $Q_2$), $Q_3$, and the maximum value. It can be obtained using the summary function in R, which also reports the mean. For example, from the age data, we obtain

    > summary(ages)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
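As an illustration with made-up data (not the age data), the sketch below extracts the five values from the output of summary and compares them with R's fivenum function, which returns the five values without the mean. Note that fivenum uses its own quartile convention (Tukey's hinges), so the two can differ for other data sets.

```r
# Hypothetical data (not the age data from the slides)
y <- c(1, 3, 5, 7, 9, 11, 13)

# summary() reports six values; drop the mean to get the five-point summary
s <- summary(y)
five_from_summary <- unname(s[c("Min.", "1st Qu.", "Median", "3rd Qu.", "Max.")])

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(y)

print(five_from_summary)  # 1 4 7 10 13
print(five)               # 1 4 7 10 13
```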

74 Box-and-Whisker Plot

These measures can be used to construct a box-and-whisker plot, which displays the interquartile range and outliers. A box is drawn with opposing boundaries placed at $Q_1$ and $Q_3$, with a parallel line drawn within the box at the median. Then, perpendicular lines, which are the whiskers, are drawn from $Q_1$ to the minimum value, and from $Q_3$ to the maximum value. The length of the box is equal to the IQR, and if the length of either whisker is more than 1.5 times the length of the box, then the value at the end of that whisker is an outlier.

75 Box-and-Whisker Plots in R

A box-and-whisker plot can be produced in R using the boxplot command. For example, the plot shown on the next slide is obtained from the age data used in earlier examples using the command

    boxplot(ages)
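The following sketch draws a box-and-whisker plot for a hypothetical vector z (not the age data) and inspects the values that R flags as outliers. With its default settings, boxplot ends each whisker at the most extreme observation lying within 1.5 box lengths of the box and draws more distant observations as separate points, so the whiskers do not necessarily reach the minimum and maximum.

```r
# Hypothetical data with one deliberately extreme value
z <- c(2, 4, 4, 4, 5, 5, 7, 30)

# Draw the plot; the statistics used to draw it are returned invisibly
b <- boxplot(z, horizontal = TRUE, main = "Hypothetical data")

# Whisker ends, box boundaries, and median used in the plot
print(b$stats)

# Observations drawn as individual outlier points
print(b$out)  # 30

# The same statistics are available without plotting
print(boxplot.stats(z)$out)
```

The box boundaries used by boxplot are Tukey's hinges, which can differ slightly from the quartiles returned by quantile.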

76 Box-and-Whisker Plot Example

[Figure: Box-and-whisker plot produced from age data]
