Introduction to Statistical Data Analysis Lecture 1: Working with Data Sets


1 Introduction to Statistical Data Analysis Lecture 1: Working with Data Sets James V. Lambers Department of Mathematics The University of Southern Mississippi James V. Lambers Statistical Data Analysis 1 / 76

Introduction
This course is an introduction to statistical data analysis. Its purpose is to acquaint students with fundamental techniques for gathering data, describing data sets, and, most importantly, drawing conclusions from data. Topics covered include probability, probability distributions, sampling, confidence intervals, hypothesis testing, correlation, and regression.

The R Project
To illustrate and work with the concepts and techniques presented in this course, we will use a software tool known as R, which provides a programming environment for statistical computing and graphics. It is freely available for download from the R Project website. Throughout this course, as concepts are presented, relevant R functions and sample code will be given.

Descriptive Statistics
The purpose of descriptive statistics is to summarize and display data in such a way that it can readily be interpreted. Examples of descriptive statistics are as follows:
- The average, or mean, is a convenient way of describing a set of many numbers with just a single number.
- A chart is useful for organizing and summarizing data in meaningful ways.

Example
Consider a list of test scores in a class with many students. The average of all of these test scores is approximately 72.5, which suggests that the overall performance of the class on the test was a C.

Example, cont'd
We can also gauge the overall performance of the class with a chart in which the scores are categorized according to their letter grade (assuming straight-scale letter-grading), with columns "Range" and "Number of scores in range". The chart shows that the majority of the students earned C's or D's.

Inferential Statistics
The other, much more sophisticated branch of statistics is inferential statistics, which is used to make claims about an entire (large) population based on a (relatively small) sample of data. Related topics: confidence intervals, hypothesis testing, goodness-of-fit tests, and correlation and regression.

Example
Suppose that a pollster wanted to determine the percentage of all registered voters in California that would support a certain ballot measure. It would not be practical to question the entire population consisting of all of these voters, as there are millions of them. Instead, the pollster would question a sample consisting of a reasonable number of these voters (for example, 200 voters), and then use inferential statistics to draw a conclusion about the voting preference of the entire population based on the data obtained from the sample.

The Distinction
The essential difference between descriptive and inferential statistics lies in the size of the population about which conclusions are being made. In descriptive statistics, conclusions are made about a relatively small population based on direct observations of every member of that population. In inferential statistics, conclusions are made about a relatively large population based on descriptive statistics applied to a small sample from that population.

Ethics in Statistics
The example of inferential statistics given above, concerning a pollster, can be expanded to illustrate important aspects of ethics in statistics. In order to draw sound conclusions about a large population, it is essential that a sample of that population be representative of that population; otherwise, the sample is said to be biased.

1936 Presidential Election
A biased sample was used during the presidential election of 1936, in which a poll of a sample of voters was conducted in order to determine whether the majority would vote for Franklin D. Roosevelt, the Democratic candidate, or Alf Landon, the Republican candidate. The conclusion drawn from the poll was that Landon would win the election, when in fact Roosevelt won.

Where Did They Go Wrong?
The poll yielded an incorrect conclusion because telephone directories were used to obtain voter names, and in 1936, telephones existed primarily in more affluent households, which tended to vote Republican. That is, the method of polling led to an unintentional bias. In some cases, unfortunately, a sample can be biased intentionally, in order to reach a false conclusion that supports one's agenda.

Internet Polling
Just as telephone polling was problematic decades ago, internet polling is problematic today. It is very difficult to ensure that voters in an internet poll vote only once, and it is impossible to ensure that those who vote are actually representative of any given population. For this reason, such polls are generally labeled as unscientific, although this disclaimer is not always noted by those who read the results of such polls.

Worst Practices in
Another example of questionable or unethical uses of statistics is the tactic of emphasizing differences through display. Suppose that over a period of three years, the average price of a home in a certain city has increased from $380,000 to $390,000 to $400,000. This data can be displayed in different ways to either emphasize or de-emphasize the increase.

[Figure: Different approaches to displaying the same increase in home prices over a three-year period]

Manipulation of Axes
Note that both charts display exactly the same data, but whereas the chart on the left uses a vertical scale that has the effect of making the yearly increase seem negligible, the chart on the right uses a vertical scale that makes this same increase seem much more dramatic. People who report statistics can, unfortunately, use tactics like this to subtly influence consumers of the information that they provide.
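
The effect of the vertical scale can be reproduced with a short R sketch. The prices are those from the example; the year labels 1 to 3 and the two axis ranges are assumptions chosen only for illustration.

```r
# Home prices from the example; the year labels 1, 2, 3 are placeholders.
years  <- 1:3
prices <- c(380000, 390000, 400000)

# Two assumed vertical ranges: one starting at zero, one hugging the data.
ylim_wide   <- c(0, 450000)
ylim_narrow <- c(375000, 405000)

# Fraction of the vertical axis occupied by the increase under each scale.
share_wide   <- diff(range(prices)) / diff(ylim_wide)
share_narrow <- diff(range(prices)) / diff(ylim_narrow)

# Draw the two charts side by side.
par(mfrow = c(1, 2))
plot(years, prices, type = "b", ylim = ylim_wide,   main = "Axis from zero")
plot(years, prices, type = "b", ylim = ylim_narrow, main = "Truncated axis")
```

With these assumed ranges, the same $20,000 increase occupies under 5% of the first axis but two thirds of the second, which is why the second chart looks so much more dramatic.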

In this section, we discuss various approaches to data collection, and the ramifications of each. It is important to consider both the source of the data and the method of measurement used during its collection. First, we give some definitions.
- Data (singular: datum) are values assigned to observations that are made about a population.
- A parameter is data that describes a characteristic of a population, such as the income level of every member of the labor force within a city.
- By contrast, a statistic is data that describes a characteristic of a sample, such as the favorite candy bar of every member of a focus group.
- Information is data transformed into useful facts, typically through inferential statistics.

Example
Suppose that a large corporation that has hundreds of stores throughout the United States wants to determine the trend of its sales from year to year. The average revenue of all of its stores would be considered a parameter, where the population consists of all stores. However, the corporation could consider just a sample of its stores and compute the average revenue for this subset, which would be a statistic. Suppose that this average is found to be dropping from year to year. From this data, the corporation could glean the essential information that it is in danger of going bankrupt if this trend continues, and must act before it is too late.

Data Sources
We now examine various sources of data. Regardless of the type of source, data can be categorized as either primary data, which is data collected by an individual or organization for their own use, or secondary data, which is data collected by others (such as a government agency). Regardless of whether one collects their own data or obtains it from elsewhere, it is essential to ensure that this data is collected from a sample that is representative of the population that is being studied.

Direct Observation
Direct observation is an approach to data collection in which subjects of the observation are in their natural environment. That is, there is little or no interaction between the subjects and the observer. Some examples are observing animals in the wild or people in public places. An advantage of this approach is that the subjects are not influenced by the data collection process, which helps ensure more reliable data. A disadvantage is a lack of control over the sample, which makes it difficult to ensure that it is representative of the population of interest.

Experiments
A clinical trial for a new medication is an example of an experiment, which is another type of data source. In an experiment, unlike with direct observation, a statistician has more control over the makeup of the sample, to ensure that it is representative of the population of interest. On the other hand, because the participants are aware that data is being collected from them, they might (even unintentionally) be biased, thus influencing this data.

Surveys
In surveys, subjects are asked direct questions in order to produce the desired data. In this approach, it is essential to avoid two kinds of bias: bias due to the subjects not being a representative sample of the population, and bias due to the form of the questions being asked, which can substantially influence the data.

Levels of Measurement Scales
Now that we know of some sources from which data can be gathered, we also need to know about the ways in which it can be measured, and the ramifications of each.

Nominal
Nominal measurement is a purely qualitative form of measurement, in which observations are assigned to categories, such as one's gender, occupation, or state of residence. It does not make sense to perform mathematical operations or comparisons of any kind on such measurements, even if the categories are labeled numerically (for example, zip codes).

Ordinal
The next step up from nominal measurement, on the spectrum from qualitative to quantitative, is ordinal measurement. Such measurements can be either qualitative or quantitative, and they can be ranked; examples would be the order of finish in a race, or the number of stars given to a movie by a critic as a rating. However, other mathematical operations do not make sense; for instance, one cannot claim that a movie that earns four stars is twice as good as a movie that earns two stars, or that the difference in quality between any 2-star movie and any 4-star movie is the same.

Interval
Interval measurements are purely quantitative, and can be added or subtracted. An example would be temperature, since differences in temperature measurements are meaningful. However, interval measurements cannot be multiplied or divided; that is, one hundred degrees is not considered twice as warm as fifty degrees.

Ratio
The most versatile form of measurement is ratio measurement. For such measurements, addition, subtraction, multiplication, division and comparison are valid. Examples of ratio measurement are age, weight, or salary. What distinguishes ratio measurements from interval measurements is that there is a zero point that makes ratios have meaning. A useful rule of thumb is the "twice as much" rule: if doubling a measurement has a consistent meaning, then the measurement is a ratio measurement rather than an interval measurement.
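
One way to see the four levels side by side is to note how R itself treats them. The sketch below is only an illustration; all of the values (states, star ratings, temperatures, weights) are hypothetical.

```r
# Nominal: categories with no order (hypothetical states of residence).
state <- factor(c("MS", "CA", "MS", "TX"))
state_counts <- table(state)          # counting is meaningful; ordering is not

# Ordinal: ranked categories (hypothetical movie ratings, 1 to 4 stars).
stars <- factor(c(2, 4, 3, 4), levels = 1:4, ordered = TRUE)
two_below_four <- stars[1] < stars[2] # ranking comparisons are allowed

# Interval: differences are meaningful, ratios are not (degrees Fahrenheit).
temps_f <- c(50, 100)
temp_gap <- diff(temps_f)             # a 50-degree difference

# Ratio: a true zero makes ratios meaningful (hypothetical weights in kg).
weights_kg <- c(40, 80)
weight_ratio <- weights_kg[2] / weights_kg[1]   # twice as heavy
```

R allows `<` on the ordered factor, but the same comparison on the unordered factor `state` yields `NA` with a warning, which mirrors the distinction between ordinal and nominal measurement.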

Frequency Distributions
A frequency distribution is a table that lists specific intervals, called classes, along with the number of data observations that fall into each class. The number of observations belonging to a particular class is called a frequency.

Example
Suppose that a survey of 100 voters is taken, in which the age of each respondent is recorded.

Example, cont'd
Since voters must be at least 18 years of age, classes could be chosen as follows: 18-27, 28-37, and so on, up to 78-87, since the maximum age among all respondents is 86. The frequency distribution is then a table with columns "Age Range" and "Number of Respondents".
[Table: Frequency distribution of ages of 100 voters surveyed]

Frequency Distributions in R
Suppose that the 100 ages from the preceding example are stored in a text file, called ages.txt, as a simple list of numbers separated by spaces. To create this frequency distribution in R, the following commands can be used:
> ages = scan("ages.txt")
> breaks = seq(min(ages), max(ages)+10, by=10)
> freq = table(cut(ages, breaks, right=FALSE))
> freq
[18,28) [28,38) [38,48) [48,58) [58,68) [68,78) [78,88)

Dissection of R Code
In Windows, by default, R assumes that files are stored in your My Documents folder; otherwise, a full pathname should be specified as the argument to scan. The min and max functions return the minimum and maximum values, respectively, of their argument. The seq function returns a sequence of numbers with specified starting value, ending value, and spacing. In this case, 10 is added to the maximum value to ensure that it is included in a class.

Dissection, cont'd
The cut function determines which class each element of its first argument belongs to, where the classes are specified by the second argument. The third argument, right=FALSE, specifies that the right endpoint of each class is not included in the class. Finally, the table function generates the frequency distribution from the output of cut.

Class Selection
In determining the classes for a frequency distribution, the following guidelines should be observed:
- All classes should be of equal size, so that the number of observations in each class can be compared in a meaningful way.
- There should be between 5 and 15 classes. Using too few classes fails to give a sense of the distribution of observations, and having too many classes makes comparing classes less useful.
- Classes should not be open-ended, if possible. For example, if observations are ages, there should not be a class of "over age 50".
- Classes should be exhaustive, so that all data observations can be included.
Note that the frequency distribution in the preceding example follows these guidelines; had classes spanned 20 years instead of 10, there would have been too few.

Variations
Some variations on a frequency distribution are:
- In a relative frequency distribution, all frequencies are divided by the total number of observations, in order to obtain the percentage of observations in each class. As before, classes should be exhaustive, so that the total of all relative frequencies is 100%.
- A cumulative frequency distribution lists, for each class, the percentage of observations that are less than or equal to the values in the class.
- A histogram is a bar graph in which the height of each bar is the number of observations in a class.
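
The first two variations can be sketched in R as follows. Since ages.txt is not reproduced here, the vector ages_demo is a small hypothetical data set, and the names breaks_demo, freq_demo, rel_freq and cum_freq are introduced only for this illustration.

```r
# Hypothetical ages (not the lecture's ages.txt data).
ages_demo <- c(18, 22, 25, 31, 34, 38, 41, 47, 52, 52, 58, 63, 67, 71, 86)

# Classes [18,28), [28,38), ..., [78,88), built as in the lecture's code.
breaks_demo <- seq(min(ages_demo), max(ages_demo) + 10, by = 10)
freq_demo   <- table(cut(ages_demo, breaks_demo, right = FALSE))

# Relative frequencies: each count divided by the number of observations.
rel_freq <- freq_demo / sum(freq_demo)

# Cumulative relative frequencies: running total over the classes.
cum_freq <- cumsum(rel_freq)
```

Because the classes are exhaustive, the relative frequencies sum to 1 (that is, 100%), and the last cumulative value is also 1.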

Histograms
A histogram can easily be created in R using the hist command. For example, from the age data used in previous examples, the command hist(ages) produces the histogram shown on the next slide. With this simple usage of hist, the classes are chosen automatically; a second argument, breaks, can be used to specify the classes manually. For example, hist(ages, breaks=c(18,27.5,37.5,47.5,57.5,67.5,77.5,87)) produces a histogram that conforms to the frequency distribution given in the preceding example.

[Figure: Histogram of age data produced in R]

Stem-and-Leaf Display
A stem-and-leaf display is a table for displaying integer-valued observations in which each observation is decomposed into a leaf, which is the ones digit, and a stem, which consists of the rest of the digits. The display consists of two columns; the left column lists stems and the right column lists all leaves with their corresponding stems. An advantage of using a stem-and-leaf display is that all of the original observations are actually visible in the display, as opposed to a frequency distribution that only lists the number of observations that fall within each class.

[Figure: Stem-and-Leaf Display of Age Data]
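
R's built-in stem function prints a display of this kind. The sketch below uses a small hypothetical data set, since the lecture's age data is not reproduced here; the exact layout depends on how R chooses to split the stems.

```r
# Hypothetical ages (not the lecture's ages.txt data).
ages_demo <- c(18, 22, 25, 31, 34, 38, 41, 47, 52, 52, 58, 63, 67, 71, 86)

# Print a stem-and-leaf display to the console.
stem(ages_demo)

# Capture the same display as text, e.g. to inspect it line by line.
display_lines <- capture.output(stem(ages_demo))
```

The optional scale argument of stem can be increased or decreased to split the stems more or less finely.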

Pie Charts
A pie chart is a circle divided into sectors that are associated with classes. The central angle of each sector is equal to the relative frequency of the corresponding class, multiplied by 360 degrees. As a result, the size of each sector is indicative of the relative frequency of each class. It is best to also use colors to distinguish the classes. A pie chart for the age data used in previous examples is shown on the next slide. It is generated using the R command pie(freq), where freq is the frequency distribution generated earlier.

[Figure: Pie chart generated from frequency distribution of age data]
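
The central angles can be checked directly. In this sketch the class counts are hypothetical, and the names freq_demo, rel_freq and angles are introduced only for illustration.

```r
# Hypothetical class frequencies (not the lecture's age data).
freq_demo <- c(3, 2, 3, 2, 3, 1, 1)
names(freq_demo) <- c("18-27", "28-37", "38-47", "48-57",
                      "58-67", "68-77", "78-87")

# Central angle of each sector: relative frequency times 360 degrees.
rel_freq <- freq_demo / sum(freq_demo)
angles   <- rel_freq * 360

# Draw the chart; pie() derives the same proportions from the raw counts.
pie(freq_demo)
```

The angles always add up to 360 degrees, so a class holding one fifth of the observations receives a 72-degree sector.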

Bar Charts
A bar chart is like a histogram, except that the height of each bar is determined by a specific data value, rather than the frequency of a class. Thus, a bar chart is used to highlight the actual values in the data set, as opposed to a pie chart, which highlights the relative sizes of classes. The bar chart shown on the next slide is generated in R from the age data using the command barplot(sort(ages)).

[Figure: Bar chart generated from sorted age data]

Line Charts
A line chart is useful for illustrating a relationship between two sets of data, particularly when there is a large number of observations. Observations are plotted as points on the chart, and the x- and y-coordinates of the points are obtained from the observations of each data set. The points are then connected to help depict the relationship between the sets.

It is highly desirable to be able to characterize a data set using a single value. Suppose that a data set consists of numerical values, and that the observations are plotted as points on the real number line. Then, a number that is at the center of these points can serve as such a characterizing value. This value is called a measure of central tendency.

Mean
Given a set of $n$ numerical observations $\{x_1, x_2, \ldots, x_n\}$ of a population, the mean of the set is
$$\mu = \frac{x_1 + x_2 + \cdots + x_n}{n}.$$
When the observations are drawn from a sample, rather than an entire population, the mean is denoted by $\bar{x}$:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}.$$
The mean can be defined more concisely using sigma notation:
$$\mu = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

The Mean in R
To compute the mean of a data set in R, the mean function can be used. For example, with the age data used in the previous examples, we have:
> mean(ages)

Weighted Mean
In some instances, a measure of central tendency needs to be computed from the values in a data set in which some values should be assigned more weight than others. This leads to the notion of a weighted mean
$$\mu = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}.$$
The weights must all be positive.

Example
Suppose that an overall course grade is computed by weighting a homework average $h$ by 10%, two test grades $t_1$ and $t_2$ by 25% each, and a final exam $f$ by 40%. Then the overall grade is
$$\frac{10h + 25t_1 + 25t_2 + 40f}{100}.$$

Weighted Mean in R
To compute a weighted mean in R, the weighted.mean function can be used. The first argument is a vector of observations, and the second argument is a vector of weights. For example, suppose the homework average is 80, the test scores are 75 and 85, and the final exam score is 90. Then, with the weights of 10%, 25%, 25% and 40% from the preceding example, the weighted mean is
> grades <- c(80,75,85,90)
> weighted.mean(grades,c(10,25,25,40))
[1] 84
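
As a check, the weighted mean formula can be applied by hand and compared with weighted.mean. This sketch uses the grades from the example and the weights 10, 25, 25 and 40 stated for the course grade; the variable names are introduced only for illustration.

```r
# Grades from the example: homework, test 1, test 2, final exam.
grades  <- c(80, 75, 85, 90)

# Weights as percentages, as stated in the course-grade example.
weights <- c(10, 25, 25, 40)

# Weighted mean computed directly from the formula.
by_formula <- sum(weights * grades) / sum(weights)

# The same quantity from the built-in function.
by_function <- weighted.mean(grades, weights)

# Rescaling the weights (here to fractions of 1) does not change the result,
# because the formula divides by the sum of the weights.
by_fractions <- weighted.mean(grades, weights / 100)
```

All three values agree: (800 + 1875 + 2125 + 3600) / 100 = 84.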

Mean of Grouped Data
When data observations are summarized in a frequency distribution, an approximation of their mean can readily be obtained. Suppose that the frequency distribution has $n$ classes, with frequencies $f_1, f_2, \ldots, f_n$. Furthermore, suppose that the $i$th class has a representative value $c_i$; for example, it could be the average of the lower and upper bounds of the class.

Approximating the Mean
Then an approximation of the mean is
$$\mu \approx \frac{\sum_{i=1}^{n} c_i f_i}{\sum_{i=1}^{n} f_i}.$$
It follows that if each class contains only a single value, then this approximate mean is given by a weighted mean of these values, in which the frequencies are the weights.

Example
Consider the frequency distribution of age data given earlier. The classes are age ranges 18-27, 28-37, and so on. If we average the upper and lower bounds of each class, we obtain representative values of the classes. In R, this can be accomplished using the following statements, and the breaks variable that was defined earlier.
> breaks
[1] 18 28 38 48 58 68 78 88
> class_midpoints=(breaks[1:7]+(breaks[2:8]-1))/2
> class_midpoints
[1] 22.5 32.5 42.5 52.5 62.5 72.5 82.5

Vectors in R
Note that components of a vector are accessed using indices enclosed in square brackets, and that the first component of each vector has the index 1. Also, a contiguous portion of a vector can be extracted by specifying a range of indices with a colon. For example, breaks[1:7] is a vector consisting of the first 7 elements, numbered 1 through 7, of breaks.

Example, cont'd
Now, an approximate mean can be computed using the formula for the approximate mean of grouped data:
> sum(class_midpoints*freq)/sum(freq)
[1] 52.5
Note that this approximation is very close to the actual mean computed earlier. Also, note that vectors of the same length can be multiplied; the result is a vector of products of corresponding components of the vectors. Then, sum can be used to compute the sum of all of the components of a vector.
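
The whole grouped-mean computation can be run end to end on a small data set. The ages below are hypothetical (they are not the lecture's ages.txt data), and the names ending in _demo are introduced only for this sketch.

```r
# Hypothetical ages (not the lecture's ages.txt data).
ages_demo <- c(18, 22, 25, 31, 34, 38, 41, 47, 52, 52, 58, 63, 67, 71, 86)

# Classes [18,28), ..., [78,88) and their frequencies, as in the lecture.
breaks_demo <- seq(min(ages_demo), max(ages_demo) + 10, by = 10)
freq_demo   <- as.vector(table(cut(ages_demo, breaks_demo, right = FALSE)))

# Representative value of each class: average of its bounds (18-27 -> 22.5).
k <- length(breaks_demo) - 1
class_midpoints <- (breaks_demo[1:k] + (breaks_demo[2:(k + 1)] - 1)) / 2

# Approximate mean from the grouped data, and the exact mean for comparison.
approx_mean <- sum(class_midpoints * freq_demo) / sum(freq_demo)
exact_mean  <- mean(ages_demo)
```

For this data the exact mean is 47 and the grouped approximation is about 47.17, so little is lost by working from the frequency distribution alone.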

Median
The median of a data set is, informally, the value such that half of the values in the set are less than the median, and half are greater than the median. Specifically, if the number $n$ of observations in the set is odd, then the median is the middle value of the set, at position $(n + 1)/2$, if the values are sorted. If $n$ is even, then the median is defined to be the average of the values at positions $n/2$ and $n/2 + 1$. The median function in R can be used to compute the median of a vector of observations. For example, using the age data, we have
> median(ages)
[1] 52.5
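
The definition translates directly into a few lines of R. The function name median_by_definition and the sample vectors are hypothetical, introduced only to illustrate the odd and even cases.

```r
# Median computed from the definition: sort, then take the middle value
# (odd n) or the average of the two middle values (even n).
median_by_definition <- function(x) {
  s <- sort(x)
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2
  }
}

odd_case  <- median_by_definition(c(7, 1, 5))      # middle of 1, 5, 7
even_case <- median_by_definition(c(7, 1, 5, 3))   # average of 3 and 5
```

Both results match what the built-in median function returns for the same vectors.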

Mode
The mode of a data set is the value that occurs most often within the set. It is possible for a data set to have more than one mode. There is no built-in function in R for computing this kind of mode (the built-in mode function reports how an object is stored instead), but if v is a vector containing all of the values of a data set, the following statements can be used to find its modes.
> vtable=table(v)
> where <- vtable==max(vtable)
> names(vtable)[where]

Code Dissection
The first statement, vtable=table(v), creates a one-row table from v, in which the data values of v are the header names of the columns in vtable, and the values in the one row of vtable are the counts of those values in v. The second statement, where <- vtable==max(vtable), finds the indices within the table at which the counts are equal to the maximum. The variable where is a logical vector, with the same number of elements as there are distinct values in v. Each element of where is TRUE if the count of the corresponding value is equal to the maximum, and FALSE otherwise.

Code Dissection, cont'd
The third statement, names(vtable)[where], uses the names function to extract the column names from vtable, which are also the distinct values in the original data set in v. Then, the subscript [where] extracts only those column names in which the corresponding counts are equal to the maximum, which are the modes.
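
The three statements can be wrapped in a small helper. The function name find_modes and the sample data are hypothetical; the body is the same technique dissected above.

```r
# Return the mode(s) of v, using the table-based approach from the lecture.
find_modes <- function(v) {
  vtable <- table(v)                 # counts of each distinct value
  where  <- vtable == max(vtable)    # TRUE where the count is maximal
  names(vtable)[where]               # the corresponding value labels
}

# Hypothetical data in which 31 and 52 each occur twice.
modes <- find_modes(c(52, 31, 52, 47, 31, 60))
```

Note that the modes come back as character strings, because they are taken from the table's names; as.numeric(modes) converts them back to numbers when the data is numeric.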

Choosing a Measure
Given these three measures of central tendency, it is natural to ask which one should be used. The mean can be skewed if the data set contains outliers, thus making it an unreliable measure. The median, on the other hand, is not susceptible to such bias. Finally, the mode is not often used, except with nominal data, which cannot be compared or added anyway.
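
A small experiment shows how differently the mean and median react to an outlier. The salaries below are hypothetical values, in thousands of dollars.

```r
# Hypothetical salaries (in thousands), without and with one extreme value.
salaries     <- c(30, 32, 35, 38, 40)
with_outlier <- c(salaries, 1000)

mean_before   <- mean(salaries)        # 35
median_before <- median(salaries)      # 35

mean_after    <- mean(with_outlier)    # pulled far upward by the outlier
median_after  <- median(with_outlier)  # 36.5, barely changed
```

A single extreme value moves the mean from 35 to roughly 195.8, while the median only moves from 35 to 36.5.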

A measure of central tendency is quite limited in its ability to describe a data set. For example, the values may be clustered closely around the mean or median, or they may be widely spread out. As such, we can use a measure of dispersion that describes how far individual data values deviate from a measure of central tendency.

Range
The range of a set of data observations is simply the difference between the largest and smallest values. This measure of dispersion has the advantage that it is very easy to compute. However, it uses very little of the data, and is unduly influenced by outliers. The range function in R returns the smallest and largest values of a set of observations, from which the range is obtained by subtraction.
> range(ages)
[1] 18 86

Population Variance
The variance of a population, denoted by $\sigma^2$, is obtained from the deviation of each observation from the mean:
$$\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2.$$
An equivalent formula, which is less tedious for larger populations, is
$$\sigma^2 = \frac{1}{N} \sum_{j=1}^{N} x_j^2 - \mu^2.$$

Sample Variance
The formula for the variance of a sample, denoted by $s^2$, is slightly different:
$$s^2 = \frac{1}{N-1} \sum_{j=1}^{N} (x_j - \bar{x})^2.$$
The division by $N - 1$ instead of $N$ is intended to compensate for the tendency of the sample variance, when dividing by $N$, to underestimate the population variance. The var function in R computes the sample variance of a vector of observations that is given as an argument.

Standard Deviation
For both a population and a sample, the standard deviation is the square root of the variance. That is, the standard deviation of a population is
$$\sigma = \sqrt{\frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2},$$
whereas for a sample, we have
$$s = \sqrt{\frac{1}{N-1} \sum_{j=1}^{N} (x_j - \bar{x})^2}.$$
An advantage of the standard deviation over the variance, as a measure of dispersion, is that the standard deviation is measured using the same units as the original data.

Standard Deviation in R
The sd function in R computes the sample standard deviation of a given vector of observations. For example, the following commands compute the sample variance and sample standard deviation of the age data:
> var(ages)
> sd(ages)
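
The population and sample formulas can be compared on a small data set. The vector x below is hypothetical, chosen so that the arithmetic is easy to follow by hand.

```r
# Hypothetical observations with mean 5.
x  <- c(2, 4, 4, 4, 5, 5, 7, 9)
N  <- length(x)
mu <- mean(x)

# Population variance: both forms of the formula give the same value.
pop_var     <- sum((x - mu)^2) / N        # 32 / 8 = 4
pop_var_alt <- sum(x^2) / N - mu^2        # 29 - 25 = 4

# Sample variance and standard deviation: divide by N - 1 instead of N.
samp_var <- sum((x - mu)^2) / (N - 1)     # 32 / 7
samp_sd  <- sqrt(samp_var)
```

R's var and sd return the sample versions, so var(x) equals samp_var rather than pop_var; multiplying var(x) by (N - 1) / N recovers the population variance.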

Standard Deviation of Grouped Data
For grouped data in a relative frequency distribution, with $n$ classes, class values $c_j$ (for example, the midpoint of the values in the class), and relative frequencies $f_j$, $j = 1, 2, \ldots, n$, the population standard deviation can be computed as follows:
$$\sigma = \sqrt{\sum_{j=1}^{n} c_j^2 f_j - \mu^2}.$$
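
A minimal sketch of the grouped formula follows. The class values and relative frequencies are hypothetical, and the mean is taken to be the grouped mean, that is, the sum of the class values weighted by their relative frequencies.

```r
# Hypothetical class values and relative frequencies (which sum to 1).
class_values <- c(10, 20, 30)
rel_freq     <- c(0.25, 0.50, 0.25)

# Grouped mean: class values weighted by relative frequency.
mu_grouped <- sum(class_values * rel_freq)                  # 20

# Grouped population standard deviation from the formula.
sigma_grouped <- sqrt(sum(class_values^2 * rel_freq) - mu_grouped^2)
```

Here the weighted sum of squares is 450 and the squared mean is 400, so the standard deviation is the square root of 50, or about 7.07.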

68 Empirical Rule The empirical rule states that if the distribution of a set of observations is bell-shaped, meaning that the distribution is symmetric around the mean and decreases toward zero away from the mean, then approximately 68%, 95%, and 99.7% of the observations fall within 1, 2, and 3 standard deviations of the mean, respectively.
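The percentages in the empirical rule match the probabilities for an exactly normal distribution, which can be computed with R's `pnorm` function. The sketch below treats the normal distribution as the idealized bell shape; real data will only match these values approximately.

```r
# Number of standard deviations from the mean
k <- 1:3

# Probability that a normal observation lies within k standard deviations
within_k <- pnorm(k) - pnorm(-k)

# Express as percentages, rounded to one decimal place
print(round(100 * within_k, 1))
```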

69 Chebyshev's Theorem Another rule of thumb, which applies even to distributions that are not bell-shaped or symmetric, is Chebyshev's Theorem, which states that if $k > 1$, then at least
$$\left(1 - \frac{1}{k^2}\right) \times 100\%$$
of the observations fall within $k$ standard deviations of the mean.
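The bound can be wrapped in a small helper and compared against a deliberately skewed data set. The function name `chebyshev_bound` and the vector `x` are hypothetical, introduced only for this sketch; the population standard deviation (divisor $N$) is used for the check.

```r
# Hypothetical helper: Chebyshev's lower bound, as a percentage, for k > 1
chebyshev_bound <- function(k) {
  stopifnot(all(k > 1))
  (1 - 1 / k^2) * 100
}

# Hypothetical skewed data set with one extreme value (illustrative only)
x <- c(1, 1, 2, 2, 3, 3, 4, 50)
mu <- mean(x)
sigma <- sqrt(mean((x - mu)^2))  # population standard deviation

# Observed percentage of values within k = 2 standard deviations of the mean
k <- 2
observed_pct <- 100 * mean(abs(x - mu) <= k * sigma)

print(c(bound = chebyshev_bound(k), observed = observed_pct))
```

The observed percentage can exceed the bound considerably; the theorem only guarantees a minimum.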

70 Quartiles Dispersion can also be measured using quartiles, which are obtained by dividing a data set into four segments that, as much as possible, contain an equal number of observations. Just as the median is the middle value of the data set, the first quartile, denoted by $Q_1$, is the median of the lower half of the data, and the third quartile, denoted by $Q_3$, is the median of the upper half of the data. There are various ways of determining what constitutes the lower and upper halves; some statisticians include the median in these halves if it is an actual observation, but some do not.
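The two conventions for forming the halves can give different quartiles when the number of observations is odd. The sketch below applies both to a hypothetical data set of seven values and compares them with R's `quantile` function at its default settings.

```r
# Hypothetical data with an odd number of observations (illustrative only)
x <- sort(c(7, 1, 3, 9, 5, 11, 13))
n <- length(x)
half <- floor(n / 2)

# Convention A: exclude the median from both halves
q1_excl <- median(x[1:half])
q3_excl <- median(x[(n - half + 1):n])

# Convention B: include the median in both halves
q1_incl <- median(x[1:ceiling(n / 2)])
q3_incl <- median(x[(half + 1):n])

# R's default quantile algorithm, for comparison
q_default <- unname(quantile(x, c(0.25, 0.75)))

print(c(q1_excl = q1_excl, q3_excl = q3_excl,
        q1_incl = q1_incl, q3_incl = q3_incl))
print(q_default)
```

For this particular data set, the default `quantile` result agrees with the convention that includes the median; that agreement is not guaranteed for other sample sizes.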

71 Interquartile Range and Outliers Once the first and third quartiles are computed, the interquartile range, denoted by IQR, is defined by
$$\mathrm{IQR} = Q_3 - Q_1.$$
This value is used to measure the spread of the center half of the data, and to identify outliers. A rule of thumb is to classify any values less than $Q_1 - 1.5\,\mathrm{IQR}$, or greater than $Q_3 + 1.5\,\mathrm{IQR}$, as outliers.
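The outlier rule of thumb can be sketched in a few lines of R. The data vector `x` is hypothetical, and the quartiles come from `quantile` with its default algorithm, which is only one of several quartile conventions.

```r
# Hypothetical data with one suspiciously large value (illustrative only)
x <- c(10, 12, 13, 14, 15, 16, 18, 40)

# Quartiles using R's default quantile algorithm
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1  # same value as IQR(x)

# Fences from the 1.5 * IQR rule of thumb
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers <- x[x < lower_fence | x > upper_fence]

print(c(lower = lower_fence, upper = upper_fence))
print(outliers)
```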

72 Quartiles in R The following R statements illustrate the computation of $Q_1$, $Q_3$ and the IQR, in order:
> quantile(ages,0.25)
25%
> quantile(ages,0.75)
75%
66
> IQR(ages)
[1]

73 Five-point Summary The five-point summary of a data set consists of the minimum value, $Q_1$, the median (also denoted by $Q_2$), $Q_3$, and the maximum value. It can be obtained using the summary function in R. For example, from the age data, we obtain
> summary(ages)
Min. 1st Qu. Median Mean 3rd Qu. Max.
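As the column headings indicate, `summary` reports the mean in addition to the five summary values. The sketch below, on a hypothetical vector `x`, extracts just the five-point summary from the `summary` output and compares it with values assembled by hand.

```r
# Hypothetical data (illustrative only)
x <- c(1, 3, 5, 7, 9, 11, 13)

# summary() returns six named values, including the mean
s <- summary(x)

# Keep only the five-point summary entries
five_from_summary <- as.numeric(s[c("Min.", "1st Qu.", "Median", "3rd Qu.", "Max.")])

# The same five values assembled directly
five_manual <- unname(c(min(x), quantile(x, 0.25), median(x),
                        quantile(x, 0.75), max(x)))

print(s)
print(five_manual)
```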

74 Box-and-Whisker Plot These measures can be used to construct a box-and-whisker plot, which displays the interquartile range and outliers. A box is drawn with opposing boundaries placed at $Q_1$ and $Q_3$, with a parallel line drawn within the box at the median. Then, perpendicular lines, which are the whiskers, are drawn from $Q_1$ to the minimum value, and from $Q_3$ to the maximum value. The length of the box is equal to the IQR, and if the length of either of the whiskers is more than 1.5 times the length of the box, then the value at the end of that whisker is an outlier.

75 Box-and-Whisker Plots in R A box-and-whisker plot can be produced in R using the boxplot command. For example, the plot shown on the next slide is obtained from the age data used in earlier examples using the command boxplot(ages)
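The numbers behind a box-and-whisker plot can be inspected without drawing it by calling `boxplot.stats`. The sketch below uses a hypothetical vector `x`. Note that with R's default settings the whiskers stop at the most extreme observations lying within 1.5 box lengths of the box, and values beyond that are reported separately as outliers rather than being placed at the whisker ends.

```r
# Hypothetical data with one extreme value (illustrative only)
x <- c(10, 12, 13, 14, 15, 16, 18, 40)

# Compute the plotted quantities without opening a graphics device
b <- grDevices::boxplot.stats(x)

# b$stats: lower whisker end, lower box edge, median, upper box edge,
#          upper whisker end
print(b$stats)

# b$out: points that would be drawn individually as outliers
print(b$out)

# To draw the plot itself: boxplot(x)
```

The box edges used by `boxplot.stats` are hinges, which may differ slightly from the quartiles returned by `quantile`.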

76 Box-and-Whisker Plot Example [Figure: box-and-whisker plot produced from the age data]


More information

Quantitative Methods Chapter 0: Review of Basic Concepts 0.1 Business Applications (II) 0.2 Business Applications (III)

Quantitative Methods Chapter 0: Review of Basic Concepts 0.1 Business Applications (II) 0.2 Business Applications (III) Quantitative Methods Chapter 0: Review of Basic Concepts 0.1 Business Applications (II) 0.1.1 Simple Interest 0.2 Business Applications (III) 0.2.1 Expenses Involved in Buying a Car 0.2.2 Expenses Involved

More information

Statistics and parameters

Statistics and parameters Statistics and parameters Tables, histograms and other charts are used to summarize large amounts of data. Often, an even more extreme summary is desirable. Statistics and parameters are numbers that characterize

More information

Exam: practice test 1 MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Exam: practice test 1 MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Exam: practice test MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Solve the problem. ) Using the information in the table on home sale prices in

More information

What are the mean, median, and mode for the data set below? Step 1

What are the mean, median, and mode for the data set below? Step 1 Unit 11 Review Analyzing Data Name Per The mean is the average of the values. The median is the middle value(s) when the values are listed in order. The mode is the most common value(s). What are the mean,

More information

1.3.1 Measuring Center: The Mean

1.3.1 Measuring Center: The Mean 1.3.1 Measuring Center: The Mean Mean - The arithmetic average. To find the mean (pronounced x bar) of a set of observations, add their values and divide by the number of observations. If the n observations

More information

1. Exploratory Data Analysis

1. Exploratory Data Analysis 1. Exploratory Data Analysis 1.1 Methods of Displaying Data A visual display aids understanding and can highlight features which may be worth exploring more formally. Displays should have impact and be

More information

Chapter 1. Looking at Data

Chapter 1. Looking at Data Chapter 1 Looking at Data Types of variables Looking at Data Be sure that each variable really does measure what you want it to. A poor choice of variables can lead to misleading conclusions!! For example,

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations:

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations: Measures of center The mean The mean of a distribution is the arithmetic average of the observations: x = x 1 + + x n n n = 1 x i n i=1 The median The median is the midpoint of a distribution: the number

More information

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization.

Ø Set of mutually exclusive categories. Ø Classify or categorize subject. Ø No meaningful order to categorization. Statistical Tools in Evaluation HPS 41 Fall 213 Dr. Joe G. Schmalfeldt Types of Scores Continuous Scores scores with a potentially infinite number of values. Discrete Scores scores limited to a specific

More information

Σ x i. Sigma Notation

Σ x i. Sigma Notation Sigma Notation The mathematical notation that is used most often in the formulation of statistics is the summation notation The uppercase Greek letter Σ (sigma) is used as shorthand, as a way to indicate

More information

Math 221, REVIEW, Instructor: Susan Sun Nunamaker

Math 221, REVIEW, Instructor: Susan Sun Nunamaker Math 221, REVIEW, Instructor: Susan Sun Nunamaker Good Luck & Contact me through through e-mail if you have any questions. 1. Bar graphs can only be vertical. a. true b. false 2.

More information

Chapter2 Description of samples and populations. 2.1 Introduction.

Chapter2 Description of samples and populations. 2.1 Introduction. Chapter2 Description of samples and populations. 2.1 Introduction. Statistics=science of analyzing data. Information collected (data) is gathered in terms of variables (characteristics of a subject that

More information