Unit 1 Summarizing Data

Size: px
Start display at page:

Download "Unit 1 Summarizing Data"

Transcription

1 PubHtlth 540 Fall Summarizing Page 1 of 54 Unit 1 Summarizing It is difficult to understand why statisticians commonly limit their enquiries to averages, and do not revel in more comprehensive views. Their souls seem as dull as the charm of variety as that of the native of one of our flat English counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be got rid of at once - Sir Francis Galton (England, ) This unit introduces variables and the variety of types of data possible (nominal, ordinal, interval, and ratio). Graphical and numerical ways of summarizing data are described. Graphs are encouraged and it is emphasized that these should be as simple and as clear as possible. Numerical summaries of data include those that describe central tendency (eg mode, mean, median), those that describe dispersion (eg range and standard deviation), and those that describe the shape of the distribution (eg 25 th and 75 th percentiles).

2 PubHtlth 540 Fall Summarizing Page 2 of 54 Table of Contents Topics 1. Unit Roadmap.. 2. Learning Objectives. 3. Variables and Types of. 4. Summaries for Qualitative. a. Frequency Table, Relative Frequency Table b. The Bar Chart c. The Pie Chart 5. Graphical Summaries for Quantitative... a. The Histogram b. The Frequency Polygon.. c. The Cumulative Frequency Polygon d. Percentiles (Quantiles) e. Five Number Summary f. Interquartile Range, IQR.. g. Quantile Quantile Plot.. h. Stem and Leaf Diagram.. i. Box and Whisker Plot The Summation Notation. 7. Numerical Summaries for Quantitative -Central Tendency a. The mode.. b. The mean.. c. The mean as a balancing point and skewness.. d. The mean of grouped data e. The median Numerical Summaries for Quantitative - Dispersion. a. Variance. b. Standard Deviation c. Median Absolute Deviation from Median. d. Standard Deviation v Standard Error. e. A Feel for Sampling Distributions f. The Coefficient of Variation. g. The Range

3 PubHtlth 540 Fall Summarizing Page 3 of Unit Roadmap / Populations Unit 1. Summarizing Observation and data are not the same. Think about it. What you record as data is only some of what you actually observe!! Thus, data are the result of selection. are the values you obtain by measurement. A variable is something whose value can vary. The purpose of summarizing data is to communicate the relevant aspects of the data. Tip Ask yourself, what is relevant? Goal - Aim for summaries that are simple, clear, and thorough. Tip Avoid misleading summaries. Relationships

4 PubHtlth 540 Fall Summarizing Page 4 of Learning Objectives When you have finished this unit, you should be able to:! Explain the distinction between variable and data value.! Explain the distinction between qualitative and quantitative data.! Identify the type of variable represented by a variable and its data values.! Understand and know when and how to construct bar charts. Tip! -Pie charts are not recommended.! Understand and know how to compute: percentile, five number summary, and interquartile range, IQR.! Understand and know when and how to construct the following graphical summaries for quantitative data: histograms, frequency and cumulative frequency polygons, quantilequantile plot, stem and leaf diagrams, and box and whisker plots.! Understand and know how to compute summary measures of central tendency: mode, mean, median.! Understand and know how to compute other summary measures of dispersion: range, interquartile range, standard deviation, sample variance, standard error.! Understand somewhat the distinction between standard deviation and standard error Note We will discuss this again in Unit 2 (Introduction to Probability) and in Unit 3 (Populations and s).! Understand the importance of the type of data and the shape of the data distribution when choosing which data summary to obtain.

5 PubHtlth 540 Fall Summarizing Page 5 of Variables and Types of can be of different types, and it matters Variables versus A variable is something whose value can vary. It is a characteristic that is being measured. Examples of variables are: - AGE - SEX - BLOOD TYPE A data value is the realization (a number or text response) that you obtain upon measurement. Examples of data values are: - 54 years - female - A Consider the following little data set that is stored in a spreadsheet: Variables are the column headings subject, age, sex, bloodtype subject age sex bloodtype 1 54 female A 2 32 male B 3 24 female AB values are the table cell entries 54, female, A, etc. This data table (spreadsheet) has three observations (rows), four variables, and 12 data values.

6 PubHtlth 540 Fall Summarizing Page 6 of 54 The different data types are distinct because the scales of measurement are distinct. There are a variety of schemes for organizing distinct data types. Nevertheless, they all capture the point that differences in scale of measurement are what distinguish distinct data types.! Daniel, Wayne W ( Biostatistics A Foundation for Analysis in the Health Sciences ) classifies data types as follows: All Types Qualitative attributes Quantitative numbers nominal ordinal discrete continuous Values are names that are unordered categories Values are names that are ordered categories Values are integer values 0, 1, 2 on a proper numeric scale Values are a measured number of units, including possible decimal values, on a continuous scale No units No units Counted units Measured units Example: Example: Example: Example: GENDER PAIN LEVEL NUMBER OF VISITS WEIGHT Male, female mild, moderate, severe 0, 1, 2, etc lbs, etc.

7 PubHtlth 540 Fall Summarizing Page 7 of 54 The distinction between qualitative versus quantitative is straightforward: Qualitative: Attributes, characteristics that cannot be reported as numbers Quantative: Numbers Example - To describe a flower as pretty is a qualitative assessment while to record a child s age as 11 years is a quantitative measurement. Consider this We can reasonably refer to the child s 22 year old cousin as being twice as old as the child whereas we cannot reasonably describe an orchid as being twice as pretty as a dandelion. We encounter similar stumbling blocks in statistical work. Depending on the type of the variable, its scale of measurement type, some statistical methods are meaningful while others are not. QUALITATIVE Nominal Scale: Values are names which cannot be ordered. Example: Cause of Death Cancer Heart Attack Accident Other Example: Gender Male Female Example: Race/Ethnicity Black White Latino Other Other Examples: Eye Color, Type of Car, University Attended, Occupation

8 PubHtlth 540 Fall Summarizing Page 8 of 54 QUALITATIVE Ordinal Scale: Values are attributes (names) that are naturally ordered. Example: Size of Container Small Medium Large Example: Pain Level None Mild Moderate Severe For analysis in the computer, both nominal and ordinal data might be stored using numbers rather than text. Example of nominal: Race/Ethnicity 1 = Black 2 = White 3 = Latino 4 = Other Example of ordinal: Pain Level 1 = None 2 = Mild 3 =Moderate 4 = Severe Nominal - The numbers have NO meaning They are labels ONLY Ordinal The numbers have LIMITED meaning 4 > 3 > 2 > 1 is all we know apart from their utility as labels.

9 PubHtlth 540 Fall Summarizing Page 9 of 54 QUANTITATIVE Discrete Scale: Values are counts of the number of times some event occurred and are thus whole numbers: 0, 1, 2, 3, etc. Examples: Number of children a woman has had Number of clinic visits made in one year The numbers are meaningful. We can actually compute with these numbers. QUANTITATIVE Continuous: A further classification of data types is possible for quantitative data that are continuous continuous Interval Ratio QUANTITATIVE Continuous " Interval ( no true zero ): Continuous interval data are generally measured on a continuum and differences between any two numbers on the scale are of known size but there is no true zero. Example: Temperature in F on 4 successive days Day: A B C D Temp F: degrees difference makes sense. For these data, not only is day A with 50 cooler than day D with 65, but it is 15 cooler. Also, day A is cooler than day B by the same amount that day C is cooler than day D (i.e., 5 ). 0 degrees cannot be interpreted as absence of temperature. In fact, we think of 0 degrees as quite cold! Or, we might think of it as the temperature at which molecules are no longer in motion. Either way, it s not the same as 0 apples or 0 Santa Claus. Thinking about mathematics, for data that are continuous and interval (such as temperature and time), the value 0" is arbitrary and doesn't reflect absence of the attribute.

10 PubHtlth 540 Fall Summarizing Page 10 of 54 QUANTITATIVE Continuous " Ratio ( meaningful zero ): Continuous ratio data are also measured on a meaningful continuum. The distinction is that ratio data have a meaningful zero point. Example: Weight in pounds of 6 individuals 136, 124, 148, 118, 125, 142 Note on meaningfulness of ratio - Someone who weighs 142 pounds is two times as heavy as someone else who weighs 71 pounds. This is true even if weight had been measured in kilograms. In the sections that follow, we will see that the possibilities for meaningful description (tables, charts, means, variances, etc) are lesser or greater depending on the scale of measurement. The chart on the next page gives a sense of this idea. For example, we ll see that we can compute relative frequencies for a nominal random variable (eg. Hair color: e.g. 7% of the population has red hair ) but we cannot make statements about cumulative relative frequency for a nominal random variable (eg. it would not make sense to say 35% of the population has hair color less than or equal to blonde )

11 PubHtlth 540 Fall Summarizing Page 11 of 54 Chart showing data summarization methods, by data type: All Types Qualitative attribute Quantitative number Type Descriptive Methods Numerical Summaries Nominal Ordinal Discrete Continuous Bar chart Bar chart Pie chart Pie chart Frequency Relative Frequency Frequency Relative Frequency Cumulative Frequency Bar chart Pie chart Dot diagram Scatter plot (2 variables) Stem-Leaf Histogram Box Plot Quantile-Quantile Plot Frequency Relative Frequency Cumulative Frequency means, variances, percentiles - - Dot diagram Scatter plot (2 vars) Stem-Leaf Histogram Box Plot Quantile-Quantile Plot means, variances, percentiles Note This table is an illustration only. It is not intended to be complete.

12 PubHtlth 540 Fall Summarizing Page 12 of Summaries for Qualitative ( attribute ) Example - Consider a study of 25 consecutive patients entering the general medical/surgical intensive care unit at a large urban hospital. For each patient the following data are collected: Variable Label (Variable) Age, years (AGE) Type of Admission (TYPE_ADM): ICU Type (ICU_TYPE): Code 1= Emergency 0= Elective 1= Medical 2= Surgical 3= Cardiac 4= Other Systolic Blood Pressure, mm Hg (SBP) Number of Days Spent in ICU (ICU_LOS) Vital Status at Hospital Discharge (VIT_STAT): 1= Dead 0= Alive The actual data are provided on the following page.

13 PubHtlth 540 Fall Summarizing Page 13 of 54 ID Age Type_Adm ICU_Type SBP ICU_LOS Vit_Stat Qualitative data: Type of Admission (Type_Adm) ICU Type (ICU_Type) Vital Status at Hospital Dicharge (Vit_Stat) Quantitative data: Age, years (Age) Number of days spent in ICU (ICU_LOS) Systolic blood pressure (SBP)

14 PubHtlth 540 Fall Summarizing Page 14 of 54 4a. Frequency Table, Relative Frequency Table A tally of the possible outcomes, together with how often and proportionately often is called a frequency and relative frequency distribution. Appropriate for - nominal, ordinal, count data types. For the variable ICU_Type, the frequency distribution is the following: ICU_Type Frequency ( how often ) Relative Frequency ( proportionately often ) Medical Surgical Cardiac Other TOTAL This summary will be useful in constructing two graphical displays, the bar chart and the pie chart. 4b. Bar Chart On the horizontal are the possible outcomes. Important In a bar chart, along the horizontal axis, the possible outcomes are separated by spaces. (This makes sense the spaces remind us that outcomes inbetween are not possible) On the vertical is plotted either how often - frequency how proportionately often - relative frequency Example - Bar chart for Type of ICU Patients (n=25) N 5 0 Medical Surgical Cardiac Other

15 PubHtlth 540 Fall Summarizing Page 15 of 54 Guidelines for Construction of Bar Charts Label both axes clearly Leave space between bars Leave space between the left-most bar and the vertical axis When possible, begin the vertical axis at 0 All bars should be the same width There s no reason why the bar chart can t be plotted horizontally instead of vertically. And, not surprisingly, if you change the choice of scale, you can communicate to the eye a very different message. Example, continued - Consider two choices of scale for the vertical axis in the bar chart for ICU_Type: N 5 0 Medical Surgical Cardiac Other N 10 0 Medical Surgical Cardiac Other

16 PubHtlth 540 Fall Summarizing Page 16 of 54 It s difficult to know how to construct a bar chart when the frequencies are very high. Example, continued - Suppose the frequencies were 1012 medical, 1006 surgical, 1005 cardiac and 1002 other type. Is it better to have a vertical axis start at zero or to break the axis? Vertical Axis Starts at Zero Vertical Axis has Break

17 PubHtlth 540 Fall Summarizing Page 17 of 54 4c. Pie Chart Instead of bars rising up from the horizontal (as in a bar chart), we could plot instead the shares of a pie. Recalling that a circle has 360 degrees, that 50% means 180 degrees, 25% means 90 degrees, etc, we can identify wedges according to relative frequency Relative Frequency Size of Wedge, in degrees % of 360 = 180 degrees % of 360 = 90 degrees p (p) x (100%) x 360 degrees Example, continued Medical Surgical Cardiac Other N Tip - Avoid pie charts! It is harder for the eye to compare pie slices than to compare bars. Also, the eye has to move around the circle, thus violating the rule of simplicity.

18 PubHtlth 540 Fall Summarizing Page 18 of Graphical Summaries for Quantitative Summaries for quantitative data address many things. Among the most important are: - What is typical (location) - What is the scatter (dispersion) Another important aspect of quantitative data, however, is the shape of the distribution of values. The following are 3 scenarios for the patterns of values (eg values of age) in a simple random sample (eg a simple random sample of cholesterol values where the sample size is n=25) The 3 patterns are quite different. The leftmost pattern is symmetric and bell shaped. The middle pattern has a tail to the right. The rightmost pattern is comprised of two symmetric bell shaped patterns that are separated in their location. Good choices for summarizing location and dispersion are not always the same and depend on the pattern of scatter.

19 PubHtlth 540 Fall Summarizing Page 19 of 54 5a. Histogram For a continuous variable (e.g. age), the frequency distribution of the individual ages is not so interesting. Age Frequency We see more in frequencies of age values in groupings. Here, 10 year groupings make sense. Age Interval Frequency TOTAL 25

20 PubHtlth 540 Fall Summarizing Page 20 of 54 A plot of this grouped frequency table gives us a better feel for the pattern of ages with respect to both location and scatter. This plot is called a histogram A histogram is a graphical summary of the pattern of values of a continuous random variable. More formally, it is a graphical summary of the frequency distribution It is analogous to the bar graph summary for the distribution of a discrete random variable Stata command:.histogram age, width(10) start(10) frequency addlabels xlabel(10 (10) 90) ylabel(0 (1) 10) title( Histogram of Age, years ) subtitle( ICU, n=25 )

21 PubHtlth 540 Fall Summarizing Page 21 of 54 How to Construct a Histogram Step 1: Choose the number of groupings ( class intervals ). Call this k. The choice is arbitrary. K too small over-summarizes. K too large under-summarizes. Sometimes, the choices of intervals are straightforward eg 10 year intervals, 7 day intervals, 30 day intervals. Some text books provide formulae for k. Use these if you like, provided the resulting k makes sense. Example For the age data, we use k=8 so that intervals are sensible 10 year spans. Step 2: Rules for interval beginning and end values ( boundaries ) Boundaries should be such that each observation has exactly one home. Equal widths are not necessary. WARNING If you choose to plot intervals that are of Unequal widths, take care to plot area proportional to relative frequency. This is explained in Step 3. Step 3: In a histogram, always plot Area Proportional to Relative Frequency Example Suppose the first two age intervals are combined: Age Interval Frequency TOTAL 25

22 PubHtlth 540 Fall Summarizing Page 22 of 54 For the intervals 30-39, 40-49, 50-59, 60-69, 70-79, 80-89: the widths are all the same and span 10 units of age. Heights plotted are 3, 0, 6, 1, 9, and 2. The new, combined, spans 20 units of age. A frequency of 4 over 20 units of age corresponds to a frequency of 2 over 10 units of age. For the interval 10-29, a height of 2 would be plotted. Histograms with bins of varying width are possible in Stata but require some additional coding that is beyond the scope of this course.

23 PubHtlth 540 Fall Summarizing Page 23 of 54 5b. Frequency Polygon The frequency polygon is an alternative to the histogram. It s not used often. Its companion, the cumulative frequency polygon (next page), is more commonly used. Both the histogram and frequency polygon are graphical summaries of the frequency distribution of a continuous random variable Whereas in a histogram X-axis shows intervals of values Y-axis shows bars of frequencies In a frequency Polygon: X-axis shows midpoints of intervals of values Y-axis shows dot instead of bars Some guidelines: i. The graph title should be a complete description of the graph ii. Clearly label both the horizontal and vertical axes iii. Break axes when necessary iv. Use equal class widths v. Be neat and accurate Example - The following frequency polygon is similar in interpretation to a histogram. Source:

24 PubHtlth 540 Fall Summarizing Page 24 of 54 5c. Cumulative Frequency Polygon As you might think, a cumulative frequency polygon depicts accumulating data. This is actually useful in that it lets us see percentiles (eg, median) Its plot utilizes cumulative frequency information (right hand columns below in blue) Cumulative through Interval Age Interval Frequency (count) Relative Frequency (%) Frequency (count) Relative Frequency (%) TOTAL Notice that it is the ENDPOINT of the interval that is plotted on the horizontal. This makes sense inasmuch as we are keeping track of the ACCUMULATION of frequencies to the end of the interval.

25 PubHtlth 540 Fall Summarizing Page 25 of 54 5d. Percentiles (Quantiles) Percentiles are one way to summarize the range and shape of values in a distribution. Percentile values communicate various cut-points. For example: Suppose that 50% of a cohort survived at least 4 years. This also means that 50% survived at most 4 years. We say 4 years is the median. The median is also called the 50 th percentile, or the 50 th quantile. We write P 50 = 4 years. Similarly we could speak of other percentiles: P 25 : 25% of the sample values are less than or equal to this value. P 75 : 75% of the sample values are less than or equal to this value. P 0 : The minimum. P 100 : The maximum. It is possible to estimate the values of percentiles from a cumulative frequency polygon. Example Consider P 10 = 18. It is interpreted as follows: 10% of the sample is age < 18 or The 10 th percentile of age in this sample is 18 years.

26 PubHtlth 540 Fall Summarizing Page 26 of 54 How to Determine the Values of Q1, Q2, Q3 the 25 th, 50 th, and 75 th Percentiles in a Set Often, it is the quartiles we re after. An easy solution for these is the following. Obtain the median of the entire sample. Then obtain the medians of each of the lower and upper halves of the distribution. Step 1 - Preliminary: Arrange the observations in your sample in order, from smallest to largest, with the smallest observation at the left. Step 2 Obtain median of entire sample: Solve first for the value of Q2 = 50 th percentile ( median): Q2 = 50 th Percentile ( median ) Q2 = Size is ODD n+1 2 th ordered observation Size is EVEN n Q2 = average 2, n 2 +1 st ordered observation Step 3 Q1 is the median of the lower half of the sample: To obtain the value of Q1 = 25 th percentile, solve for the median of the lower 50% of the sample. Step 4 Q3 is the median of the upper half of the sample: To obtain the value of Q3 = 75 th percentile, solve for the median of the upper 50% of the sample: Example Consider the following sample of n=7 data values Solution for Q2 Q2 = 50 th Percentile = th = 4 th ordered observation = 3.43 Solution for Q1 The lower 50% of the sample is thus, the following Q1 = 25 th Percentile = average 4 2, st = average(2.06, 2.36) = 2.21 Solution for Q3 The upper 50% of the sample is the following Q3 = 75 th Percentile = average st 4 2, = average(3.74, 3.78) = 3.76

27 PubHtlth 540 Fall Summarizing Page 27 of 54 How to determine the values of other Percentiles in a Sett Important Note Unfortunately, there exist multiple formulae for doing this calculation. Thus, there is no single correct method Consider the following sample of n=40 data values Step 1: Order the data from smallest to largest Step 2: Compute L = n p 100 where n = size of sample (eg; n=40 here) p = desired percentile (eg p=25 th ) L is NOT a whole number Step 3: Change L to next whole number. Pth percentile = Lth ordered value in the data set. L is a whole number Step 3: Pth percentile = average of the Lth and (L+1)st ordered value in the data set.

28 PubHtlth 540 Fall Summarizing Page 28 of 54 5e. Five Number Summary A five number summary of a set of data is, simply, a particular set of five percentiles: P 0 : The minimum value. P 25 : 25% of the sample values are less than or equal to this value. P 50 : The median. 50% of the sample values are less than or equal to this value. P 75 : 75% of the sample values are less than or equal to this value. P 100 : The maximum. Why bother? This choice of five percentiles is actually a good summary, since: The minimum and maximum identify the extremes of the distribution, and The 1 st and 3 rd quartiles identify the middle half of the data, and Altogether, the five percentiles are the values that define the quartiles of the distribution, and Within each interval defined by quartile values, there are an equal number of observations. Example, continued We re just about done since on page 26, the solution for P 25, P 50, and P 75 was shown. Here is the data again Thus, P 0 = the minimum value = 1.47 P 25 = 1 st quartile = 25 th percentile = 2.21 P 50 = 2 nd quartile = 50 th percentile (median) = 3.43 P 75 = 3 rd quartile = 75 th percentile = 3.76 P 100 = the maximum value = 3.94

29 PubHtlth 540 Fall Summarizing Page 29 of 54 5f. Interquartile Range (IQR) The interquartile range is simply the difference between the 1 st and 3 rd quartiles: IQR = Interquartile Range = [ P 75 -P 25 ] The IQR is a useful summary also: It is an alternative summary of dispersion (sometimes used instead of standard deviation) The range represented by the IQR tells you the spread of the middle 50% of the sample values Example, continued Here is the data again P 0 = the minimum value = 1.47 P 25 = 1 st quartile = 25 th percentile = 2.21 P 50 = 2 nd quartile = 50 th percentile (median) = 3.43 P 75 = 3 rd quartile = 75 th percentile = 3.76 P 100 = the maximum value = 3.94 IQR = Interquartile Range = [ P 75 -P 25 ] = [ ] = 1.55

30 PubHtlth 540 Fall Summarizing Page 30 of 54 5g. Quantile-Quantile (QQ) and Percentile-Percentile (PP) Plots We might ask: Is the distribution of my data normal? QQ and PP plots are useful when we want to compare the percentiles of our data with the percentiles of some reference distribution (eg- reference is normal) QQ Plot: X= quantile in sample Y=quantile in reference Desired Quantile 10th 20th 99 th X = Value in of Y = Value in Reference Distribution PP Plot: X= percentile rank in sample Y= percentile rank in reference value ## ## ## X = Rank (percentile) in of Y = Rank (percentile) in Reference Distribution What to look for: A straight line suggests that the two distributions are the same Example Normal Quantile-Quantile Plot of Age for data on page 13 Stata command:. qnorm age, title( Normal Q-Q Plot of Age ) subtitle( n=25 ) caption( mean=56.92 sd= )

31 PubHtlth 540 Fall Summarizing Page 31 of 54 5h. Stem and Leaf Diagram A stem and leaf diagram is a quick and dirty histogram. It is also a quick and easy way to sort data. Each actual data point is de-constructed into a stem and a leaf. There are a variety of ways to do this, depending on the sample size and range of values.! For example, the data value 84 might be de-constructed as follows 8 4 Stem = 8 Leaf = 4 The value of the 10 s place is 8 The value of the unit s place is 4 Example - Stem and Leaf Plot of Age of 25 ICU Patients: The stems are to the left of the vertical. In this example, each value of stem represents a multiple of 10. The leaves are to the right of the vertical. 15 is stem=1 + leaf=5 65 is stem=6 + leaf=5 85 is stem=8 + leaf=5 Key - Among other things, we see that the minimum age in this sample is 15 years (Note the unit=5 in the row for stem=1) and the maximum age is 85 years (Unit=5 in the row for stem=8). We can also see that most of the subjects in this sample are in their seventies (Actual ages are 71, 74, 75, 75, 76, 76, 77, 78, 79).

32 PubHtlth 540 Fall Summarizing Page 32 of 54 5i. Box and Whisker Plot The box and whisker plot (also called box plot) is a wonderful schematic summary of the distribution of values in a data set. It shows you a number of features, including: extremes, 25 th and 75 th percentile, median and sometimes the mean. Side-by-side box and whisker plots are a wonderful way to compare multiple distributions. Definition The central box spans the interquartile range and has P 25 and P 75 for its limits. It spans the middle half of the data. The line within the box identifies the median, P 50. Sometimes, an asterisk within the box is shown. It is the mean. The lines coming out of the box are called the whiskers. The ends of these whiskers are called fences. Upper fence = The largest value that is still less than or equal to P *(P 75 P 25 ) = P *(IQR). Lower fence = The smallest value that is still greater than or equal to P *(P 75 P 25 ) = P *(IQR). Upper fence P75 Value identifying upper 25% of data values P50 Value identifying upper (lower) 50% of values MEDIAN P25 Value identifying lower 25% of values Lower fence Note on the multiplier 1.5 This is a convenient multiplier if we are interested in comparing the distribution of our sample values to a normal (Gaussian) distribution in the following way. If the data are from a normal distribution, then 95% of the data values will fall within the range defined by the lower and upper fences.

33 PubHtlth 540 Fall Summarizing Page 33 of 54 Example, continued - Box plot summary of the distribution of the 25 ages age 85 years = largest < P *(P 75 P 25 ) P 75 = 76 years Median, P 50 = 53 years P 25 = 39 years 15 years = smallest > P *(P 75 P 25 ) Note: min=15, P 25 =39, P 50 =53, P 75 =76, max=85, and 1.5*(P 75 -P 25 )=55.5 Example The following is an example of side-by-side box and whisker plots. They are plots of values of duration of military service for 5 groups defined by age. 362 T A F MS Duration of Military Service, Months Quintiles of Age! Duration of military service increases with age.! With age, the variability in duration of military service is greater.! The individual circles represent extreme values.

34 PubHtlth 540 Fall Summarizing Page 34 of The Summation Notation The summation notation is nothing more than a secretarial convenience. We use it to avoid having to write out long expressions. For example, Instead of writing the sum x + x + x + x + x, We write 5 x i i= Another example Instead of writing out the product of five terms x * x * x * x * x, We write 5 i= 1 x i This is actually an example of the product notation The summation notation The Greek symbol sigma says add up some items Below the sigma symbol is the starting point STARTING HERE END Up top is the ending point Example The first 5 values of age on page 13. Using summation notation, what is the sum of the 2 nd, 3 rd, and 4 th values? x=15 x=31 x=75 x=52 x=84 4 i= i x = x + x + x = = 158

35 PubHtlth 540 Fall Summarizing Page 35 of Numerical Summaries for Quantitative - Central Tendency Previously we noted that among the most important tools of description are that address - What is typical (location) - What is the scatter (dispersion) Recall - Good choices for summarizing location and dispersion are not always the same and depend on the pattern of scatter.

36 PubHtlth 540 Fall Summarizing Page 36 of 54 Mode. The mode is the most frequently occurring value. It is not influenced by extreme values. Often, it is not a good summary of the majority of the data. Mean. The mean is the arithmetic average of the values. It is sensitive to extreme values. Mean = sum of values = Σ (values) sample size n Median. The median is the middle value when the sample size is odd. For samples of even sample size, it is the average of the two middle values. It is not influenced by extreme values. We consider each one in a bit more detail

37 PubHtlth 540 Fall Summarizing Page 37 of 54 7a. Mode Mode. The mode is the most frequently occurring value. It is not influenced by extreme values. Often, it is not a good summary of the majority of the data. Example are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6 Mode is 4 Example are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8 There are two modes value 2 and value 5 This distribution is said to be bi-modal Modal Class For grouped data, it may be possible to speak of a modal class The modal class is the class with the largest frequency Example set of n=80 values of age (years) Interval/Class of Values (age, years ) Frequency, f (# times) The modal class is the interval of values years of age, because values in this range occurred the most often (25 times) in our data set.

38 PubHtlth 540 Fall Summarizing Page 38 of 54 7b. Mean Mean. The mean is the arithmetic average of the values. It is sensitive to extreme values. Mean = sum of values = Σ (values) sample size n Calculation of a mean or average is familiar; e.g. - grade point average mean annual rainfall average weight of a catch of fish average family size for a region A closer look using summation notation introduced on page 34) Suppose data are: 90, 80, 95, 85, sample mean = = = sample size, n = 5 x = 90, x = 80, x = 95, x = 85, x = X = sample mean X = 5 n x x + x + x + x + x = 5 i i= = = 83 5

39 PubHtlth 540 Fall Summarizing Page 39 of 54 7c. The Mean as a Balancing Point and Introduction to Skewness The mean can be thought of as a balancing point, center of gravity scale 83 By balance, it is meant that the sum of the departures from the mean to the left balance out the sum of the departures from the mean to the right. In this example, sample mean X = 83 TIP!! Often, the value of the sample mean is not one that is actually observed Skewness When the data are skewed, the mean is dragged in the direction of the skewness Negative Skewness (Left tail) Positive Skewness (Right tail) Mean is dragged left Mean is dragged right

40 PubHtlth 540 Fall Summarizing Page 40 of 54 7d. The Mean of Grouped Sometimes, data values occur multiple times and it is more convenient to group the data than to list the multiple occurrence of like values individually. The calculation of the sample mean in the setting of grouped data is an extension of the formula for the mean that you have already learned. Each unique data value is multiplied by the frequency with which it occurs in the sample. Example Value of variable X = Frequency in sample is = X 1 = 96 f1 = 20 X 2 = 84 f2 = 20 X 3 = 65 f3 = 20 X 4 = 73 f4 = 10 X 5 = 94 f5 = 30 Grouped mean = ( data value)( frequency of data value) ( frequencies) = n i=1 (f )(X ) i (f ) i i = ( 20)( 96) + ( 20)(84) + ( 20)( 65) + ( 10)( 73) + ( 30)( 94) ( 20) + ( 20) + ( 20) + ( 10) + ( 30) = Tip The use of the weighted mean is often used to estimate the mean in a sample of data that have been summarized in a frequency table. The values used are the interval midpoints. The weights used are the interval frequencies.

41 PubHtlth 540 Fall Summarizing Page 41 of 54 7e. The Median Median. The median is the middle value when the sample size is odd. For samples of even sample size, it is the average of the two middle values. It is not influenced by extreme values. Recall: If the sample size n is ODD median = n +1 th largest value 2 If the sample size n is EVEN median n n + 2 = average of, values 2 th 2 th Example, from smallest to largest, are: 1, 1, 2, 3, 7, 8, 11, 12, 14, 19, 20 The sample size, n=11 n+1 12 Median is the th largest = 6th largest value 2 2 = Thus, reading from left (smallest) to right (largest), the median value is = 8 1, 1, 2, 3, 7, 8, 11, 12, 14, 19, 20 Five values are smaller than 8; five values are larger.

42 PubHtlth 540 Fall Summarizing Page 42 of 54 Example, from smallest to largest, are: 2, 5, 5, 6, 7, 10, 15, 21, 22, 23, 23, 25 The sample size, n=12 Median = average of n n + 2 th largest, = average of 6th and 7th largest values 2 2 Thus, median value is the average of the 6 th value=10 and the 7 th value = 15, which is equal to 12.5 Skewed When the data are skewed the median is a better description of the majority than the mean Example are: 14, 89, 93, 95, 96 Skewness is reflected in the outlying low value of 14 The sample mean is 77.4 The median is 93 Negative Skewness (Left tail) MEAN < Median Positive Skewness (Right tail) MEAN > Median Mean is dragged left MEDIAN MEDIAN Mean is dragged right

43 PubHtlth 540 Fall Summarizing Page 43 of Numerical Summaries for Continuous - Dispersion There are choices for describing dispersion, too. As before, a good choice will depend on the shape of the distribution.

44 PubHtlth 540 Fall Summarizing Page 44 of 54 8a. Variance Two quick reminders: (1) a parameter is a numerical fact about a population (eg the average age of every citizen in the United States population); (2) a statistic is a number calculated from a sample (eg the average age of a random sample of 50 citizens). Population Mean, µ. One example of a parameter is the population mean. It is written as µ and, for a finite sized population, it is the average of all the data values for a variable, taken over all the members of the population. Population Variance, σ 2. The population variance is also a parameter. It is written as σ 2 and is a summary measure of the squares of individual departures from the mean in a population. If we re lucky and we re dealing with a population that is finite in size (yes, it s theoretically possible to have a population of infinite size more on this later) and of size N, there exists a formula for population variance. This formula makes use of the mean of the population which is represented as µ. σ = N 2 i=1 ( X-µ ) How to interpret the population variance: It is the average of the individual squared deviations from the mean. Think of it as answering the question Typically, how scattered are the individual data points? i N 2 variance, s 2 A sample variance is a statistic; thus, it is a number calculated from the data in a sample. The sample variance is written as S 2 and is a summary measure of the squares of individual departures from the sample mean in a sample. For a simple random sample of size n (recall we use the notation N when we speak of the size of a finite population and we use the notation n when we speak of the size of a sample) S = 2 i=1 n ( X i - X) (n-1) Notice that the formula for the sample variance is very similar to the formula for a finite population variance (1) N is replaced by n (2) µ is replaced by X and (3) the divisor N is replaced by the divisor (n-1). This suggests that, provided the sample is a representative one, the sample variance might be a good guess (estimate) of the population variance. 2

45 PubHtlth 540 Fall Summarizing Page 45 of 54 8b. Standard Deviation Standard Deviation, s. The population standard deviation (σ) and a sample standard deviation (S or SD) are the square roots of σ 2 and S 2. As such, they are additional choices for summarizing variability. The advantage of the square root operation is that the resulting summary has the same scale as the original values. Standard Deviation (S or SD) = ( X X) 2 n 1 Disparity between individual and average Disparity between individual and average The average of these ( X X ) ( X X ) 2 ( X X ) n 2 The sample variance S 2 is an almost average The related measure S (or SD) returns measure of dispersion to original scale of observation ( X X ) n 1 2 ( X X ) n 1 2

46 PubHtlth 540 Fall Summarizing Page 46 of 54 Example of Variance (S 2 ) and Standard Deviation (S) Calculation Consider the following sample of survival times (X) of n=11 patients after heart transplant surgery. Interest is to calculate the sample variance and standard deviation. Patients are identified numerically, from 1 to 11. The survival time for the ith patient is represented as X i for i= 1,, 11. Patient Identifier, i Survival (days), X i Mean for sample, X Deviation, X i X ( ) Squared deviation ( X i X) TOTAL X i = 1771 i= 1 days mean is X = = 161 days variance is s 2 = 11 i=1 ( X i X) n-1 2 = = days 2 standard deviation is s = s 2 = = days

47 PubHtlth 540 Fall Summarizing Page 47 of 54 8c. Median Absolute Deviation About the Median (MADM) Median Absolute Deviation about the Median (MADM) - Another measure of variability is helpful when we wish to describe scatter among data that is skewed. Recall that the median is a good measure of location for skewed data because it is not sensitive to extreme values. Distances are measured about the median, not the mean. We compute deviations rather than squared differences. Thus Median Absolute Deviation about the Median (MADM) MADM = median of [ X i - median of {X 1,...,X n} ] Example. Original data: { 0.7, 1.6, 2.2, 3.2, 9.8 } Median = 2.2 X i X i median MADM = median { 0.0, 0.6, 1.0, 1.5, 7.6 } = 1.0

48 PubHtlth 540 Fall Summarizing Page 48 of 54 8d. Standard Deviation (S or SD) versus Standard Error (SE) Tip The standard deviation (s or sd) and the standard error (se or sem) are often confused. The standard deviation (SD or S) addresses questions about variability of individuals in nature (imagine a collection of individuals), whereas The standard error (SE) addresses questions about the variability of a summary statistic among many replications of your study (imagine a collection of values of a sample statistic such as the sample mean that is obtained by repeating your whole study over and over again) The distinction has to do with the idea of sampling distributions which are introduced on page 48 and which are re-introduced several times throughout this course. Consider the following illustration of the idea. Example Suppose you conduct a study that involves obtaining a simple random sample of size n=11. Suppose further that, from this one sample, you calculate the sample mean (note you might have calculated other sample statistics, too, such as the median or sample variance). Now imagine replicating the entire study 5000 times. You would then have 5,000 sample means, each based on a sample of size n =11. If instead of replicating your study 5000 times, the study were replicated infinitely many times, the resulting collection of infinitely many sample means has a name: the sampling distribution of n=11. Notice the subscript n=11. This is a reminder to us that the particular study design that we have replicated infinitely many times calls for drawing a sample of size n=11 each time. X So what? Why do we care? We care because, often, we re interested in knowing if the results of our one study conduct are similar to what would be obtained if someone else were to repeat it!!

49 PubHtlth 540 Fall Summarizing Page 49 of 54 Distinction between Standard Deviation (s or sd) and Standard Error of the Mean (se or sem). We re often interested in the (theoretical) behavior of the sample mean to the next. X n from one replication of our study So, whereas, the typical variability among individual values can be described using the standard deviation (SD). The typical variability of the sample mean from one replication of the study is described using the standard error (SE) of the mean: SE( X) = SD n Note A limitation of the SE is that it is a function of both the natural variation (SD in the numerator) and the study design (n in the denominator). More on this later!

50 PubHtlth 540 Fall Summarizing Page 50 of 54 Example, continued Previously, we summarized the results of one study that enrolled n=11 patients after heart transplant surgery. For that one study, we obtained an average survival time of X =161 days. What happens if we repeat the study? What will our next X be? Will it be close? How different will it be? We care about this question because it pertains to the generalizability of our study findings. The behavior of X from one replication of the study to the next replication of the study is referred to as the sampling distribution of X. (We could just as well have asked about the behavior of the median from one replication to the next (sampling distribution of the median) or the behavior of the SD from one replication to the next (sampling distribution of SD).) Thus, interest is in a measure of the noise that accompanies X =161 days. The measure we use is the standard error measure. This is denoted SE. For this example, in the heart transplant study SE( X) = SD n = = 50.9 We interpret this to mean that a similarly conducted study might produce an average survival time that is near 161 days, give or take 50.9 days.

51 PubHtlth 540 Fall Summarizing Page 51 of 54 8e. A Feel for Sampling Distributions This is a schematic of a population in nature. It is comprised of the universe of all possible individual values. Population mean = µ Population variance = 2 σ sample sample. sample This is a schematic representing the theoretical replication of study where each study replication (red arrow) involves a new sample of size n and the calculation of a new sample mean X n X n X n X n Collecting together all the X n

52 PubHtlth 540 Fall Summarizing Page 52 of 54 This is a schematic of the resulting sampling distribution of X n. It is the collection of all possible sample means. Sampling distribution has mean = µ X n Sampling distribution has 2 variance = σ Xn Standard Deviation Describes variation in values of individuals. Standard Error Describes variation in values of a statistic from one conduct of study to the next. Often, it is the variation in the sample mean that interests us. In the population of individuals: σ Our guess is S In the population of all possible sample means ( sampling distribution of mean ): σ n Our guess of the SE of the sample mean is S n

53 PubHtlth 540 Fall Summarizing Page 53 of 54 8f. The Coefficient of Variation The coefficient of variation is the ratio of the standard deviation to the mean of a distribution. It is a measure of the spread of the distribution relative to the mean of the distribution In the population, coefficient of variation is denoted ξ and is defined ξ = σ µ The coefficient of variation ξ can be estimated from a sample. Using the hat notation to indicate guess. It is also denoted CV cv =!ξ = S X Example Cholesterol is more variable than systolic blood pressure S X cv =!ξ = s x Systolic Blood Pressure 15 mm 130 mm.115 Cholesterol 40 mg/dl 200 mg/dl.200 Example Diastolic is relatively more variable than systolic blood pressure S X cv =!ξ = s x Systolic Blood Pressure 15 mm 130 mm.115 Diastolic Blood Pressure 8 mm 60 mm.133

54 PubHtlth 540 Fall Summarizing Page 54 of 54 8g. The Range The range is the difference between the largest and smallest values in a data set. It is a quick measure of scatter but not a very good one. Calculation utilizes only two of the available observations. As n increases, the range can only increase. Thus, the range is sensitive to sample size. The range is an unstable measure of scatter compared to alternative summaries of scatter (e.g. S or MADM) HOWEVER when the sample size is very small, it may be a better measure of scatter than the standard deviation S. Example values are 5, 9, 12, 16, 23, 34, 37, 42 range = 42-5=37

Unit 1 Summarizing Data

Unit 1 Summarizing Data PubHtlth 540 Fall 2013 1. Summarizing Page 1 of 51 Unit 1 Summarizing It is difficult to understand why statisticians commonly limit their enquiries to averages, and do not revel in more comprehensive

More information

Unit 1 Summarizing Data

Unit 1 Summarizing Data PubHtlth 540 Fall 2011 1. Summarizing Data Page 1 of 51 Unit 1 Summarizing Data It is difficult to understand why statisticians commonly limit their enquiries to averages, and do not revel in more comprehensive

More information

Unit 1 Summarizing Data

Unit 1 Summarizing Data BIOSTATS 540 Fall 2016 1. Summarizing Page 1 of 42 Unit 1 Summarizing It is difficult to understand why statisticians commonly limit their enquiries to averages, and do not revel in more comprehensive

More information

Unit 1 Summarizing Data

Unit 1 Summarizing Data BIOSTATS 540 Fall 2018 1. Summarizing Page 1 of 47 Unit 1 Summarizing It is difficult to understand why statisticians commonly limit their enquiries to averages, and do not revel in more comprehensive

More information

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved.

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved. 1-1 Chapter 1 Sampling and Descriptive Statistics 1-2 Why Statistics? Deal with uncertainty in repeated scientific measurements Draw conclusions from data Design valid experiments and draw reliable conclusions

More information

STAT 200 Chapter 1 Looking at Data - Distributions

STAT 200 Chapter 1 Looking at Data - Distributions STAT 200 Chapter 1 Looking at Data - Distributions What is Statistics? Statistics is a science that involves the design of studies, data collection, summarizing and analyzing the data, interpreting the

More information

Introduction to Statistics

Introduction to Statistics Introduction to Statistics Data and Statistics Data consists of information coming from observations, counts, measurements, or responses. Statistics is the science of collecting, organizing, analyzing,

More information

UNIVERSITY OF MASSACHUSETTS Department of Biostatistics and Epidemiology BioEpi 540W - Introduction to Biostatistics Fall 2004

UNIVERSITY OF MASSACHUSETTS Department of Biostatistics and Epidemiology BioEpi 540W - Introduction to Biostatistics Fall 2004 UNIVERSITY OF MASSACHUSETTS Department of Biostatistics and Epidemiology BioEpi 50W - Introduction to Biostatistics Fall 00 Exercises with Solutions Topic Summarizing Data Due: Monday September 7, 00 READINGS.

More information

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty.

What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty. What is Statistics? Statistics is the science of understanding data and of making decisions in the face of variability and uncertainty. Statistics is a field of study concerned with the data collection,

More information

Chapter 2: Tools for Exploring Univariate Data

Chapter 2: Tools for Exploring Univariate Data Stats 11 (Fall 2004) Lecture Note Introduction to Statistical Methods for Business and Economics Instructor: Hongquan Xu Chapter 2: Tools for Exploring Univariate Data Section 2.1: Introduction What is

More information

University of Jordan Fall 2009/2010 Department of Mathematics

University of Jordan Fall 2009/2010 Department of Mathematics handouts Part 1 (Chapter 1 - Chapter 5) University of Jordan Fall 009/010 Department of Mathematics Chapter 1 Introduction to Introduction; Some Basic Concepts Statistics is a science related to making

More information

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that?

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that? Tastitsticsss? What s that? Statistics describes random mass phanomenons. Principles of Biostatistics and Informatics nd Lecture: Descriptive Statistics 3 th September Dániel VERES Data Collecting (Sampling)

More information

1. Exploratory Data Analysis

1. Exploratory Data Analysis 1. Exploratory Data Analysis 1.1 Methods of Displaying Data A visual display aids understanding and can highlight features which may be worth exploring more formally. Displays should have impact and be

More information

Descriptive statistics

Descriptive statistics Patrick Breheny February 6 Patrick Breheny to Biostatistics (171:161) 1/25 Tables and figures Human beings are not good at sifting through large streams of data; we understand data much better when it

More information

Units. Exploratory Data Analysis. Variables. Student Data

Units. Exploratory Data Analysis. Variables. Student Data Units Exploratory Data Analysis Bret Larget Departments of Botany and of Statistics University of Wisconsin Madison Statistics 371 13th September 2005 A unit is an object that can be measured, such as

More information

PubHlth 540 Fall Summarizing Data Page 1 of 18. Unit 1 - Summarizing Data Practice Problems. Solutions

PubHlth 540 Fall Summarizing Data Page 1 of 18. Unit 1 - Summarizing Data Practice Problems. Solutions PubHlth 50 Fall 0. Summarizing Data Page of 8 Unit - Summarizing Data Practice Problems Solutions #. a. Qualitative - ordinal b. Qualitative - nominal c. Quantitative continuous, ratio d. Qualitative -

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics CHAPTER OUTLINE 6-1 Numerical Summaries of Data 6- Stem-and-Leaf Diagrams 6-3 Frequency Distributions and Histograms 6-4 Box Plots 6-5 Time Sequence Plots 6-6 Probability Plots Chapter

More information

Statistics I Chapter 2: Univariate data analysis

Statistics I Chapter 2: Univariate data analysis Statistics I Chapter 2: Univariate data analysis Chapter 2: Univariate data analysis Contents Graphical displays for categorical data (barchart, piechart) Graphical displays for numerical data data (histogram,

More information

Chapter 3. Data Description

Chapter 3. Data Description Chapter 3. Data Description Graphical Methods Pie chart It is used to display the percentage of the total number of measurements falling into each of the categories of the variable by partition a circle.

More information

A is one of the categories into which qualitative data can be classified.

A is one of the categories into which qualitative data can be classified. Chapter 2 Methods for Describing Sets of Data 2.1 Describing qualitative data Recall qualitative data: non-numerical or categorical data Basic definitions: A is one of the categories into which qualitative

More information

MATH 10 INTRODUCTORY STATISTICS

MATH 10 INTRODUCTORY STATISTICS MATH 10 INTRODUCTORY STATISTICS Tommy Khoo Your friendly neighbourhood graduate student. Week 1 Chapter 1 Introduction What is Statistics? Why do you need to know Statistics? Technical lingo and concepts:

More information

Statistics I Chapter 2: Univariate data analysis

Statistics I Chapter 2: Univariate data analysis Statistics I Chapter 2: Univariate data analysis Chapter 2: Univariate data analysis Contents Graphical displays for categorical data (barchart, piechart) Graphical displays for numerical data data (histogram,

More information

Measures of Central Tendency and their dispersion and applications. Acknowledgement: Dr Muslima Ejaz

Measures of Central Tendency and their dispersion and applications. Acknowledgement: Dr Muslima Ejaz Measures of Central Tendency and their dispersion and applications Acknowledgement: Dr Muslima Ejaz LEARNING OBJECTIVES: Compute and distinguish between the uses of measures of central tendency: mean,

More information

Chapter 4. Displaying and Summarizing. Quantitative Data

Chapter 4. Displaying and Summarizing. Quantitative Data STAT 141 Introduction to Statistics Chapter 4 Displaying and Summarizing Quantitative Data Bin Zou (bzou@ualberta.ca) STAT 141 University of Alberta Winter 2015 1 / 31 4.1 Histograms 1 We divide the range

More information

Describing distributions with numbers

Describing distributions with numbers Describing distributions with numbers A large number or numerical methods are available for describing quantitative data sets. Most of these methods measure one of two data characteristics: The central

More information

Types of Information. Topic 2 - Descriptive Statistics. Examples. Sample and Sample Size. Background Reading. Variables classified as STAT 511

Types of Information. Topic 2 - Descriptive Statistics. Examples. Sample and Sample Size. Background Reading. Variables classified as STAT 511 Topic 2 - Descriptive Statistics STAT 511 Professor Bruce Craig Types of Information Variables classified as Categorical (qualitative) - variable classifies individual into one of several groups or categories

More information

Elementary Statistics

Elementary Statistics Elementary Statistics Q: What is data? Q: What does the data look like? Q: What conclusions can we draw from the data? Q: Where is the middle of the data? Q: Why is the spread of the data important? Q:

More information

Description of Samples and Populations

Description of Samples and Populations Description of Samples and Populations Random Variables Data are generated by some underlying random process or phenomenon. Any datum (data point) represents the outcome of a random variable. We represent

More information

Lecture 1 : Basic Statistical Measures

Lecture 1 : Basic Statistical Measures Lecture 1 : Basic Statistical Measures Jonathan Marchini October 11, 2004 In this lecture we will learn about different types of data encountered in practice different ways of plotting data to explore

More information

Chapter 1. Looking at Data

Chapter 1. Looking at Data Chapter 1 Looking at Data Types of variables Looking at Data Be sure that each variable really does measure what you want it to. A poor choice of variables can lead to misleading conclusions!! For example,

More information

What is statistics? Statistics is the science of: Collecting information. Organizing and summarizing the information collected

What is statistics? Statistics is the science of: Collecting information. Organizing and summarizing the information collected What is statistics? Statistics is the science of: Collecting information Organizing and summarizing the information collected Analyzing the information collected in order to draw conclusions Two types

More information

Descriptive Statistics-I. Dr Mahmoud Alhussami

Descriptive Statistics-I. Dr Mahmoud Alhussami Descriptive Statistics-I Dr Mahmoud Alhussami Biostatistics What is the biostatistics? A branch of applied math. that deals with collecting, organizing and interpreting data using well-defined procedures.

More information

BIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke

BIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke BIOL 51A - Biostatistics 1 1 Lecture 1: Intro to Biostatistics Smoking: hazardous? FEV (l) 1 2 3 4 5 No Yes Smoke BIOL 51A - Biostatistics 1 2 Box Plot a.k.a box-and-whisker diagram or candlestick chart

More information

2 Descriptive Statistics

2 Descriptive Statistics 2 Descriptive Statistics Reading: SW Chapter 2, Sections 1-6 A natural first step towards answering a research question is for the experimenter to design a study or experiment to collect data from the

More information

STT 315 This lecture is based on Chapter 2 of the textbook.

STT 315 This lecture is based on Chapter 2 of the textbook. STT 315 This lecture is based on Chapter 2 of the textbook. Acknowledgement: Author is thankful to Dr. Ashok Sinha, Dr. Jennifer Kaplan and Dr. Parthanil Roy for allowing him to use/edit some of their

More information

Unit Two Descriptive Biostatistics. Dr Mahmoud Alhussami

Unit Two Descriptive Biostatistics. Dr Mahmoud Alhussami Unit Two Descriptive Biostatistics Dr Mahmoud Alhussami Descriptive Biostatistics The best way to work with data is to summarize and organize them. Numbers that have not been summarized and organized are

More information

Describing distributions with numbers

Describing distributions with numbers Describing distributions with numbers A large number or numerical methods are available for describing quantitative data sets. Most of these methods measure one of two data characteristics: The central

More information

CIVL 7012/8012. Collection and Analysis of Information

CIVL 7012/8012. Collection and Analysis of Information CIVL 7012/8012 Collection and Analysis of Information Uncertainty in Engineering Statistics deals with the collection and analysis of data to solve real-world problems. Uncertainty is inherent in all real

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 1- part 1: Describing variation, and graphical presentation Outline Sources of variation Types of variables Fatma Shebl, MD, MS, MPH, PhD Assistant Professor Chronic Disease

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

Lecture 2. Quantitative variables. There are three main graphical methods for describing, summarizing, and detecting patterns in quantitative data:

Lecture 2. Quantitative variables. There are three main graphical methods for describing, summarizing, and detecting patterns in quantitative data: Lecture 2 Quantitative variables There are three main graphical methods for describing, summarizing, and detecting patterns in quantitative data: Stemplot (stem-and-leaf plot) Histogram Dot plot Stemplots

More information

F78SC2 Notes 2 RJRC. If the interest rate is 5%, we substitute x = 0.05 in the formula. This gives

F78SC2 Notes 2 RJRC. If the interest rate is 5%, we substitute x = 0.05 in the formula. This gives F78SC2 Notes 2 RJRC Algebra It is useful to use letters to represent numbers. We can use the rules of arithmetic to manipulate the formula and just substitute in the numbers at the end. Example: 100 invested

More information

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES

STP 420 INTRODUCTION TO APPLIED STATISTICS NOTES INTRODUCTION TO APPLIED STATISTICS NOTES PART - DATA CHAPTER LOOKING AT DATA - DISTRIBUTIONS Individuals objects described by a set of data (people, animals, things) - all the data for one individual make

More information

Chapter2 Description of samples and populations. 2.1 Introduction.

Chapter2 Description of samples and populations. 2.1 Introduction. Chapter2 Description of samples and populations. 2.1 Introduction. Statistics=science of analyzing data. Information collected (data) is gathered in terms of variables (characteristics of a subject that

More information

MATH 1150 Chapter 2 Notation and Terminology

MATH 1150 Chapter 2 Notation and Terminology MATH 1150 Chapter 2 Notation and Terminology Categorical Data The following is a dataset for 30 randomly selected adults in the U.S., showing the values of two categorical variables: whether or not the

More information

Summarizing and Displaying Measurement Data/Understanding and Comparing Distributions

Summarizing and Displaying Measurement Data/Understanding and Comparing Distributions Summarizing and Displaying Measurement Data/Understanding and Comparing Distributions Histograms, Mean, Median, Five-Number Summary and Boxplots, Standard Deviation Thought Questions 1. If you were to

More information

Exploring, summarizing and presenting data. Berghold, IMI, MUG

Exploring, summarizing and presenting data. Berghold, IMI, MUG Exploring, summarizing and presenting data Example Patient Nr Gender Age Weight Height PAVK-Grade W alking Distance Physical Functioning Scale Total Cholesterol Triglycerides 01 m 65 90 185 II b 200 70

More information

Chapter 2: Descriptive Analysis and Presentation of Single- Variable Data

Chapter 2: Descriptive Analysis and Presentation of Single- Variable Data Chapter 2: Descriptive Analysis and Presentation of Single- Variable Data Mean 26.86667 Standard Error 2.816392 Median 25 Mode 20 Standard Deviation 10.90784 Sample Variance 118.981 Kurtosis -0.61717 Skewness

More information

ST Presenting & Summarising Data Descriptive Statistics. Frequency Distribution, Histogram & Bar Chart

ST Presenting & Summarising Data Descriptive Statistics. Frequency Distribution, Histogram & Bar Chart ST2001 2. Presenting & Summarising Data Descriptive Statistics Frequency Distribution, Histogram & Bar Chart Summary of Previous Lecture u A study often involves taking a sample from a population that

More information

TOPIC: Descriptive Statistics Single Variable

TOPIC: Descriptive Statistics Single Variable TOPIC: Descriptive Statistics Single Variable I. Numerical data summary measurements A. Measures of Location. Measures of central tendency Mean; Median; Mode. Quantiles - measures of noncentral tendency

More information

Exercises from Chapter 3, Section 1

Exercises from Chapter 3, Section 1 Exercises from Chapter 3, Section 1 1. Consider the following sample consisting of 20 numbers. (a) Find the mode of the data 21 23 24 24 25 26 29 30 32 34 39 41 41 41 42 43 48 51 53 53 (b) Find the median

More information

Clinical Research Module: Biostatistics

Clinical Research Module: Biostatistics Clinical Research Module: Biostatistics Lecture 1 Alberto Nettel-Aguirre, PhD, PStat These lecture notes based on others developed by Drs. Peter Faris, Sarah Rose Luz Palacios-Derflingher and myself Who

More information

Unit 1 Review of BIOSTATS 540 Practice Problems SOLUTIONS - Stata Users

Unit 1 Review of BIOSTATS 540 Practice Problems SOLUTIONS - Stata Users BIOSTATS 640 Spring 2017 Review of Introductory Biostatistics STATA solutions Page 1 of 16 Unit 1 Review of BIOSTATS 540 Practice Problems SOLUTIONS - Stata Users #1. The following table lists length of

More information

Answer keys for Assignment 10: Measurement of study variables (The correct answer is underlined in bold text)

Answer keys for Assignment 10: Measurement of study variables (The correct answer is underlined in bold text) Answer keys for Assignment 10: Measurement of study variables (The correct answer is underlined in bold text) 1. A quick and easy indicator of dispersion is a. Arithmetic mean b. Variance c. Standard deviation

More information

3.1 Measures of Central Tendency: Mode, Median and Mean. Average a single number that is used to describe the entire sample or population

3.1 Measures of Central Tendency: Mode, Median and Mean. Average a single number that is used to describe the entire sample or population . Measures of Central Tendency: Mode, Median and Mean Average a single number that is used to describe the entire sample or population. Mode a. Easiest to compute, but not too stable i. Changing just one

More information

Statistics and parameters

Statistics and parameters Statistics and parameters Tables, histograms and other charts are used to summarize large amounts of data. Often, an even more extreme summary is desirable. Statistics and parameters are numbers that characterize

More information

P8130: Biostatistical Methods I

P8130: Biostatistical Methods I P8130: Biostatistical Methods I Lecture 2: Descriptive Statistics Cody Chiuzan, PhD Department of Biostatistics Mailman School of Public Health (MSPH) Lecture 1: Recap Intro to Biostatistics Types of Data

More information

are the objects described by a set of data. They may be people, animals or things.

are the objects described by a set of data. They may be people, animals or things. ( c ) E p s t e i n, C a r t e r a n d B o l l i n g e r 2016 C h a p t e r 5 : E x p l o r i n g D a t a : D i s t r i b u t i o n s P a g e 1 CHAPTER 5: EXPLORING DATA DISTRIBUTIONS 5.1 Creating Histograms

More information

Chapter 7: Statistics Describing Data. Chapter 7: Statistics Describing Data 1 / 27

Chapter 7: Statistics Describing Data. Chapter 7: Statistics Describing Data 1 / 27 Chapter 7: Statistics Describing Data Chapter 7: Statistics Describing Data 1 / 27 Categorical Data Four ways to display categorical data: 1 Frequency and Relative Frequency Table 2 Bar graph (Pareto chart)

More information

QUANTITATIVE DATA. UNIVARIATE DATA data for one variable

QUANTITATIVE DATA. UNIVARIATE DATA data for one variable QUANTITATIVE DATA Recall that quantitative (numeric) data values are numbers where data take numerical values for which it is sensible to find averages, such as height, hourly pay, and pulse rates. UNIVARIATE

More information

Chapter 2. Mean and Standard Deviation

Chapter 2. Mean and Standard Deviation Chapter 2. Mean and Standard Deviation The median is known as a measure of location; that is, it tells us where the data are. As stated in, we do not need to know all the exact values to calculate the

More information

Practice problems from chapters 2 and 3

Practice problems from chapters 2 and 3 Practice problems from chapters and 3 Question-1. For each of the following variables, indicate whether it is quantitative or qualitative and specify which of the four levels of measurement (nominal, ordinal,

More information

1.3.1 Measuring Center: The Mean

1.3.1 Measuring Center: The Mean 1.3.1 Measuring Center: The Mean Mean - The arithmetic average. To find the mean (pronounced x bar) of a set of observations, add their values and divide by the number of observations. If the n observations

More information

Chapter Four. Numerical Descriptive Techniques. Range, Standard Deviation, Variance, Coefficient of Variation

Chapter Four. Numerical Descriptive Techniques. Range, Standard Deviation, Variance, Coefficient of Variation Chapter Four Numerical Descriptive Techniques 4.1 Numerical Descriptive Techniques Measures of Central Location Mean, Median, Mode Measures of Variability Range, Standard Deviation, Variance, Coefficient

More information

CHAPTER 1. Introduction

CHAPTER 1. Introduction CHAPTER 1 Introduction Engineers and scientists are constantly exposed to collections of facts, or data. The discipline of statistics provides methods for organizing and summarizing data, and for drawing

More information

Biostatistics Presentation of data DR. AMEER KADHIM HUSSEIN M.B.CH.B.FICMS (COM.)

Biostatistics Presentation of data DR. AMEER KADHIM HUSSEIN M.B.CH.B.FICMS (COM.) Biostatistics Presentation of data DR. AMEER KADHIM HUSSEIN M.B.CH.B.FICMS (COM.) PRESENTATION OF DATA 1. Mathematical presentation (measures of central tendency and measures of dispersion). 2. Tabular

More information

Histograms allow a visual interpretation

Histograms allow a visual interpretation Chapter 4: Displaying and Summarizing i Quantitative Data s allow a visual interpretation of quantitative (numerical) data by indicating the number of data points that lie within a range of values, called

More information

Further Mathematics 2018 CORE: Data analysis Chapter 2 Summarising numerical data

Further Mathematics 2018 CORE: Data analysis Chapter 2 Summarising numerical data Chapter 2: Summarising numerical data Further Mathematics 2018 CORE: Data analysis Chapter 2 Summarising numerical data Extract from Study Design Key knowledge Types of data: categorical (nominal and ordinal)

More information

Chapter 1: Exploring Data

Chapter 1: Exploring Data Chapter 1: Exploring Data Section 1.3 with Numbers The Practice of Statistics, 4 th edition - For AP* STARNES, YATES, MOORE Chapter 1 Exploring Data Introduction: Data Analysis: Making Sense of Data 1.1

More information

Announcements. Lecture 1 - Data and Data Summaries. Data. Numerical Data. all variables. continuous discrete. Homework 1 - Out 1/15, due 1/22

Announcements. Lecture 1 - Data and Data Summaries. Data. Numerical Data. all variables. continuous discrete. Homework 1 - Out 1/15, due 1/22 Announcements Announcements Lecture 1 - Data and Data Summaries Statistics 102 Colin Rundel January 13, 2013 Homework 1 - Out 1/15, due 1/22 Lab 1 - Tomorrow RStudio accounts created this evening Try logging

More information

Stat 101 Exam 1 Important Formulas and Concepts 1

Stat 101 Exam 1 Important Formulas and Concepts 1 1 Chapter 1 1.1 Definitions Stat 101 Exam 1 Important Formulas and Concepts 1 1. Data Any collection of numbers, characters, images, or other items that provide information about something. 2. Categorical/Qualitative

More information

CHAPTER 8 INTRODUCTION TO STATISTICAL ANALYSIS

CHAPTER 8 INTRODUCTION TO STATISTICAL ANALYSIS CHAPTER 8 INTRODUCTION TO STATISTICAL ANALYSIS LEARNING OBJECTIVES: After studying this chapter, a student should understand: notation used in statistics; how to represent variables in a mathematical form

More information

BIOS 2041: Introduction to Statistical Methods

BIOS 2041: Introduction to Statistical Methods BIOS 2041: Introduction to Statistical Methods Abdus S Wahed* *Some of the materials in this chapter has been adapted from Dr. John Wilson s lecture notes for the same course. Chapter 0 2 Chapter 1 Introduction

More information

Performance of fourth-grade students on an agility test

Performance of fourth-grade students on an agility test Starter Ch. 5 2005 #1a CW Ch. 4: Regression L1 L2 87 88 84 86 83 73 81 67 78 83 65 80 50 78 78? 93? 86? Create a scatterplot Find the equation of the regression line Predict the scores Chapter 5: Understanding

More information

MIDTERM EXAMINATION (Spring 2011) STA301- Statistics and Probability

MIDTERM EXAMINATION (Spring 2011) STA301- Statistics and Probability STA301- Statistics and Probability Solved MCQS From Midterm Papers March 19,2012 MC100401285 Moaaz.pk@gmail.com Mc100401285@gmail.com PSMD01 MIDTERM EXAMINATION (Spring 2011) STA301- Statistics and Probability

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Keystone Exams: Algebra

Keystone Exams: Algebra KeystoneExams:Algebra TheKeystoneGlossaryincludestermsanddefinitionsassociatedwiththeKeystoneAssessmentAnchorsand Eligible Content. The terms and definitions included in the glossary are intended to assist

More information

Resistant Measure - A statistic that is not affected very much by extreme observations.

Resistant Measure - A statistic that is not affected very much by extreme observations. Chapter 1.3 Lecture Notes & Examples Section 1.3 Describing Quantitative Data with Numbers (pp. 50-74) 1.3.1 Measuring Center: The Mean Mean - The arithmetic average. To find the mean (pronounced x bar)

More information

Chapter 5. Understanding and Comparing. Distributions

Chapter 5. Understanding and Comparing. Distributions STAT 141 Introduction to Statistics Chapter 5 Understanding and Comparing Distributions Bin Zou (bzou@ualberta.ca) STAT 141 University of Alberta Winter 2015 1 / 27 Boxplots How to create a boxplot? Assume

More information

Review for Exam #1. Chapter 1. The Nature of Data. Definitions. Population. Sample. Quantitative data. Qualitative (attribute) data

Review for Exam #1. Chapter 1. The Nature of Data. Definitions. Population. Sample. Quantitative data. Qualitative (attribute) data Review for Exam #1 1 Chapter 1 Population the complete collection of elements (scores, people, measurements, etc.) to be studied Sample a subcollection of elements drawn from a population 11 The Nature

More information

Describing Distributions with Numbers

Describing Distributions with Numbers Topic 2 We next look at quantitative data. Recall that in this case, these data can be subject to the operations of arithmetic. In particular, we can add or subtract observation values, we can sort them

More information

Lectures of STA 231: Biostatistics

Lectures of STA 231: Biostatistics Lectures of STA 231: Biostatistics Second Semester Academic Year 2016/2017 Text Book Biostatistics: Basic Concepts and Methodology for the Health Sciences (10 th Edition, 2014) By Wayne W. Daniel Prepared

More information

MATH 117 Statistical Methods for Management I Chapter Three

MATH 117 Statistical Methods for Management I Chapter Three Jubail University College MATH 117 Statistical Methods for Management I Chapter Three This chapter covers the following topics: I. Measures of Center Tendency. 1. Mean for Ungrouped Data (Raw Data) 2.

More information

Comparing Measures of Central Tendency *

Comparing Measures of Central Tendency * OpenStax-CNX module: m11011 1 Comparing Measures of Central Tendency * David Lane This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 1.0 1 Comparing Measures

More information

The science of learning from data.

The science of learning from data. STATISTICS (PART 1) The science of learning from data. Numerical facts Collection of methods for planning experiments, obtaining data and organizing, analyzing, interpreting and drawing the conclusions

More information

2/2/2015 GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY MEASURES OF CENTRAL TENDENCY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS

2/2/2015 GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY MEASURES OF CENTRAL TENDENCY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS Spring 2015: Lembo GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY CHAPTER 3: DESCRIPTIVE STATISTICS AND GRAPHICS Descriptive statistics concise and easily understood summary of data set characteristics

More information

Lesson Plan. Answer Questions. Summary Statistics. Histograms. The Normal Distribution. Using the Standard Normal Table

Lesson Plan. Answer Questions. Summary Statistics. Histograms. The Normal Distribution. Using the Standard Normal Table Lesson Plan Answer Questions Summary Statistics Histograms The Normal Distribution Using the Standard Normal Table 1 2. Summary Statistics Given a collection of data, one needs to find representations

More information

COMPLEMENTARY EXERCISES WITH DESCRIPTIVE STATISTICS

COMPLEMENTARY EXERCISES WITH DESCRIPTIVE STATISTICS COMPLEMENTARY EXERCISES WITH DESCRIPTIVE STATISTICS EX 1 Given the following series of data on Gender and Height for 8 patients, fill in two frequency tables one for each Variable, according to the model

More information

Preliminary Statistics course. Lecture 1: Descriptive Statistics

Preliminary Statistics course. Lecture 1: Descriptive Statistics Preliminary Statistics course Lecture 1: Descriptive Statistics Rory Macqueen (rm43@soas.ac.uk), September 2015 Organisational Sessions: 16-21 Sep. 10.00-13.00, V111 22-23 Sep. 15.00-18.00, V111 24 Sep.

More information

MA30S APPLIED UNIT F: DATA MANAGEMENT CLASS NOTES

MA30S APPLIED UNIT F: DATA MANAGEMENT CLASS NOTES 1 MA30S APPLIED UNIT F: DATA MANAGEMENT CLASS NOTES 1. We represent mathematical information in more ways than just using equations! Often a simple graph or chart or picture can represent a lot of information.

More information

Lecture Notes 2: Variables and graphics

Lecture Notes 2: Variables and graphics Highlights: Lecture Notes 2: Variables and graphics Quantitative vs. qualitative variables Continuous vs. discrete and ordinal vs. nominal variables Frequency distributions Pie charts Bar charts Histograms

More information

Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode.

Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode. Chapter 3 Numerically Summarizing Data Chapter 3.1 Measures of Central Tendency Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode. A1. Mean The

More information

CHAPTER 5: EXPLORING DATA DISTRIBUTIONS. Individuals are the objects described by a set of data. These individuals may be people, animals or things.

CHAPTER 5: EXPLORING DATA DISTRIBUTIONS. Individuals are the objects described by a set of data. These individuals may be people, animals or things. (c) Epstein 2013 Chapter 5: Exploring Data Distributions Page 1 CHAPTER 5: EXPLORING DATA DISTRIBUTIONS 5.1 Creating Histograms Individuals are the objects described by a set of data. These individuals

More information

Introduction to Statistics

Introduction to Statistics Why Statistics? Introduction to Statistics To develop an appreciation for variability and how it effects products and processes. Study methods that can be used to help solve problems, build knowledge and

More information

Statistics for Managers using Microsoft Excel 6 th Edition

Statistics for Managers using Microsoft Excel 6 th Edition Statistics for Managers using Microsoft Excel 6 th Edition Chapter 3 Numerical Descriptive Measures 3-1 Learning Objectives In this chapter, you learn: To describe the properties of central tendency, variation,

More information

Determining the Spread of a Distribution

Determining the Spread of a Distribution Determining the Spread of a Distribution 1.3-1.5 Cathy Poliak, Ph.D. cathy@math.uh.edu Department of Mathematics University of Houston Lecture 3-2311 Lecture 3-2311 1 / 58 Outline 1 Describing Quantitative

More information

Determining the Spread of a Distribution

Determining the Spread of a Distribution Determining the Spread of a Distribution 1.3-1.5 Cathy Poliak, Ph.D. cathy@math.uh.edu Department of Mathematics University of Houston Lecture 3-2311 Lecture 3-2311 1 / 58 Outline 1 Describing Quantitative

More information

2.0 Lesson Plan. Answer Questions. Summary Statistics. Histograms. The Normal Distribution. Using the Standard Normal Table

2.0 Lesson Plan. Answer Questions. Summary Statistics. Histograms. The Normal Distribution. Using the Standard Normal Table 2.0 Lesson Plan Answer Questions 1 Summary Statistics Histograms The Normal Distribution Using the Standard Normal Table 2. Summary Statistics Given a collection of data, one needs to find representations

More information

Summarising numerical data

Summarising numerical data 2 Core: Data analysis Chapter 2 Summarising numerical data 42 Core Chapter 2 Summarising numerical data 2A Dot plots and stem plots Even when we have constructed a frequency table, or a histogram to display

More information

GRE Quantitative Reasoning Practice Questions

GRE Quantitative Reasoning Practice Questions GRE Quantitative Reasoning Practice Questions y O x 7. The figure above shows the graph of the function f in the xy-plane. What is the value of f (f( ))? A B C 0 D E Explanation Note that to find f (f(

More information