Analytical Graphing

Let's start with the best graph ever made

Probably the best statistical graphic ever drawn, this map by Charles Joseph Minard portrays the losses suffered by Napoleon's army in the Russian campaign of 1812. Beginning at the Polish-Russian border, the thick band shows the size of the army at each position. The path of Napoleon's retreat from Moscow in the bitterly cold winter is depicted by the dark lower band, which is tied to temperature and time scales. The graph illustrates an amazing point: how a large army can dwindle to a small remnant without losing a single major battle.
When is a graph appropriate?
- Always for data exploration
- Often for data analysis and to develop predictions (models) and experimental designs
- Sometimes for presentations
- Less often for publications

Data Exploration
Exploration is not snooping in the pejorative sense. It is a necessary and desired operation for:
- Checking data for unusual values
- Making sure the data meet the assumptions of the chosen form of analysis, e.g. normality, homogeneity of variances, linearity (in regression approaches)
- Deciding (sometimes) what sort of analysis to do. Ideally this will have been done before initiating a study
- Looking for patterns that may not be expected or apparent. This is indeed snooping, but it is an essential part of hypothesis formation
Data Exploration
- Checking data for unusual values
- Making sure the data meet the assumptions of the chosen form of analysis
- See ourworld (pop_86): determining distributions and outliers. Will a transformation help?
[Figure: two histograms of Population of countries (1986); y-axes: Count and Proportion per Bar]
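Whether a transformation helps can also be checked numerically. The sketch below is only an illustration under stated assumptions: the `populations` values are made up (they are not the ourworld data), and `skewness` is a hypothetical helper name. It compares the skewness of the raw values with that of their base-10 logarithms; a value nearer zero suggests a more symmetric histogram.

```python
import math
import statistics


def skewness(values):
    """Skewness as the mean cubed z-score (population form)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed values (NOT the ourworld data): a few very
# large values dominate, as in a histogram of raw country populations.
populations = [0.3, 0.8, 1.5, 2.0, 3.5, 5.0, 8.0, 12.0, 25.0, 60.0, 240.0, 1000.0]

# Log transformation pulls in the long right tail.
log_populations = [math.log10(p) for p in populations]

raw_skew = skewness(populations)
log_skew = skewness(log_populations)

print(f"skewness of raw values:   {raw_skew:.2f}")
print(f"skewness of log10 values: {log_skew:.2f}")
```

A histogram of each version is still worth drawing; the single number only summarizes what the two plots show.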
Data Exploration
Exploration is not snooping in the pejorative sense. It is a necessary and desired operation for:
- Checking data for unusual values
- Making sure the data meet the assumptions of the chosen form of analysis, e.g. normality, homogeneity of variances, linearity (in regression approaches)
Example: the relationship between birth and death rates (ourworld). Is it linear, or is there perhaps a more appropriate model?
[Figure: scatterplot of DEATH_82 and BIRTH_82]
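One way to judge whether a straight line is adequate is to compare it with a locally weighted smoother (LOESS). The sketch below is a simplified version under stated assumptions: it uses tricube weights and local straight-line fits with no robustness iterations, the function names are hypothetical, and the birth and death values are invented rather than taken from ourworld.

```python
def linear_fit(xs, ys):
    """Fitted values from an ordinary least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return [my + slope * (x - mx) for x in xs]


def loess(xs, ys, frac=0.5):
    """Fitted values from local weighted straight-line fits (simplified LOESS)."""
    n = len(xs)
    k = max(2, int(round(frac * n)))  # neighbours considered for each local fit
    fitted = []
    for x0 in xs:
        # Distance to the k-th nearest neighbour sets the local bandwidth.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12
        # Tricube weights: nearby points count most, points at or beyond h count zero.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def sse(ys, fitted):
    """Sum of squared residuals."""
    return sum((y - f) ** 2 for y, f in zip(ys, fitted))


# Hypothetical curved relationship (NOT the ourworld data).
birth = list(range(10, 51, 2))
death = [8 + 0.02 * (b - 25) ** 2 for b in birth]

sse_line = sse(death, linear_fit(birth, death))
sse_loess = sse(death, loess(birth, death))
print(f"straight line SSE: {sse_line:.2f}")
print(f"LOESS SSE:         {sse_loess:.2f}")
```

A much smaller residual sum for the smoother than for the line is the numerical counterpart of seeing the smoothed curve bend away from the straight line on the scatterplot.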
Clearly not linear
Using the LOESS procedure (locally weighted scatterplot smoothing): a non-parametric regression method that combines multiple regression models in a k-nearest-neighbor-based meta-model.
[Figure: scatterplot of DEATH_82 and BIRTH_82 with a LOESS smoother]

When is a graph appropriate?
Often for data analysis, e.g.:
- To understand the nature of interaction terms (more later)
- To understand the power of a test. Say we wanted to determine the sample size for an experiment where we had an expected response under the alternative hypothesis, thought the standard deviation would be about 8, and were willing to relax alpha (from .05 to a larger value)
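The sample-size question above can be explored by tabulating power against sample size for two alpha levels. This is a rough sketch only: it assumes a two-sided, two-sample comparison with a normal approximation, which may not match the design behind the slides. The standard deviation of 8 comes from the slide; the difference of 5 between the null and alternative means and the alpha levels 0.05 and 0.10 are assumed values, and `power_two_sample` is a hypothetical helper.

```python
from statistics import NormalDist


def power_two_sample(n_per_group, diff, sd, alpha):
    """Approximate power of a two-sided two-sample z-test."""
    z = NormalDist()
    se = sd * (2 / n_per_group) ** 0.5   # standard error of the mean difference
    z_crit = z.inv_cdf(1 - alpha / 2)    # two-sided critical value
    shift = diff / se                    # true difference in standard-error units
    # Probability that the test statistic falls in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


SD = 8.0            # from the slide
DIFF = 5.0          # ASSUMED difference between null and alternative means
ALPHAS = (0.05, 0.10)  # ASSUMED alpha levels

print("n per cell  " + "  ".join(f"alpha={a:.2f}" for a in ALPHAS))
for n in range(5, 40, 5):
    powers = [power_two_sample(n, DIFF, SD, a) for a in ALPHAS]
    print(f"{n:10d}  " + "  ".join(f"{p:10.3f}" for p in powers))
```

Reading down a column shows power rising with sample size; reading across a row shows what relaxing alpha buys at a fixed sample size, which is the comparison the two power curves make graphically.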
For example, the effect of relaxing alpha on power
[Figure: two power curves, one per alpha level, plotting Power against Sample Size (per cell); SD = 8]

When is a graph appropriate?
Sometimes for presentations
- The idea is to communicate information quickly
- Be sure you know why you are presenting the graph: is it to convey stats or some other information? (We will talk about this more later)
- Graphs should be simple and not contain too much information
- Never have a graph that is not interpretable: so many factors involved that no one could figure it out, or worse...
I know you can't really see this but...
[Figure: scatterplot matrix of the ourworld variables, including POP_1983, POP_1986, BIRTH_82, BIRTH_RT, DEATH_82, DEATH_RT, BABYMT82, BABYMORT, LIFE_EXP, GNP_82, GNP_86, GDP_CAP, LOG_GDP, EDUC_84, EDUC, HEALTH84 and HEALTH]
These are usually presented to demonstrate how much work the researcher has done. What they really convey is that he or she has not adequately prepared the presentation.

When is a graph appropriate?
Less often for publications
- The idea is to communicate information that is too complex to leave in tables or text
- Graphs typically depict rather than present information (you have to read across to the axes to get numbers). Hence, if precise bits of information are important to the argument being made, use tables
- If a graph is presented it must be important to the argument being made in the text (no fluff graphs)
- Information cannot be presented twice (e.g. table and figure, text and figure)
- If a graph is presented it must be interpretable. You should be able to understand the purpose and content of the figure directly from the legend
Basics of analytical graph theory
- Graph types imply a basis of logic and are not always interchangeable
- Even interchangeable graph types are not always equivalent (some are just non-informative)
- Be very clear about what you are trying to convey: models, stats or data structure
- Graph construction (axes, scales etc.) may obscure or make clear the points you are trying to make
- Graph trickery is usually just that and typically subtracts from the depiction

Graph types imply a basis of logic and are not always interchangeable
- Summary Charts
- Density Charts
- Scatterplots, quantile plots and probability plots
Summary Charts
There is a series of general graphical displays useful for characterizing the relationship between independent variables (usually categorical) and summary statistics of dependent variables (usually continuous). An example would be a bar graph of the relationship between education and income (see survey2 data).

Examples of continuous and categorical variables

Categorical                            Continuous
Gender (male, female)                  Hormone level
Nationality (French, Italian)          Location (Latitude, Longitude)
Species (Human, Chimp)                 Phylogenetic distance
Color (red, green, blue)               Color (wavelength)
Age Group (Young, Old)                 Age (years, days)
Height Group (Short, Medium, Tall)     Height (cm, inches)
Weight Group (Thin, Obese)             Weight (grams, pounds)
Speed (Fast, Slow)                     Speed (cm/sec)
Temperature (Cold, Warm)               Temperature (degrees C)

Some types of summary charts:
[Figure: six summary charts of INCOME by EDUC (no grad hs, hs grad, some college, college grad); panels: Bar, Dot, Line, Profile, Pyramid, Pie]

Which conveys the information most clearly? How about the comparisons of interest?
[Figure: four charts of INCOME by EDUC (no grad hs, hs grad, some college, college grad), split by SEX (Female, Male)]
Density Charts
The density of a sample is the relative concentration of data points in intervals across the range of the distribution. A histogram is one way to display the density of a quantitative variable; box plots, dot or symmetric dot density, frequency polygons, fuzzygrams, jitter plots, density stripes, and histograms with data-driven bar widths are others.
[Figure: Histogram; x-axis: Length (mm)]
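Box plots, listed above among the density displays, are built from robust statistics: summaries such as the median and quartiles that change little when a few extreme values are added. The sketch below illustrates this under stated assumptions: the lengths are invented, `box_stats` is a hypothetical helper, and the 1.5 x IQR fence for flagging outliers is Tukey's common convention, which may differ from the rule used by the software behind the slides.

```python
import statistics


def box_stats(values):
    """Median, quartiles and outliers of the kind a box plot displays."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # ASSUMED outlier rule: beyond 1.5 * IQR from the quartiles.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical lengths in mm (not data from the slides).
lengths = [21, 23, 24, 25, 26, 27, 28, 30, 31]
with_outlier = lengths + [95]  # add one extreme value

clean = box_stats(lengths)
dirty = box_stats(with_outlier)

mean_shift = statistics.fmean(with_outlier) - statistics.fmean(lengths)
median_shift = dirty["median"] - clean["median"]

print(f"mean shifts by   {mean_shift:.2f} mm")
print(f"median shifts by {median_shift:.2f} mm")
print("flagged outliers:", dirty["outliers"])
```

The single extreme value moves the mean far more than the median, which is why the box itself barely changes while the extreme point is drawn separately.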
Features of a BOXPLOT
Rather than comparing sample values to the normal distribution (mean, standard deviation, and so on), box plots show robust (what does this mean?) statistics (median, quartiles, and so on).
[Figure: annotated box plot labelling the median, hinges, confidence interval, mean and outliers; each of the four segments holds 25% of the data]

Raw Data Plots: e.g. Scatterplots
Scatterplots are probably the most common form of graphical display. The key feature of scatterplots is that raw data are plotted (in contrast to summary data, as in summary charts). Regression lines with confidence bands or smoothers (e.g. linear, non-linear) can be added to help explain relationships among variables. An example is the relationship between mussel height and length, and between mussel height and mussel mass. How to estimate length and mass of mussels?
[Figure: scatterplot of mussel Height and Length with non-linear and linear smoothing; each point is a mussel]
Scatterplots, quantile plots and probability plots
Quantile plots and probability plots are useful for studying the distribution of a variable. Quantile plots are also called Q plots. Unlike probability plots, which compare a sample to a theoretical probability distribution, a quantile plot compares a sample to its own quantiles (a one-sample plot) or to another sample (a two-sample, or Q-Q, plot). The quantile of a sample is the data point corresponding to a given fraction of the data. See ourworld (pop_1986).

Features of a Quantile Plot
[Figure: quantile plot of Fraction of Data against POP_1986. Annotations: the distribution of data along the POP_1986 axis; the distribution of quantiles along the Fraction of Data axis (should be uniform but is subject to sample size); a marked point showing that 86% of countries had populations less than or equal to the indicated value, in millions of people]
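The coordinates of a one-sample quantile plot are simple to compute: sort the data and pair each value with the fraction of the data at or below it. The sketch below makes several assumptions: the values are invented (not pop_1986), the function names are hypothetical, and the plotting position (i + 0.5) / n is one common convention among several.

```python
def quantile_points(values):
    """(fraction of data, value) pairs for a one-sample quantile plot."""
    ordered = sorted(values)
    n = len(ordered)
    # ASSUMED plotting position: (i + 0.5) / n for the i-th sorted value (0-based).
    return [((i + 0.5) / n, v) for i, v in enumerate(ordered)]


def fraction_at_or_below(values, threshold):
    """Share of observations less than or equal to the threshold."""
    return sum(v <= threshold for v in values) / len(values)


# Hypothetical populations in millions (NOT the ourworld data).
populations = [0.5, 1.2, 3.0, 4.5, 7.0, 9.8, 15.0, 22.0, 56.0, 240.0]

points = quantile_points(populations)
for fraction, value in points:
    print(f"{fraction:.2f}  {value:7.1f}")

share = fraction_at_or_below(populations, 22.0)
print(f"{share:.0%} of the hypothetical countries have 22 million people or fewer")
```

Reading a statement such as "86% of countries had populations at or below some value" off the plot is the same lookup that `fraction_at_or_below` performs.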
Scatterplots, quantile plots and probability plots
A probability plot plots the values of a variable against the corresponding percentage points of a theoretical distribution: normal, chi-square, t, F, uniform, binomial, logistic, exponential, gamma, Weibull, or Studentized range. Graphs like this are called probability plots, or P plots. You can also plot the expected values of one variable against those of another (P-P plot). These graphs are very important for determining whether data are in need of transformation. See ourworld (pop_1986).

Features of a Probability Plot
[Figure: two probability plots of pop_1986; panels: No transformation, Log transformation]
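A normal probability plot pairs the sorted data with the values expected from a normal distribution; the straighter the points, the closer the data are to normal. As a rough numerical stand-in for judging straightness by eye, the sketch below computes the correlation between the two. It is an illustration under stated assumptions: the data are invented (not pop_1986), the helper names are hypothetical, the plotting position (i + 0.5) / n is one convention among several, and base-10 logs are used only as an example transformation.

```python
import math
from statistics import NormalDist


def normal_scores(n):
    """Expected standard-normal values for n sorted observations."""
    z = NormalDist()
    return [z.inv_cdf((i + 0.5) / n) for i in range(n)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def probability_plot_r(values):
    """Straightness of the normal probability plot, as a correlation."""
    ordered = sorted(values)
    return pearson_r(normal_scores(len(ordered)), ordered)


# Hypothetical right-skewed values (NOT the ourworld data).
populations = [0.3, 0.8, 1.5, 2.0, 3.5, 5.0, 8.0, 12.0, 25.0, 60.0, 240.0, 1000.0]

r_raw = probability_plot_r(populations)
r_log = probability_plot_r([math.log10(p) for p in populations])

print(f"no transformation:  r = {r_raw:.3f}")
print(f"log transformation: r = {r_log:.3f}")
```

A correlation much closer to 1 after the transformation corresponds to the transformed panel lying closer to a straight line.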
Let's Play

Activity 1: Graph construction
Draw the most appropriate graph, given the data set and type provided.
- Think about the nature of the information and how best to depict it
- Label both the x and y axes
- Use appropriate scales for both axes
- Think about the number of ticks on each axis and the labeling of tick marks
- Make sure the elements (points, bars, lines etc.) are crafted in a way that simplifies interpretation (think about color, pattern, shape of elements, whether or not to depict a trend)
- Provide a figure legend that is descriptive: the reader should be able to interpret the figure based on the graph and legend
Average size of Seastars (Pisaster) over time

Age (years)   Diameter (mm)
1             35
2             65
3             92
4             116
5             138
6             158
7             176
8             192
9             6 219

Total commercial abalone landings (pounds) over time in California

Year   Abalone Landings
1973   3,187,76
1974   2,587,8
1975   2,128,545
1976   1,7,111
1977   1,434,5
1978   1,292,517
1979   989,124
1980   1,238,495
1981   1,9,463
1982   1,2,443
1983   8,25
1984   826,514
1985   823,931
1986   614,962
1987   762,951
1988   568,716
1989   741,1
1990   523,942
1991   38,593
1992   514,8
1993   461,3
1994   2,596
1995   262,314
1996   229,379
1997   112,323
The relationship between time to run a mile and maximum oxygen consumption
Columns: VO2 max (oxygen consumption, ml/(kg min)), Runtime (minutes per mile)
59.57 8.17 6.6 8.63 54.3 8.65 54.63 8.92 49.16 8.95 49.87 9.22 48.67 9.4 45.44 9.63.55 9.93 46.67 45.31.7.39.8.54.13 46.77.25 51.86.33 45.79.47 47.47.5 47.27.6 49.9.85.84.95 45.12 11.8 44.75 11.12 46.8 11.17 44.61 11.37 47.92 11.5 44.81 11.63 45.68 11.95 39.41 12.63 39.2 12.88 39.44 13.8 37.39 14.3

Size distribution of Limpets
Limpet size (mm): 6 7
Two variables: number of Blue whales as a function of period and location

             Southern Hemisphere   North Pacific   North Atlantic
Prewhaling   ~175,                 4,9             1,
Current      ~2,                   ~2,             ~

Extra slides
Basics of analytical graph theory
- Graph types imply a basis of logic and are not always interchangeable
- Even interchangeable graph types are not always equivalent (some are just non-informative)
- Be very clear about what you are trying to convey: models, stats or data structure
- Graph construction (axes, scales etc.) may obscure or make clear the points you are trying to make
- Graph trickery is usually just that and typically subtracts from the depiction

The underlying basis of the graph
There are two general bases for any data graph that will be presented or published:
- To display data (hopefully in the most efficient way)
- To convey information about statistics associated with the data
- Both of the above
Although these may not appear to present a conflict, oftentimes there is one. Here is an example.
Error Bars
Be very careful: error bars convey meaning, of at least two sorts.
- Estimate of variability for subjects in that category, irrespective of strata or statistical assumptions
  - Of use for showing spread in sampled data
  - Of no use for conveying inferential statistics
- Estimate of variability for subjects in that category, with respect to strata and statistical assumptions
  - Of no use for showing spread in sampled data
  - Of use for conveying inferential statistics
See typing

How and why are these two graphs different?
[Figure: two plots of SPEED by EQUIPMNT (electric, plain old, word process). Left: error bars without respect to strata and statistical assumptions. Right: Least Squares Means, with error bars with respect to strata and statistical assumptions]
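The two kinds of error bar can be computed side by side. The sketch below is one plausible reading, not necessarily the calculation behind the slides: the "spread" bars are per-group standard deviations, and the "inferential" bars are standard errors built from the pooled within-group variance, as a one-way ANOVA would assume. The typing speeds are invented (not the typing data), and the function names are hypothetical.

```python
import statistics

# Hypothetical typing speeds by equipment type (NOT the typing data).
groups = {
    "electric": [52, 60, 58, 65, 49],
    "plain old": [40, 47, 38, 45, 50],
    "word process": [68, 75, 62, 80, 70],
}


def spread_bars(groups):
    """Per-group standard deviation: describes the spread of the sample."""
    return {name: statistics.stdev(v) for name, v in groups.items()}


def pooled_se_bars(groups):
    """Standard error of each mean from the pooled within-group variance.

    ASSUMES equal variances across groups, as in a one-way ANOVA.
    """
    df = sum(len(v) - 1 for v in groups.values())
    ss = sum(
        sum((x - statistics.fmean(v)) ** 2 for x in v) for v in groups.values()
    )
    mse = ss / df  # pooled within-group variance (mean square error)
    return {name: (mse / len(v)) ** 0.5 for name, v in groups.items()}


sd = spread_bars(groups)
se = pooled_se_bars(groups)
for name, values in groups.items():
    mean = statistics.fmean(values)
    print(f"{name:13s} mean={mean:5.1f}  SD={sd[name]:.2f}  pooled SE={se[name]:.2f}")
```

With equal group sizes the pooled bars are identical across groups and much shorter than the standard deviations, which is one reason two graphs of the same means can look so different.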
Basics of analytical graph theory
- Graph types imply a basis of logic and are not always interchangeable
- Even interchangeable graph types are not always equivalent (some are just non-informative)
- Be very clear about what you are trying to convey: models, stats or data structure
- Graph construction (axes, scales etc.) may obscure or make clear the points you are trying to make
- Graph trickery is usually just that and typically subtracts from the depiction

Which is best?
[Figure: INCOME by EDUC (no grad hs, hs grad, some college, college grad) and SEX (Female, Male)]