Analytical Graphing lets start with the best graph ever made Probably the best statistical graphic ever drawn, this map by Charles Joseph Minard portrays the losses suffered by Napoleon's army in the Russian campaign of 1812. Beginning at the Polish-Russian border, the thick band shows the size of the army at each position. The path of Napoleon's retreat from Moscow in the bitterly cold winter is depicted by the dark lower band, which is tied to temperature and time scales. The graph illustrates an amazing point how an army of 4, can dwindle to, without losing a single major battle. 1
When is a graph appropriate? Always for data exploration Often for data analysis and to develop predictions (models) and experimental designs Sometimes for presentations Less often for publications Data Exploration Is not snooping in the pejorative sense. Exploration is a necessary and desired operation for: Checking data for unusual values Making sure the data meet the assumptions of the chosen form of analysis Eg normality, homogeneity of variances, linearity (in regression approaches) deciding (sometimes) what sort of analysis to do. This hopefully will have been done prior to initiating a study To look for patterns that may not be expected or apparent this is indeed snooping but it is an essential part of hypothesis formation 2
Count Count Data Exploration Checking data for unusual values Making sure the data meet the assumptions of the chosen form of analysis See ourworld - pop_86 Determining distributions and outliers Will a transformation help?? 5 15 5.8 4 3.7.6.5.4.3.2.1 5. 15 Population of countries (1986) Proportion per Bar 15 5 1....3.2.1. 1... Population of countries (1986) Proportion per Bar 3
Data Exploration Is not snooping in the pejorative sense. Exploration is a necessary and desired operation for: Checking data for unusual values Making sure the data meet the assumptions of the chosen form of analysis Eg normality, homogeneity of variances, linearity (in regression approaches) The relationship between birth and death rates (ourworld) Is it linear, or is there perhaps a more appropriate model 3 DEATH_82 3 4 5 BIRTH_82 4
Clearly not linear using LOESS procedure (locally weighted scatterplot smoothing): a non-parametric regression method that combines multiple regression models in a k-nearest-neighbor-based meta-mode 3 DEATH_82 3 4 5 BIRTH_82 When is a graph appropriate? Often for data analysis (e.g.) To understand the nature of interaction terms (more later) To understand the power of a test. Say we wanted to determine sample size for an experiment where we thought the response would be around (alternate Hypothesis =) the standard deviation about 8 and we were willing to relax alpha (from.5 to.) 5
For example the effect of relaxing alpha on power Pop. Mean = Alternative = SD = 8 Alpha=.5,. Power 1..9.8.7.6.5.4.3 Power Curve (Alpha =.5).2 5 15 25 3 35 Sample Size (per cell) Power 1..9.8.7.6.5.4 Power Curve (Alpha =.).3 5 15 Sample Size (per cell) When is a graph appropriate? Sometimes for presentations Idea is to communicate information quickly Be sure you know why you are presenting the graph (is it to convey stats or some other information (we will talk about this more later) Graphs should be simple and not contain too much information never have a graph that is not interpretable So many factors involved that no one could figure it out, or worse 6
I know you can t really see this but. GN P _ 8 6 GD P _ C A P L OG_ GD P E D U C _ 8 4 E DUC HE A L T H8 4 HE A L T H P OP _ 1 9 8 3P OP _ 1 9 8 6P OP _ 1 9 9 P OP _ 2 2 B I R T H _ 8 2 B I R T H _ R T D E A T H _ 8 2D E A T H _ R TB A B Y M T 8 B 2 A B Y M OR TL I F E _ E X P GN P _ 8 2 GN P _ 8 6 GD P _ C A P L OG_ GD P E D U C _ 8 4 E DUC HE A L T H8 4 HE A L T H P OP _1983 P OP _1983 P OP _ 1 9 8 3P OP _ 1 9 8 6P OP _ 1 9 9 P OP _ 2 2 B I R T H _ 8 2 B I R T H _ R T D E A T H _ 8 2D E A T H _ R TB A B Y M T 8 B 2 A B Y M OR TL I F E _ E X P GN P _ 8 2 P OP _1986 P OP _1986 P OP _199 P OP _199 D E A TH _82 B A B Y MOR T B A B Y MT82 D E A TH _R T D E A TH _82 B I R TH _R T B I R TH _R T B I R TH _82 B I R TH _82 P OP _ P OP _ H E A LTH H E A LTH H E A LTH 84 H E A LTH 84 E DUC E DUC E D U C _84 E D U C _84 LOG_GD P LOG_GD P GD P _C A P GD P _C A P GN P _86 GN P _86 GN P _82 GN P _82 LI FE _E X P D E A TH _R T B A B Y MT82 B A B Y MOR T LI FE _E X P These are usually presented to demonstrate how much work the researcher has done really conveys that he or she has not adequately prepared the presentation When is a graph appropriate? Less often for publications Idea is to communicate information that is too complex to leave in tables or text They typically depict rather than present information (you have to read across to axes to get numbers). Hence if precise bits of information are important to the argument being made use tables. If a graph is presented it must be important to the argument being made in the text (no fluff graphs) Information cannot be presented twice (eg table and figure, text and figure) If a graph is presented it must be interpretable You should be able to understand the purpose and content of the figure directly from the legend. 7
Basics of analytical graph theory Graph types imply a basis of logic and are not always interchangeable Even interchangeable graph types are not always equivalent (some are just non-informative) Be very clear about what you are trying to convey: models, stats or data structure Graph construction (axes, scales etc) may obscure or make clear the points you are trying to make Graph trickery is usually just that and typically subtracts from the depiction Graph types imply a basis of logic and are not always interchangeable Summary Charts Density Charts Scatterplots, quantile plots and probability plots 8
Summary Charts There are a series of general graphical displays useful for characterizing the relationship between independent variables (usually categorical) and summary statistics of dependent variables (usually continuous). An example would be a bar graph of the relationship between education and income (see survey2 data). Some types of summary charts: Examples of continuous and categorical variables Categorical Gender (male, female) Nationality (French, Italian) Species (Human, Chimp) Color (red, green, blue) Age Group (Young, Old) Height Group (Short, Medium, Tall) Weight Group (Thin, Obese) Speed (Fast, Slow) Continuous Hormone level Location (Latitude, Longitude) Phylogenetic distance Color (wavelength) Age (years, days) Height (cm, inches) Weight (grams, pounds) Speed (cm/sec) Temperature (Cold, Warm) Temperature (degrees C) 9
7 Bar Dot Line 7 7 6 6 6 5 5 5 4 3 4 3 4 3 7 Profile Pyramid Pie 7 6 5 6 5 4 3 4 3 Which conveys the information most clearly how about the comparisons of interest 7 6 5 4 3 SEX Female Male Female SEX Male 7 6 5 4 3 7 6 5 4 3 3 4 5 6 7 SEX Female Male
Density Charts The density of a sample is the relative concentration of data points in intervals across the range of the distribution. A histogram is one way to display the density of a quantitative variable; box plots, dot or symmetric dot density, frequency polygons, fuzzygrams, jitter plots, density stripes, and histograms with data-driven bar widths are others. Histogram Length (mm) 11
Features of a BOXPLOT Rather than comparing sample values to the normal distribution (mean, standard deviation, and so on), box plots show robust (what does this mean) statistics (median, quartiles, and so on). confidence interval hinge median hinge outliers mean 25% 25% 25% 25% Smallest 5% Statistical Range Y Raw Data Plots:e.g. Scatterplots, Scatterplots are probably the most common form of graphical display. The key feature of scatterplots is that raw data are plotted (in contrast to summary data as in summary charts). Regression lines with confidence bands or smoothers (e.g. linear, non-linear) can be added to help explain relationships among variables. An example is the relationship between mussel height, and length and mussel height and mussel mass. How to estimate length and mass of mussels? 12
Height Length Non-linear and linear smoothing Each point is a mussel 13
Scatterplots, quantile plots and probability plots Quantile plots and probability plots are useful for studying the distribution of a variable. Quantile Plots produces quantile plots, or Q plots. Unlike probability plots, which compare a sample to a theoretical probability distribution, a quantile plot compares a sample to its own quantiles (a one-sample plot) or to another sample (a two-sample, or Q-Q, plot). The quantile of a sample is the data point corresponding to a given fraction of the data. See ourworld (pop_1986 ) Features of a Quantile Plot Distribution of data Fraction of Data 1..9.8.7.6.5.4.3.2.1 86% of countries had populations less or equal to 5 million people. 5 15 POP_1986 Distribution of quantiles (should be uniform but is subject to sample size) 14
Scatterplots, quantile plots and probability plots A Probability Plot plots the values of a variable against the corresponding percentage points of a theoretical distribution--normal, chi-square, t, F, uniform, binomial, logistic, exponential, gamma, Weibull, or Studentized range. Graphs like this are called probability plots, or P plots. You can also plot the expected values of one variable against those of another (P-P plot). These graphs are very important for determining if data are in need of transformation. See ourworld (pop_1986 ) Features of a Probability Plot No transformation Log (base ) transformation 15
Extra slides Basics of analytical graph theory Graph types imply a basis of logic and are not always interchangeable Even interchangeable graph types are not always equivalent (some are just non-informative) Be very clear about what you are trying to convey: models, stats or data structure Graph construction (axes, scales etc) may obscure or make clear the points you are trying to make Graph trickery is usually just that and typically subtracts from the depiction 16
The underlying basis of the graph There are two general bases for any data graph that will be presented or published. To display data (hopefully in the most efficient way) To convey information about statistics associated with the data Both of the above Although these may not appear to present a conflict often times there is here is an example Error Bars Be very Careful - error bars convey meaning - at least two sorts Estimate of variability for subjects in that category, irrespective of strata or statistical assumptions Of use for showing spread in sampled data Of no use for conveying inferential statistics Estimate of variability for subjects in that category, with respect to strata and statistical assumptions Of no use for showing spread in sampled data Of use for conveying inferential statistics See typing 17
How and why are these two graphs different? 8 without respect to strata and statistical assumptions 78 Least Squares Means with respect to strata and statistical assumptions 7 68 SPEED 6 SPEED 58 5 48 4 electric plain old EQUIPMNT word process 38 electric plain old EQUIPMNT word process Basics of analytical graph theory Graph types imply a basis of logic and are not always interchangeable Even interchangeable graph types are not always equivalent (some are just non-informative) Be very clear about what you are trying to convey: models, stats or data structure Graph construction (axes, scales etc) may obscure or make clear the points you are trying to make Graph trickery is usually just that and typically subtracts from the depiction 18
Which is best? 5 4 3 SEX Female Male 19