Statistical Thinking, Data Types, and Geographical Primitives
The scientific method in geography, two kinds of approaches, and the sorts of statistics used to support those approaches
Some characteristics of data, and considerations when collecting measurements, making observations, and using data
The geographical primitives that are often used to generate measurements in geography from the attributes of geographical features
Applying the Scientific Method
Both physical scientists and social scientists (in our context, physical and human geographers) often make use of the scientific method in their attempts to learn about the world.
[Figure: the scientific method, with stages labeled Concepts, Description, Hypothesis, Theory, Laws, and Model, linked by the steps organize, surprise, formalize, and validate]
Two Sorts of Approaches
The scientific method gives us a means by which to approach the problems we wish to solve.
The core of this method is the forming and testing of hypotheses.
A very loose definition of hypotheses is potential answers to questions.
Geographers use quantitative methods in the context of the scientific method in at least two distinct fashions:
Two Sorts of Approaches
Exploratory methods of analysis focus on generating and suggesting hypotheses.
Confirmatory methods are applied in order to test the utility and validity of hypotheses.
[Figure: the scientific method diagram, repeated]
Two Sorts of Statistics for Two Approaches
Statistics can be divided into two major types, with each type most useful in the context of one of the two approaches in geography.
Descriptive statistics tend to be useful in the context of exploratory approaches because their function is primarily to summarize a dataset in a way that emphasizes some characteristics.
Inferential statistics are applied in the context of confirmatory approaches because their function is to test the veracity and validity of an idea or inference.
Descriptive Statistics
Descriptive statistics provide an organization and summary of a dataset.
A small number of summary measures replaces the entirety of a dataset.
e.g. Suppose a TV station's weather bureau records the temperature on an hourly basis each day, giving 24 values. Rather than reporting all 24 values, they usually tell you the high and low temperature for the day (and possibly the range and an average value as well).
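The weather-bureau example can be sketched in a few lines of Python. The hourly temperatures below are made up for illustration, and the function name `summarize` is hypothetical. The point is only that 24 values collapse into four summary measures.

```python
from statistics import mean

# Hypothetical hourly temperatures (degrees F) for one day; the values are invented.
hourly_temps = [52, 51, 50, 49, 48, 48, 49, 52, 55, 59, 62, 65,
                67, 69, 70, 70, 69, 67, 64, 61, 58, 56, 54, 53]

def summarize(values):
    """Reduce a list of measurements to a few descriptive statistics."""
    high = max(values)
    low = min(values)
    return {"high": high, "low": low, "range": high - low, "mean": mean(values)}

summary = summarize(hourly_temps)
print(summary)
```

Note what is lost: the summary cannot tell you at what hour the high occurred, or how quickly the temperature rose.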
Descriptive Statistics
In the act of summarizing a dataset using descriptive statistics, there is necessarily some loss of information.
It is through the decisions made in how the salient aspects of a dataset are summarized that there is the potential to give a misleading impression of data.
The statistician can distort the information contained in the dataset through the selection of particular descriptive statistics, or through the aggregation of data in such a way that the statistics are misleading.
Descriptive Statistics
The temptation for a scientist to select descriptive methods that emphasize their notions about their datasets can be very strong indeed.
While it would be nice to be able to approach every problem with an entirely open mind, the reality is that scientists almost always have some preconceived notions about what they expect to find in their data.
As a result, there is a tendency to select statistical measures that most strongly convey the pattern that is expected to be found a priori.
The Nature of Statistics
Statistical methods are designed to derive conclusions based upon empirical data, obtained by observation.
Mathematics operates using deductive reasoning.
Statistics relies on inductive reasoning: statistical approaches are used to extrapolate conclusions that apply to more than just the limited set of available observations.
e.g. when someone uses the limited data from a small poll to infer some truth about the parent population
Some Characteristics of Data
Not all data are the same, and depending on some characteristics of a particular dataset, there are some limitations as to what can and cannot be done with that data.
Some key characteristics that must be considered are:
A. Scale of Measurement
B. Continuous vs. Discrete
C. Grouped vs. Individual
A. Scales of Measurement
Data is the plural of datum; data are generated by the recording of measurements.
Measurement involves the categorization of an item (i.e. assigning an item to a set of types) when the measure is qualitative, OR makes use of a number to give something a quantitative measurement.
A. Scales of Measurement
The data used in statistical analyses can be divided into four types:
1. The Nominal Scale
2. The Ordinal Scale
3. The Interval Scale
4. The Ratio Scale
As we progress through these scales, the types of data they describe have increasing information content.
The Nominal Scale
Nominal scale data are data that can simply be broken down into categories, i.e. having to do with names or types.
Dichotomous or binary nominal data have just two types, e.g. yes/no, female/male, is/is not, hot/cold, etc.
Multichotomous data have more than two types, e.g. vegetation types, soil types, counties, eye color, etc.
This is not a scale in the sense that categories cannot be ranked or ordered (no greater than/less than).
The Ordinal Scale
Ordinal scale data are data that can be categorized, but also can be placed in an order, i.e. categories that can be assigned a relative importance and can be ranked such that numerical category values have some meaning.
e.g. star-system restaurant rankings: 5 stars > 4 stars, 4 stars > 3 stars, 5 stars > 3 stars
BUT ordinal data still are not scalar in the sense that differences between categories do not have a quantitative meaning, i.e. a 5-star restaurant is not necessarily superior to a 4-star restaurant by the same amount as a 4-star restaurant is to a 3-star restaurant.
The Interval Scale
When using the interval scale, there is a meaningful quantity to a numerical category name, so we can not only assess that A > B, but we can also look at how much greater A is than B (A - B).
To put it another way, the units of a scale can be used here, e.g. temperature scales, elevation, etc.
So, we can meaningfully look at the difference in the value of two interval scale observations, BUT we still cannot multiply or divide them meaningfully, because the value of zero is arbitrary.
The Ratio Scale
Similar to the interval scale, but with the addition of having a meaningful zero value, which allows us to compare values using multiplication and division operations, e.g. precipitation, weights, heights, etc.
We can say that 2 inches of rain is twice as much rain as 1 inch of rain because this is a ratio scale measurement, whereas 2 degrees Celsius is not twice as warm as 1 degree Celsius because 0 degrees Celsius does not denote a total absence of warmth (degrees Celsius is interval scale).
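A short numerical sketch of the rain vs. temperature comparison. The Kelvin conversion is not on the slide; it is brought in here as an assumption-free physical fact (Kelvin has a non-arbitrary zero) to show what the ratio of two temperatures looks like on a scale where division is meaningful. The variable and function names are illustrative.

```python
def celsius_to_kelvin(celsius):
    """Convert to Kelvin, an absolute scale whose zero is not arbitrary."""
    return celsius + 273.15

# Ratio scale: zero inches means no rain, so the ratio is meaningful.
rain_ratio = 2.0 / 1.0

# Interval scale: the difference between two Celsius readings is meaningful ...
temp_difference = 2.0 - 1.0
# ... but the naive ratio of the two Celsius readings is not.
naive_celsius_ratio = 2.0 / 1.0

# Re-expressed on the Kelvin scale, the two temperatures differ by well under 1%.
kelvin_ratio = celsius_to_kelvin(2.0) / celsius_to_kelvin(1.0)

print(rain_ratio, naive_celsius_ratio, round(kelvin_ratio, 4))
```

The difference of 1 degree survives the change of scale; the apparent factor of 2 does not.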
B. Continuous vs. Discrete Data
Continuous data can include any value (i.e. real numbers), e.g. 1, 1.43, and 1¾ are all acceptable values. A geographic example would be a distance measured between two points. Continuous data come from the interval or ratio scale.
Discrete data consist only of discrete values, and the numbers in between those values are not defined (i.e. whole or integer numbers), e.g. 1, 2, 3. The number of people who have malaria would be a discrete value.
C. Grouped vs. Individual Data
The distinction between individual and grouped data is somewhat self-explanatory, but the issue pertains to the effects of grouping data.
For example, in census data we might find a mean value of family income for some level of census geography (like a census tract or county).
While a family income value is collected for each household (individual data), for the purpose of analysis it is transformed into a set of classes (e.g. <$10K, $10-20K, >$20K).
C. Grouped vs. Individual Data
In grouped data, the raw individual data are categorized into several classes, and then analyzed.
Grouping means taking the central value of each class, together with the frequency of the class interval, and using those values to calculate a measure of central tendency (like our mean value for a census tract). This has the potential to introduce a significant distortion.
Grouping always reduces the amount of information contained in the data.
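A minimal sketch of the distortion, using the income classes mentioned earlier (<$10K, $10-20K, >$20K). The household incomes are invented, and the central value assigned to the open-ended top class ($25K) is an arbitrary assumption; open-ended classes always force a choice of this kind.

```python
from statistics import mean

# Hypothetical individual household incomes (dollars); the values are invented.
incomes = [4000, 7500, 9000, 12000, 13500, 19000, 22000, 35000, 60000, 150000]

# (lower bound, upper bound, central value). The central value of the
# open-ended top class is an assumption, since ">$20K" has no upper limit.
income_classes = [
    (0, 10000, 5000),
    (10000, 20000, 15000),
    (20000, float("inf"), 25000),
]

def grouped_mean(values, classes):
    """Mean estimated from class central values weighted by class frequency."""
    total = 0.0
    count = 0
    for lower, upper, central in classes:
        frequency = sum(1 for v in values if lower <= v < upper)
        total += central * frequency
        count += frequency
    return total / count

individual = mean(incomes)
grouped = grouped_mean(incomes, income_classes)
print(individual, grouped)
```

With these made-up numbers the grouped mean comes out at less than half of the individual mean, because the few very high incomes are all replaced by the same central value.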
Basic Issues in Data Collection
The reliability of measurements is a key consideration in data collection: is bias being introduced?
The validity of your data also needs to be considered: is your instrument measuring what it claims to be measuring?
What is the precision of your instrument? How exact is it in its measurements?
What is your instrument's level of accuracy? Is it calibrated properly, or is it introducing bias?
Precision and Accuracy
These related concepts are often confused:
Precision refers to the exactness associated with a measurement (i.e. repeated measurements are closely clustered).
Accuracy refers to the absence of systematic bias in the measurement process (i.e. measurements are centered on the true value).
[Figure: four panels of plotted points, labeled Precise & Accurate, Precise & Inaccurate, Imprecise & Accurate, and Imprecise & Inaccurate]
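One way to put numbers on the two ideas, offered as a sketch rather than a standard procedure: treat the mean error against a known true value as a measure of (in)accuracy, and the standard deviation of repeated readings as a measure of (im)precision. The readings and the true value below are invented.

```python
from statistics import mean, stdev

def bias_and_spread(measurements, true_value):
    """Return (bias, spread): mean error and sample standard deviation."""
    return mean(measurements) - true_value, stdev(measurements)

true_value = 10.0  # hypothetical known reference value

# Tightly clustered but offset from the truth: precise and inaccurate.
precise_inaccurate = [12.0, 12.1, 11.9, 12.0]
# Widely scattered but centered on the truth: imprecise and accurate.
imprecise_accurate = [8.0, 12.0, 9.0, 11.0]

bias_a, spread_a = bias_and_spread(precise_inaccurate, true_value)
bias_b, spread_b = bias_and_spread(imprecise_accurate, true_value)
print(round(bias_a, 3), round(spread_a, 3))
print(round(bias_b, 3), round(spread_b, 3))
```

A small spread with a large bias suggests a calibration problem; a large spread with little bias suggests an instrument that is unbiased but not very exact.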
Geographic Primitives
In geography applications, the observations which we are going to make use of are derived from some characteristics of a geographic feature which has been mapped:
Point features: These are features with only a location, no length or area. e.g. On campus, the following are well represented as point features: the Old Well, the flag pole, etc.
Line features: These are features with several locations strung out along the line in sequence, and are too narrow to represent their width. e.g. roads, rivers, etc.
Area features: These consist of one or more lines that form a loop. e.g. shorelines enclosing a lake.
Geographic Primitives
[Figure: a point, a line, and a polygon (area), each with its vertices labeled (x,y)]
A point: specified by a pair of (x,y) coordinates, representing a feature that is too small to have length and area.
A line: formed by joining two points, representing features too narrow to have areas.
A polygon (area): formed by joining multiple points that enclose an area.
Geographic Primitives - Points
Points are sometimes used to denote a location, and when used in the sense of a Euclidean point, they are 0-dimensional (having no width, length, area, etc.).
However, points are more often used in the sense of a centroid, approximating the center of something that in fact does have an extent, but can be adequately approximated as being 0-D.
e.g. the North Pole, the Old Well, the geographic center of the United States
Pond Branch Catchment Control Color Infrared Digital Orthophotography
Soil Moisture Sampling Method
25 samples taken using a random walk within a 5 meter diameter circle
ThetaProbe Soil Moisture Sensor: measures the impedance of the sensing rod array, which is a function of the soil's moisture content
[Figure: sample locations (+) within a circle labeled 5 meter diameter]
Geographic Primitives - Lines
Lines are primarily applied to the purpose of showing the length of a feature and the linkages between features.
The key sort of information we extract from linear features is distances along them of various sorts, although measures of their sinuosity are also sometimes of interest.
e.g. the length of a river, the distance between two cities, or the degree to which a river meanders
Transects & Segments
Geographic Primitives - Areas
Areas often provide the source of a measure of an attribute over a given area, including density values.
e.g. The levels of census geography (states, counties, census tracts, etc.) can give us a value like the number of motor vehicle accidents per 100,000 people in each state.
In NC, this value is in the range of 22-25, but what would this value be for local counties like Alamance, Chatham, and Orange?
To assume that this would be the same at different scales would be to fall victim to the ecological fallacy (related to the Modifiable Areal Unit Problem).
MODIS LULC In Climate Divisions Maryland CD6 North Carolina CD3
Geographic Primitives - Surfaces
Surfaces are different in the sense that they cannot be thought of as a single feature, and that they also represent the third dimension.
You can obtain information about altitude or volumes from a surface, as well as other quantities which can be derived from the shape of the surface.
e.g. a topographic map, from which any number of derived quantities can be obtained, such as slope and aspect, which in turn can provide drainage direction information, etc.
Pond Branch Catchment Control Topographic Index Example
Topographic Moisture Index
TMI = ln(a / tan β)
Hornberger, G.M., Raffensperger, J.P., Wiberg, P.L. and K.N. Eshleman. 1998. Elements of Physical Hydrology, Johns Hopkins Press, U.S.A., p. 210 & p. 216.
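The index is simple to compute once a and β are known for a location. The slide does not define the symbols; this sketch assumes the usual reading of the topographic index, in which a is the upslope contributing area per unit contour length and β is the local slope angle. The function name and the sample values are illustrative only.

```python
import math

def topographic_moisture_index(a, beta_degrees):
    """TMI = ln(a / tan(beta)).

    a            -- assumed: upslope contributing area per unit contour length
    beta_degrees -- assumed: local slope angle, in degrees (must be > 0)
    """
    beta = math.radians(beta_degrees)
    return math.log(a / math.tan(beta))

# Same contributing area, two different slopes (made-up values).
gentle = topographic_moisture_index(100.0, 5.0)
steep = topographic_moisture_index(100.0, 20.0)
print(gentle, steep)
```

For a fixed a, a gentler slope gives a larger index. A perfectly flat cell (β = 0) makes tan β zero, so the function as written raises an error there; real terrain workflows have to handle that case separately.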
Geographic Primitives - Distance
Distances can be calculated between points, along lines, or in a variety of fashions with areas.
Euclidean distance is calculated in a Cartesian frame of reference between two points $P_1(x_1, y_1)$ and $P_2(x_2, y_2)$:
$C = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
Over what distances on Earth is this valid? Why? Can we use this with latitude and longitude?
[Figure: points P1 and P2 joined by a straight line of length C]
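A minimal sketch of the Euclidean distance calculation, assuming both points are given as (x, y) pairs in the same planar (projected) coordinate system. The function name is illustrative.

```python
import math

def euclidean_distance(p1, p2):
    """Straight-line distance between two (x, y) points in a Cartesian plane."""
    x1, y1 = p1
    x2, y2 = p2
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

distance = euclidean_distance((0.0, 0.0), (3.0, 4.0))
print(distance)
```

The coordinates must share the same linear units (for example meters) for the result to be a distance in those units.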
Geographic Primitives - Distance
An alternative formulation for distance that is useful in urban environments with orthogonal road networks is Manhattan distance, which is still calculated in a Cartesian frame of reference between $P_1(x_1, y_1)$ and $P_2(x_2, y_2)$, but movement is limited to city streets:
$d_m = |x_1 - x_2| + |y_1 - y_2|$
As a reminder, the | | symbols denote absolute value.
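A minimal sketch of Manhattan distance, again assuming (x, y) pairs in a planar coordinate system whose axes line up with the street grid. The function name is illustrative.

```python
def manhattan_distance(p1, p2):
    """Distance between two (x, y) points when travel is limited to a grid."""
    x1, y1 = p1
    x2, y2 = p2
    return abs(x1 - x2) + abs(y1 - y2)

grid_distance = manhattan_distance((0.0, 0.0), (3.0, 4.0))
print(grid_distance)
```

For the same pair of points the straight-line distance is 5, so the grid route here is longer; Manhattan distance is never shorter than Euclidean distance.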
Geographic Primitives - Area
You will recall formulae for calculating the area of regular figures from geometry:
Rectangle with length l and width w: $a = l \times w$
Circle with radius r: $a = \pi r^2$
Vector GIS calculates the area of a polygon by summing rectangular and triangular areas.
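One common way to carry out that summation is the trapezoid (or "shoelace") method: each polygon edge defines a trapezoid between the edge and the x-axis, which is itself a rectangle plus a triangle, and the signed trapezoid areas are added up around the loop. This is a sketch of that general technique; it is an assumption that any particular GIS package works exactly this way.

```python
def polygon_area(vertices):
    """Area of a simple polygon from its (x, y) vertices, listed in order.

    Sums the signed area of the trapezoid under each edge. The sign of the
    total depends on whether vertices run clockwise or counterclockwise,
    so the absolute value is returned.
    """
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]  # wrap around to close the loop
        twice_area += (x2 - x1) * (y1 + y2)
    return abs(twice_area) / 2.0

rectangle = [(0, 0), (4, 0), (4, 3), (0, 3)]
triangle = [(0, 0), (4, 0), (0, 3)]
print(polygon_area(rectangle), polygon_area(triangle))
```

The rectangle result can be checked against a = l × w. The method assumes the polygon boundary does not cross itself and that coordinates are planar.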
Geographic Primitives - Shape
You are likely less familiar with indices of shape.
An example of such a value describes the extent to which a shape is compact vs. elongated, an index of compactness that measures the deviation of a shape from circular:
S = d / l, where l is the length of the longest diagonal that spans the shape
[Figure: example shapes with S = 1, S ~ 0.5, and S = 0]
Geographic Primitives - Density
Density is the concentration of a given attribute over an area, and can be formulated in any number of ways:
e.g. points per area, as in my spot height densities
e.g. length per area, as in my transect densities
If you can make a count of something per area, you can create a density measure for that quantity.
Sources of Digital Elevation Data
Catchment              Area (ha)  Data Source      Number of Points  Points per m2
Pond Branch (control)  37.55      Photogrammetric  6569              0.017
Pond Branch (control)  37.55      LIDAR            273228            0.727
Glyndon (urbanizing)   81.05      Photogrammetric  39687             0.049
Glyndon (urbanizing)   81.05      LIDAR            437759            0.540
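The points-per-m2 column is a density measure of the "count per area" kind, and can be reproduced from the other columns. The sketch below recomputes it from the tabulated point counts and catchment areas, converting hectares to square meters (1 ha = 10,000 m2). The function and variable names are illustrative.

```python
SQUARE_METERS_PER_HECTARE = 10_000

def point_density(num_points, area_ha):
    """Points per square meter for a catchment whose area is given in hectares."""
    return num_points / (area_ha * SQUARE_METERS_PER_HECTARE)

# (catchment, data source, number of points, area in ha), taken from the table.
elevation_sources = [
    ("Pond Branch", "Photogrammetric", 6569, 37.55),
    ("Pond Branch", "LIDAR", 273228, 37.55),
    ("Glyndon", "Photogrammetric", 39687, 81.05),
    ("Glyndon", "LIDAR", 437759, 81.05),
]

densities = [point_density(n, area) for _, _, n, area in elevation_sources]
for (catchment, source, _, _), density in zip(elevation_sources, densities):
    print(catchment, source, density)
```

The recomputed densities agree with the tabulated values to within about 0.001 points per m2; small differences of that size are what you would expect if the published areas or densities were rounded.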
Upper Baismans Run Sampling
[Map: Upper Baismans Run Samples 1, 2, and 3, with a scale bar in kilometers and a north arrow]
3 samples, 100 meters/ha, 100 meter long transects