GEOGRAPHY 204: STATISTICAL PROBLEM SOLVING IN GEOGRAPHY

CHAPTER 2: GEOGRAPHIC DATA Primary data - acquired directly from the original source. In situ, or in the field. Costly in time and money! Ex. campus bike racks.

GEOGRAPHIC DATA Secondary (or archival) data - collected by an organization or government agency. Processed/organized: accessible and formatted. Less time and expense. Often comprehensive (census). Ex. US Census data, Maryland land use, USGS National Hydrography Dataset. Potential issues/errors: improperly collected/summarized; out of date as soon as it's measured!

Ponds in Wicomico County: three primary sources. Differences... hmm.

PRIMARY DATA COLLECTION Sampling is often necessary. Methods: direct observation (e.g., traffic counts), field measurement, mail questionnaires, personal/telephone interviews. Survey design is very important: pilot tests, question interpretation/wording, erroneous responses, logistics, absences, refusals.

GEOGRAPHIC STUDIES - TYPES Explicitly spatial - locations or placement of the observations or units of data are themselves directly analyzed. Spatial statistics: examine patterns for randomness. Ex. diseased trees in a national forest, farms in a watershed: clustered or randomly distributed? Implicitly spatial - observations or units of data represent locations or places, but the locations themselves are not directly analyzed. Ex. relationship between house values and age: neighborhood, not individual locational pattern.

Map legend: Blue: wet counties. Red: dry counties. Yellow: restrictions (partially dry, municipalities within dry).

GEOGRAPHIC STUDIES Individual-level data sets - each data value represents an individual element of the phenomenon under study. Ex. tree circumference, SU student interview about parking, random sample of Nigerian women seeking information about fertility. Spatially aggregated data sets - each value represents a summary or spatial aggregation of individual units of information for a particular place or area. Ex. Maryland median income by county, alcohol laws by county, birth rate estimates for all administrative divisions in Nigeria to estimate nationwide fertility. Ecological fallacy - invalid transfer of conclusions from spatially aggregated analysis to smaller areas or the individual level (transferring results or applying conclusions down). Ex. in a Nigerian district with low birth rates, some women in the district may not use birth control for family planning. Aggregating individual-level data to a larger spatial unit is generally NOT problematic! It is often used as the method to determine higher-level spatial estimates.
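
A minimal sketch of aggregating upward and of why conclusions should not be transferred back down. The household records and county names are hypothetical, invented only for illustration; `median_by_area` is a helper name introduced here.

```python
from collections import defaultdict
from statistics import median

# Hypothetical individual-level records: (county, household income in dollars).
households = [
    ("County A", 32000), ("County A", 41000), ("County A", 150000),
    ("County B", 55000), ("County B", 58000), ("County B", 61000),
]

def median_by_area(records):
    """Aggregate individual values to one summary value per area."""
    groups = defaultdict(list)
    for area, value in records:
        groups[area].append(value)
    return {area: median(values) for area, values in groups.items()}

# Aggregating up: individual values -> one median per county.
county_median = median_by_area(households)

# Transferring down would be the ecological fallacy: County B has the
# higher median, yet the single richest household is in County A.
richest_county = max(households, key=lambda record: record[1])[0]

print(county_median)
print(richest_county)
```

The county medians are valid summaries of each county, but they say nothing certain about any one household inside it.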

DATASET VARIABLE CHARACTERIZATION Discrete variable - some restriction is placed on the values the variable can assume. Results from counting or tabulating the number of items (whole integers). Ex. number of households, number of active volcanoes. Continuous variable - an infinitely large number of possible values along some interval of the real number line. Results from measurement; values expressed as decimals. Ex. precipitation, area in forest, commute distance/time. Importance: probability distributions change. Coming soon!

DATASET VARIABLE CHARACTERIZATION Quantitative data - observations or responses expressed numerically. Units of data are assigned numeric values. Qualitative data - each observation is assigned to one or more categories. Ex. type of land cover (agriculture, forest, residential, commercial, etc.), primary cash crop. Frequency counts - number of observations assigned to non-numerical categories.

[Figure: bar chart of Frequency by Category (Aesthetic, Agriculture, Borrow Pit, Impoundment, Stormwater, Other)]

[Figure: histogram "Small Water Bodies"; x-axis: Area (m2); y-axis: Count]

OBJECTID  Cat_1        Area_ha    Area_acre  DataSource
1         Agriculture  0.0502111  0.1240741  USGS
2         Extractive   0.105233   0.2600361  USGS
4         Extractive   0.9638432  2.3817065  USGS
5         Stormwater   0.2240009  0.5535177  USGS
7         Extractive   0.0817403  0.2019845  USGS
8         Extractive   0.0264723  0.0654144  USGS

LEVELS OF MEASUREMENT Measurement levels of data inform the selection of the appropriate statistical technique

NOMINAL Each variable is given a name and its values are assigned to one of at least two qualitative classes or categories. The only relationship between categories is that they are different. Simplest scale, non-numerical. Condition: categories must be exhaustive and mutually exclusive. Exhaustive - every value or unit of data can be assigned to a category (an "Other" category can be used). Mutually exclusive - a value cannot be assigned to more than one category; no overlap.

NOMINAL DATA: EXAMPLES Individuals - religious affiliation. Dichotomous - only two options: gender, yes-no, presence-absence. Cities - primary industry. Countries - language family.

OBJECTID  Cat_1        DataSource
1         Agriculture  USGS
2         Extractive   USGS
4         Extractive   USGS
5         Stormwater   USGS
7         Extractive   USGS
8         Extractive   USGS

ORDINAL SCALE Quantitative distinctions can be made. Rank order: greater than or less than. Strongly-ordered: each value or unit of data is given a particular position in a rank-order sequence. Ex. 10 best college towns, countries' GNP: each is assigned a preference rank (1 to 10). Remember: the 2nd-ranked town is not twice as good as the 4th-ranked town.

ORDINAL SCALE Weakly-ordered: values are placed in categories and the resulting categories are ranked. Ex. choropleth map of % population change for US counties between 2000 and 2009, with 5 categories (< 0%, 0 to 10%, 10.1% to 15%, 15.1% to 25%, greater than 25%). Generate frequency counts for the number of counties in each category. Weakly ordered: two counties may have different values but fall in the same category.
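
A short sketch of the frequency-count step for weakly-ordered data, using the five categories from the slide. The county names and percent-change values are hypothetical; treating each upper limit as inclusive is an assumption about how the slide's categories are meant to be read.

```python
from collections import Counter

# The five ranked categories from the slide, lowest to highest.
LABELS = ["< 0%", "0 to 10%", "10.1% to 15%", "15.1% to 25%", "> 25%"]

def category(pct_change):
    """Place one percent-change value into its ranked category.

    Assumption: upper limits are inclusive, and values are reported to
    one decimal place, as the slide's class limits imply.
    """
    if pct_change < 0:
        return LABELS[0]
    if pct_change <= 10:
        return LABELS[1]
    if pct_change <= 15:
        return LABELS[2]
    if pct_change <= 25:
        return LABELS[3]
    return LABELS[4]

# Hypothetical % population change, 2000 to 2009.
pct_change = {
    "County 1": -3.2, "County 2": 4.1, "County 3": 9.8,
    "County 4": 12.5, "County 5": 31.0,
}

counts = Counter(category(value) for value in pct_change.values())
for label in LABELS:
    print(label, counts[label])
```

County 2 and County 3 have different values (4.1 and 9.8) but land in the same category, which is what makes the result weakly ordered.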

Best College Towns (American Institute for Economic Research):
9. Charlottesville, VA
8. Blacksburg, VA
7. Champaign, IL
6. Corvallis, OR
5. Iowa City, IA
4. Crestview, FL
3. State College, PA
2. Ames, IA
1. Ithaca, NY
Often based on composite variables.

INTERVAL AND RATIO SCALES Magnitude of differences between values can be determined: the length of the interval between any two units of data can be measured on a scale. Interval scale - origin or zero starting point is arbitrary. Ex. Fahrenheit and Celsius temperature scales. Ratio scale - a natural zero is used, so ratios between values can be determined. Ex. rainfall: 40 inches in Montreal, 10 inches in Chihuahua; 40/10 = 4, four times as much rainfall. Examples: Kelvin, distance, area, median income, etc.
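
A small sketch of why ratios are meaningful on a ratio scale but not on an interval scale. The rainfall figures are the slide's; the two temperatures (20 and 10 degrees Celsius) are hypothetical values chosen for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert from the interval Celsius scale to the ratio Kelvin scale."""
    return celsius + 273.15

# Ratio scale: rainfall has a natural zero, so the ratio is meaningful.
rain_ratio = 40 / 10  # Montreal vs. Chihuahua, in inches

# Interval scale: the Celsius zero is arbitrary, so this ratio is NOT
# meaningful. 20 degrees C is not "twice as warm" as 10 degrees C.
naive_ratio = 20 / 10

# The same two temperatures on the Kelvin scale (natural zero).
kelvin_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

print(rain_ratio)              # 4.0
print(naive_ratio)             # 2.0
print(round(kelvin_ratio, 3))  # about 1.035
```

Differences remain valid on both scales: the two temperatures are 10 degrees apart whether measured in Celsius or Kelvin.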

MEASUREMENT SCALE Observations from the same variable can be expressed at different measurement scales depending on how they are measured, organized, and displayed. Ex. a resource planner studying type of energy use in homes. Individual households - nominal: primary type of energy (coal, gas, oil, wood, etc.). County-level summaries - strongly-ordered ordinal: counties ranked by % of households using natural gas; weakly-ordered ordinal: choropleth map of % of households using natural gas; ratio scale: number of households by county using natural gas.

MEASUREMENT CONCEPTS Measurement error: precision and accuracy. Precision: level of exactness associated with a measurement. Ex. rain gauge tipping-bucket calibration, every 0.10 inch vs. 0.01 inch: 1.2 to 1.3 inches vs. 1.21 to 1.22 inches. Spurious precision: a computer/calculator produces many decimal places... real? meaningful? 5.2/3 = 1.7333333333333333333333
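
A brief sketch of spurious precision using the slide's 5.2/3 example. Reporting the result to the same number of decimal places as the measured input (one, here) is a common convention and is the assumption made in this example, not a rule stated on the slide.

```python
measured_total = 5.2   # measured to the nearest tenth
n_parts = 3

raw = measured_total / n_parts   # the computer returns many decimal places
reported = round(raw, 1)         # keep only the precision that was measured

print(raw)
print(f"{reported:.1f}")
```

The extra digits in `raw` come from the arithmetic, not from the instrument, so they carry no real information about the quantity measured.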

MEASUREMENT CONCEPTS Accuracy: the extent of system-wide bias in the measurement process. Ex. rain gauge: a precise instrument, calibrated badly; 1.19 inches recorded, 1.27 inches actually fell. How do you know? Difficult.

MEASUREMENT CONCEPTS Validity: measurement issues related to the nature, meaning, or definition of a concept or variable. Assigning true or appropriate meaning to concepts through measurement of a simple variable or set of variables. Complex concepts: ex. quality of life, economic well-being. Operational definitions: capturing the true meaning is not possible, so an indirect or surrogate method is used to best define the complex concept. Ex. quality of education: evaluate by average student score on the California Achievement Exam in elementary school, or by percent of graduates who subsequently go to college in high school. Degree of validity is difficult to assess and often ignored. In good research it MUST be addressed!

MEASUREMENT CONCEPTS Reliability: consistency and stability of a measure/data. Geographic data: temporally and spatially varying. Consistent data collection methods? Ex. water quality: same depth, time since rainfall event, etc. Consistent classification/categorization methods? Ex. poverty: same definition in 2010, 2000, 1970? Problematic: developing countries. Assess reliability: test-retest procedure. Behavioral geography: survey or questionnaire. Collect data from respondents twice!

BASIC CLASSIFICATION METHODS Why and how do we classify or categorize data? Classification organizes, simplifies, and generalizes large amounts of information into effective or meaningful categories. It clarifies communication and reveals spatial patterns. Data are organized according to degree of similarity: minimize within-group dispersion and maximize between-group differences. Categories must be mutually exclusive and exhaustive!

BASIC CLASSIFICATION METHODS Result: information is lost through generalization and simplification. Individual-level values are aggregated into spatial units and classes.

CLASSIFICATION Conceptual strategies. Subdivision (logical subdivision): all units of data in a population are grouped together, and then individual values are allocated to an appropriate subdivision using carefully defined criteria. A clear, consistent set of rules is used to assign values to the proper class. Top-down, hierarchical approach. Characteristics of each category are pre-determined.

CLASSIFICATION: LOGICAL SUBDIVISION Ex. USGS National Land Cover Dataset (NLCD): Landsat-based, 30 m pixels, Levels I and II.


CLASSIFICATION: LOGICAL SUBDIVISION Ex. North American Industry Classification System (NAICS)

CLASSIFICATION Agglomeration: each observation in a population or data set starts out separate and distinct from the others. Examine each value and allocate it to classes using well-defined grouping criteria. Combine like, separate unlike. Bottom-up approach. Frequently used in geography: values are numerically or graphically aggregated.

CLASSIFICATION: AGGLOMERATION

CLASSIFICATION: OPERATIONAL PROCEDURES Practical application: often mixture of subdivision and agglomeration

SINGLE-VARIABLE CLASSIFICATION METHODS Equal intervals based on range. Range: difference in magnitude between the largest and smallest value in an interval-ratio data set. Class breaks: the values that separate one class from another. Procedure: the range is divided into the desired number of equal-width class intervals. Ex. High=1856, Low=213, Range=1643, 4 classes (width 1643/4 = 410.75). Classes: 213-623.75, 623.76-1034.5, 1034.51-1445.25, 1445.26-1856. Considerations: based on extreme values, break values are precise (not rounded), unequal number of values in each category.
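
The equal-interval procedure can be sketched as below, using the slide's High=1856, Low=213, and 4 classes. The function names are introduced here for illustration, and treating each class's upper limit as inclusive is an assumption that matches the class listing above.

```python
from bisect import bisect_left

def equal_interval_edges(low, high, n_classes):
    """Return the n_classes + 1 class limits for equal-width classes."""
    width = (high - low) / n_classes
    return [low + i * width for i in range(n_classes + 1)]

def class_index(value, edges):
    """Return the 0-based class of a value; upper limits are inclusive."""
    interior_breaks = edges[1:-1]
    return bisect_left(interior_breaks, value)

edges = equal_interval_edges(213, 1856, 4)
print(edges)                       # [213.0, 623.75, 1034.5, 1445.25, 1856.0]
print(class_index(623.75, edges))  # 0: still in the first class
print(class_index(623.76, edges))  # 1: first value of the second class
```

Because the limits depend only on the two extreme values, a single outlier can stretch every class and leave most values crowded into one of them.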

SINGLE-VARIABLE CLASSIFICATION METHODS Equal intervals not based on range. Same equal-interval class breaks, but based on practical/convenient values rather than the range; often rounded. Preferred for constructing a frequency distribution, histogram, or ogive (graphical representations). Frequently used by government agencies. Ex. High=1856, Low=213, Range=1643, 4 classes (410.75). Classes: 213 to 623.9, 624 to 1034.9, 1035 to 1445.9, 1446 to 1856.9. Considerations: easy to understand and interpret; number of values in each category varies widely.

SINGLE-VARIABLE CLASSIFICATION METHODS Quantile. The total number of values is divided as equally as possible into the desired number of classes, equalizing the number of values in each class. Quartiles (four classes), quintiles (five classes). Considerations: in choropleth mapping, produces an even distribution of areas among classes on the map; class breaks are not rounded and interval widths are uneven; clustered data may be split unnaturally.
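
A minimal sketch of quantile classification. The ten data values are hypothetical, and giving any leftover values to the lowest classes is one possible convention assumed here; the slide only says "as equally as possible".

```python
def quantile_classes(values, n_classes):
    """Split sorted values into n_classes groups of near-equal size.

    Assumption: when the count does not divide evenly, the extra
    values go to the lowest classes.
    """
    ordered = sorted(values)
    base, extra = divmod(len(ordered), n_classes)
    classes, start = [], 0
    for i in range(n_classes):
        size = base + (1 if i < extra else 0)
        classes.append(ordered[start:start + size])
        start += size
    return classes

# Hypothetical data set with a tight cluster and a few extreme values.
data = [213, 250, 260, 300, 310, 320, 330, 900, 1800, 1856]

quartiles = quantile_classes(data, 4)
for group in quartiles:
    print(group)
```

The class sizes come out as 3, 3, 2, 2, but 330 is pulled away from its neighbors and grouped with 900: an example of clustered data being split unnaturally.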

SINGLE-VARIABLE CLASSIFICATION METHODS Natural breaks. Single linkage: identify natural breaks in the data and separate values into different classes based on these breaks. Iterative: identify the largest gap between values on the number line, then the second largest gap, and so on until the desired number of class breaks is achieved. Groups similar values and highlights extreme values. Can cluster a large number of values in one or two categories.
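
The largest-gap procedure described above can be sketched as follows. This is the simple gap-based approach from the slide, not the Jenks optimization that many GIS packages label "natural breaks"; the data values and the function name are hypothetical.

```python
def largest_gap_classes(values, n_classes):
    """Split sorted values at the (n_classes - 1) largest gaps."""
    ordered = sorted(values)
    # Each gap is stored with the index of the value on its left side.
    gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
    largest = sorted(gaps, reverse=True)[:n_classes - 1]
    cut_after = sorted(index for _, index in largest)

    classes, start = [], 0
    for index in cut_after:
        classes.append(ordered[start:index + 1])
        start = index + 1
    classes.append(ordered[start:])
    return classes

# Hypothetical data set with a tight cluster and a few extreme values.
data = [213, 250, 260, 300, 310, 320, 330, 900, 1800, 1856]

natural = largest_gap_classes(data, 4)
for group in natural:
    print(group)
```

With this data, seven of the ten values fall into one class while three extreme values each get a class of their own, which shows both considerations listed on the slide.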

CHOROPLETH MAPPING Ideal number of classes? Trade-off between generalization and sufficient detail. 4 to 7 classes tend to be effective. 5 classes in the following examples (4 class breaks).

Dendrogram - graphically depicts the step-by-step, single-linkage natural breaks process. Outliers - extreme values in the data set; they adversely affect the natural breaks classification method.

CLASSIFICATION RESULTS All portray the same data differently! Starkest contrasts? Considerations: do three states really have an obesity rate of exactly 24.6%?