Sampling Populations
Typically, when we collect data, we are somewhat limited in the scope of what information we can reasonably collect.
Ideally, we would enumerate each and every member of a population so we could know its parameters perfectly.
In most cases this is not possible, because of the size of the population (infinite populations?) and associated costs (time, money, etc.).
Usually it is not necessary, because by collecting data on an appropriate subset of the population we can create statistics that are adequate estimates of population parameters.
Instead, we sample a population, trying to get information about a representative subset of the population.
Sampling Concepts
We must define the sampling unit - the smallest subdivision of the population that becomes part of our sample.
We want to minimize sampling error when we design how we will collect data: typically the sampling error decreases as the sample size increases, because larger samples make up a larger proportion of the population (and a complete census, for example, theoretically has no sampling error).
We want to try to avoid sampling bias when we design how we will collect data: bias here refers to a systematic tendency in the selection of members of a population to be included in a sample. To avoid it, any given member of a population should have an equal chance of being included in the sample (for random sampling).
Probability Sampling Designs - Random
Random sampling - In general, we need some degree of randomness in the selection of a sample to be able to draw any meaningful inferences about a population, but in some cases this may conflict with representativeness.
Random samples are drawn in such a way that every unit of a population has an equal chance of being chosen and the selection of one unit has no impact on whether or not another unit will be selected (independence).
This can be done with or without replacement (which determines whether the same unit can be drawn twice).
We can generate random numbers using a table of random numerals, or using a computer, and we can scale them to any required range of values.
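A minimal sketch of simple random sampling, assuming Python's standard random module. The frame of 200 unit IDs, the function names, and the 500-750 range are hypothetical and only serve to illustrate drawing with or without replacement and scaling a random number to a required range:

```python
import random

def simple_random_sample(units, n, replace=False, seed=None):
    """Draw n units, each with an equal chance of selection."""
    rng = random.Random(seed)
    if replace:
        # With replacement: the same unit can be drawn more than once.
        return [rng.choice(units) for _ in range(n)]
    # Without replacement: each unit appears at most once.
    return rng.sample(units, n)

def scaled_random(low, high, rng=random):
    """Scale a uniform random number from the unit interval to low..high."""
    return low + (high - low) * rng.random()

population = list(range(1, 201))  # hypothetical frame of 200 unit IDs
print(simple_random_sample(population, 10, seed=1))
print(simple_random_sample(population, 10, replace=True, seed=1))
print(scaled_random(500.0, 750.0))
```

Fixing the seed makes a draw repeatable, which helps when a sample design has to be documented.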
Transect Placement
Software selects a random starting position for each transect, applying criteria.
Software assigns a random direction to each transect.
Probability Sampling Designs - Systematic
Representative approaches place restrictions on selection:
Systematic sampling - This approach uses every kth element of the sampling frame, beginning at a randomly chosen point in the frame, e.g. given a sampling frame of size N = 200, to create a sample of size n = 10 from such a frame, select a random point to begin within the frame and then include every 20th value in the systematic sample.
This approach assumes that the assignment of the individuals in the sampling frame is random (i.e. they have not been placed in the frame in some order or grouping), and this should be checked before systematically sampling from a frame.
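The 200-unit example can be sketched as follows. This is an illustration under assumptions rather than a prescribed procedure: the function name is hypothetical, the interval is taken as k = N // n, and the random start is restricted to the first interval so that the sample never runs off the end of the frame.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Take every k-th element of the frame after a random start."""
    k = len(frame) // n  # sampling interval
    if k < 1:
        raise ValueError("sample size exceeds frame size")
    # Assumption: the random start falls within the first interval.
    start = random.Random(seed).randrange(k)
    return frame[start::k][:n]

frame = list(range(1, 201))  # hypothetical sampling frame, N = 200
sample = systematic_sample(frame, 10, seed=42)
print(sample)
```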
Probability Sampling Designs - Systematic
Some problems with systematic sampling:
The possible values of sample size n are somewhat restricted by the size of the sampling frame, since the interval should divide evenly into the size of the sampling frame.
If the population itself exhibits some periodicity, then a systematic sample is likely to not be representative.
In geographic applications, this could be applied in 2 dimensions in (x,y) space, with spacings Δx and Δy (which are not necessarily the same) specifying a systematic grid, but the sample size is still restricted by the extent of the study area (since the grid must fit evenly).
Probability Sampling Designs - Stratified
We may need to place restrictions on how we select units for inclusion in a sample to ensure a representative sample.
Stratified sampling - Divide the population into categories and select a random sample from each of these.
This approach can be used to decrease the likelihood of an unrepresentative sample if the classes/categories/strata are selected carefully (the individuals within a stratum must be very much alike, which means that the population must be able to be divided into relatively homogeneous groups).
We need to know something about the population in order to make good decisions about stratification.
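A minimal sketch of dividing a population into strata and drawing a random sample from each. The land-cover labels, unit counts, and function name are hypothetical, and drawing the same fraction from every stratum is just one possible allocation chosen for illustration:

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Group units by stratum, then draw a random sample within each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = {}
    for name, members in strata.items():
        # Assumption: same sampling fraction everywhere, at least one unit each.
        n = max(1, round(fraction * len(members)))
        sample[name] = rng.sample(members, n)
    return sample

# Hypothetical population of (land cover, unit id) pairs.
units = ([("woody", i) for i in range(60)]
         + [("herbaceous", i) for i in range(30)]
         + [("pavement", i) for i in range(10)])
picked = stratified_sample(units, lambda u: u[0], 0.1, seed=5)
for name, members in picked.items():
    print(name, len(members))
```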
Probability Sampling Designs - Stratified
We can take a stratified sample that is:
Proportional - where the random sample drawn from each class/category/stratum is the same fraction of that stratum, so that the sample sizes follow the sizes of the sub-populations, OR
Disproportional - where the sampling fraction differs among the classes/categories/strata, so that some sub-populations are sampled more or less heavily than their size alone would suggest.
The disproportional approach is best used when the sizes of the categories are significantly different, although it can also be applied to mitigate cost issues (i.e. it may be more costly to sample in a swamp than in a grassy field, so we might choose to take fewer samples in the swamp, although this clearly would do nothing to enhance representativeness in our sample).
Pond Branch Catchment Control Color Infrared Digital Orthophotography
Pond Branch Catchment Stratified TMI Sampling
[Figure: Pond Branch TMI Histogram (x-axis: Topographic Moisture Index; y-axis: Percent of cells in catchment) and TMI Values at Soil Moisture Sampling Locations using 11.25m PG DEM (x-axis: Topographic Moisture Index; series: Pond Branch, Glyndon)]
Probability Sampling Designs - Stratified
WARNING: A class/category/stratum that is homogeneous with respect to one variable may have high variation with respect to another variable!
Thus, stratification must be performed with some foreknowledge of how the sample will be analyzed, and if the sampling is being performed in a preliminary fashion (still seeking the relationships), there is a danger that the stratification will be found to be inappropriate after the fact.
E.g. my soils sampling may have been stratified with respect to TMI, but if I want to check if upstream land use is a factor in Glyndon, I may find my samples are not representatively distributed with respect to land use.
Random Spatial Sampling
We can choose a random point in (x,y) space by choosing pairs of random numbers. This produces a Poisson distribution of counts if we divide the area into quadrats and count the points in each.
This is easy with rectangular study areas; otherwise we also need to reject any points outside the study area (e.g. my method for selecting the beginning of a transect).
We can also produce stratified and systematic point samples by dividing the area into a group of mutually exclusive and collectively exhaustive strata.
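Choosing random (x,y) pairs and rejecting points outside a non-rectangular study area can be sketched as below. This is an illustration only, not the transect-placement method referred to on the slide: the triangular study area and function names are hypothetical, and the inside test is a standard ray-casting check.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray to the right of (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def random_points(polygon, n, seed=None):
    """Draw points in the bounding box, keeping only those inside the polygon."""
    rng = random.Random(seed)
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    points = []
    while len(points) < n:
        x = rng.uniform(min(xs), max(xs))
        y = rng.uniform(min(ys), max(ys))
        if point_in_polygon(x, y, polygon):  # reject points outside the area
            points.append((x, y))
    return points

triangle = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # hypothetical study area
pts = random_points(triangle, 5, seed=7)
print(pts)
```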
Data Portrayal
Once we have sampled some geographic phenomenon, it is often useful to portray it in some fashion that allows you to get a sense of the values in the dataset.
Many portrayal approaches still involve reducing the volume of data (and information content), but if applied properly, they can help you see the interesting characteristics of data.
For the various scales of measurement, there are different approaches that are applicable.
Scales of Measurement
Thematic data can be divided into four types:
1. The Nominal Scale
2. The Ordinal Scale
3. The Interval Scale
4. The Ratio Scale
As we progress through these scales, the types of data they describe have increasing information content.
Nominal Data
From one of my dissertation transect samples, the set of types of segments are nominal data:
Class        Frequency   % of Total
Woody              105        32.92
Herbaceous         151        47.34
Water                1         0.31
Ground               6         1.88
Road                23         7.21
Pavement            22         6.90
Structures          11         3.45
The % of Total column normalizes the data, expressing it relative to the total (some caveats here).
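The normalization can be checked with a short sketch that divides each class frequency by the total number of segments. The counts are the ones given on the slide; the variable names are only illustrative.

```python
# Segment counts by class, as given on the slide.
counts = {
    "Woody": 105, "Herbaceous": 151, "Water": 1, "Ground": 6,
    "Road": 23, "Pavement": 22, "Structures": 11,
}

total = sum(counts.values())
# Express each frequency as a percentage of the total.
percent = {name: 100.0 * freq / total for name, freq in counts.items()}

for name, freq in counts.items():
    print(f"{name:<12}{freq:>5}{percent[name]:>8.2f}")
```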
Nominal Data
Class        Frequency   % of Total
Woody              105        32.92
Herbaceous         151        47.34
Water                1         0.31
Ground               6         1.88
Road                23         7.21
Pavement            22         6.90
Structures          11         3.45
This tabular presentation of the data has the advantage of giving the exact quantities, but can be busy, especially in larger tables.
Nominal Data
The frequency of nominal data classes can be well displayed by a bar graph.
[Figure: bar graph "Segment Type Frequency"; x-axis: Segment Types (Woody, Herbaceous, Water, Ground, Road, Pavement, Structures); y-axis: Frequency]
Nominal Data
Once normalized, the values are well displayed in a pie chart, which emphasizes each category's portion of the whole.
[Figure: pie chart "Segment Types", % of Total: Herbaceous 48%, Woody 33%, Road 7%, Pavement 7%, Structures 3%, Ground 2%, Water 0%]
Ordinal, Interval, & Ratio Data From my dissertation, the set of all topographic moisture index values drawn from a raster data layer is an example of an interval dataset:
Ordinal, Interval, & Ratio Data
Pond Branch is a 37.55 hectare watershed, which is equivalent to 375,500 m² (1 hectare = 10,000 m²).
Using 11.25m x 11.25m pixels (126.5625 m²), there are ~2966 pixels from which we can draw TMI values.
Ordinal, Interval, & Ratio Data
It would clearly be impractical to try to get a sense of the distribution of TMI values in Pond Branch by looking at a table of 2966 values.
We need a data reduction approach by which we can reduce the number of values to a manageable amount, which in turn lends itself to some sort of graphical display.
For ordinal, interval, and ratio scale data, we can make use of histograms for this purpose, and building a histogram involves following a multistep procedure.
Building a Histogram
1. Develop an ungrouped frequency table
That is, we build a table that counts the number of occurrences of each variable value from lowest to highest:
TMI Value   Ungrouped Freq.
4.16        2
4.17        4
4.18        0
...         ...
13.71       1
We could attempt to construct a bar chart from this table, but it would have too many bars to really be useful.
Building a Histogram
2. Construct a grouped frequency table
This table has classes of values (in a sense we are reducing our data back to the ordinal scale for display purposes).
The decision on how to perform the grouping is a subjective one, but there are some common guidelines:
Use class intervals with simple bounds and a common width (i.e. categories have the same range).
Adjacent intervals should not overlap (each datum should fit into one class).
Building a Histogram
3. Select an appropriate number of classes
There are formulae available to make this decision objectively, but in reality it is a somewhat subjective decision.
If you have more observations, you usually need more classes, because when you put observations together in a class, you are considering them to have the same value for display purposes. There is a trade-off here between simplicity and loss of information (e.g. Pond Branch TMI - 2966 observations grouped into 10 classes).
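One widely used formula of this kind is Sturges' rule, k = ceil(log2(n)) + 1. It is shown here only as an example of such a formula; the slides do not say which formulae are meant. For the 2966 Pond Branch observations it suggests 13 classes, close to but not the same as the 10 classes used here, which illustrates that the final choice remains a judgment call.

```python
import math

def sturges_classes(n_observations):
    """Sturges' rule: suggested number of classes for n observations."""
    return math.ceil(math.log2(n_observations)) + 1

print(sturges_classes(2966))  # 13
```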
Building a Histogram
3. Select an appropriate number of classes cont.
Class         Frequency
4.00-4.99           120
5.00-5.99           807
6.00-6.99          1411
7.00-7.99           407
8.00-8.99            87
9.00-9.99            33
10.00-10.99          17
11.00-11.99          22
12.00-12.99          43
13.00-13.99          19
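A grouped frequency table like this one can be built by assigning each value to a class of common width. The sketch below uses half-open classes [lower, upper) so that adjacent intervals do not overlap. The ten sample values are made up for illustration (the full set of 2966 TMI values is not reproduced in the slides), and the function name is hypothetical.

```python
import math
from collections import Counter

def grouped_frequencies(values, lower, width, n_classes):
    """Count values in n_classes equal-width classes starting at lower."""
    counts = Counter()
    for v in values:
        idx = math.floor((v - lower) / width)  # class index for this value
        if 0 <= idx < n_classes:
            counts[idx] += 1
    return [(lower + i * width, lower + (i + 1) * width, counts[i])
            for i in range(n_classes)]

# Hypothetical TMI-like values, for illustration only.
values = [4.16, 4.16, 4.17, 5.02, 5.90, 6.33, 6.50, 6.99, 7.00, 13.71]
table = grouped_frequencies(values, lower=4.0, width=1.0, n_classes=10)
for low, high, freq in table:
    print(f"{low:5.2f} to <{high:5.2f}  {freq}")
```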
Building a Histogram
4. Plot the frequencies of each class
All that remains is to create the plot:
[Figure: Pond Branch TMI Histogram; x-axis: Topographic Moisture Index; y-axis: Percent of cells in catchment]
Frequencies & Distributions
A histogram is one way to depict a frequency distribution.
A loose definition of a frequency: the number of times a variable takes on a particular value (note that any variable has a frequency distribution).
E.g. roll a pair of dice several times and record the resulting values (constrained to being between 2 and 12), counting the number of times any given value occurs (the frequency of that value occurring), and take these all together to form a frequency distribution.
Frequencies & Distributions
Frequencies can be absolute (when the frequency provided is the actual count of the occurrences of that particular value) or they can be relative (when they are normalized by dividing the absolute frequency by the total number of observations to yield a relative frequency between 0 and 1).
Relative frequencies are particularly useful if you want to compare distributions drawn from two different sources, i.e. while the numbers of observations from each source may be different, by normalizing them, they can be reasonably compared.
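The dice example gives a simple way to see why relative frequencies help when comparing two sources of different size. The sketch below simulates two sets of rolls; the roll counts (100 and 10,000), seeds, and function names are arbitrary choices for illustration.

```python
import random
from collections import Counter

def roll_totals(n_rolls, seed=None):
    """Absolute frequencies of the total shown by a pair of dice."""
    rng = random.Random(seed)
    return Counter(rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls))

def relative(freq):
    """Divide each absolute frequency by the total number of observations."""
    total = sum(freq.values())
    return {value: count / total for value, count in freq.items()}

small = roll_totals(100, seed=1)     # hypothetical small source
large = roll_totals(10000, seed=2)   # hypothetical large source
rel_small, rel_large = relative(small), relative(large)

# Absolute counts differ by orders of magnitude; relative ones are comparable.
for total in range(2, 13):
    print(total, small[total], large[total],
          round(rel_small.get(total, 0.0), 3), round(rel_large.get(total, 0.0), 3))
```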
Segment Length Distributions
[Figure: two panels, Glyndon and Upper Baismans Run; x-axis: Segment length (meters); y-axis: Percent of all segments in class; series: Woody, Herbaceous, Pavement, Roads, Structures]
Frequencies & Distributions
In addition to the conventional frequencies described thus far, there is another type of frequency known as a cumulative frequency.
Cumulative frequencies are calculated by starting with the lowest class of an observed variable and its frequency, and then adding the frequency of each successive class to the preceding sum.
Cumulative frequencies are desirable when we want to know what proportion of observations have a value less than some threshold.
Frequencies & Distributions
For example, here's some frequency data for the woody vegetation class segments' distance from streams in Upper Baisman's Run:
CLASS   MIN. VALUE   FREQ.   CUM FREQ.
1          0.00000    9.30        9.30
2         23.31757    7.73       17.03
3         46.63514    7.08       24.11
4         69.95271    5.71       29.82
5         93.27028    4.70       34.52
6        116.58785    3.67       38.19
7        139.90542    3.17       41.36
8        163.22300    2.73       44.09
9        186.54057    5.36       49.45
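The CUM FREQ. column can be reproduced as a running sum of the FREQ. column. The sketch below uses the nine class frequencies from the table and Python's itertools.accumulate; the variable names are only illustrative.

```python
from itertools import accumulate

# FREQ. column for classes 1-9, as given in the table.
freq = [9.30, 7.73, 7.08, 5.71, 4.70, 3.67, 3.17, 2.73, 5.36]

# Running sum: each entry adds the current class to the preceding total.
cum_freq = list(accumulate(freq))

for cls, (f, c) in enumerate(zip(freq, cum_freq), start=1):
    print(f"{cls}  {f:5.2f}  {c:6.2f}")
```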
Baismans Run Primary Class Distance from Stream Distributions
[Figure: two panels, Conventional and Cumulative; x-axis: Distance to stream along D8 flow paths (meters); y-axis: Percent of all cells in class; series: Woody, Herbaceous, Pavement and Road, Structures, Ground]
Frequencies & Distributions
By examining the shape of frequency distribution curves, we can gain some sense of the distribution through some general characteristics:
1. Modality - Most distributions are unimodal, but we might also see bimodal or multi-modal distributions. If unimodal, we can also consider:
2. Symmetry - a.k.a. skewness of the distribution. Is it positively or negatively skewed?
3. Kurtosis - Describes the degree of peakedness or flatness of the curve.
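Skewness and kurtosis can also be put into numbers. The sketch below uses the common moment-based definitions (third and fourth central moments scaled by the variance), which are an assumption here since the slides describe these characteristics only qualitatively. The function name and sample values are hypothetical.

```python
def shape_measures(values):
    """Return (skewness, excess kurtosis) from central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance (population form)
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5              # > 0: positive skew, < 0: negative skew
    excess_kurtosis = m4 / m2 ** 2 - 3.0   # > 0: more peaked, < 0: flatter
    return skewness, excess_kurtosis

print(shape_measures([1, 2, 3, 4, 5]))   # symmetric, so skewness is 0
print(shape_measures([1, 1, 1, 2, 10]))  # long right tail, so skewness > 0
```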
Shapes of Histograms
[Figure: four example histogram shapes: Bell Shaped, Bimodal, Skewed, Random]
Mode: value with highest frequency
Range: largest value - smallest value
Developing a histogram from attribute data is one level of data reduction; we can describe bell shaped distributions using parameters that provide a more concise summary.