AEC 874 (2007) Field Data Collection & Analysis in Developing Countries
VII. Data Analysis & Project Documentation
Richard H. Bernsten
Agricultural Economics
Michigan State University

A. Things to Consider in Planning Data Analysis

1. Your Research Proposal
  - What were your original research objectives?
  - Are these objectives still appropriate, or do you need to modify them?

2. Your Target Audience
  a) Who is the audience for your analysis?
    - Academic faculty?
    - Policy makers?
    - Another client?
    - Multiple audiences?
  b) What are the expectations of your target audience regarding the type of analysis?
3. Your Data
  a) What type of analysis is possible with the data you have collected?
    - Sample size: few vs. many cases?
    - Measurement level (see Andrews et al.)
      - Nominal: categorical?
      - Ordinal: Likert scales?
      - Scale: continuous numeric data?
    - Data level: household vs. variety level?

4. Your Statistical Expertise
  a) Novice?
  b) Expert?

B. Statistics, Data & Analysis

1. Role of Statistics
  - Summarize data (descriptive statistics)
  - Reveal relationships (measures of association)

2. Classes of Statistics
  - Univariate: one variable (e.g., mean, median, mode)
  - Bivariate: two variables (e.g., chi-square, correlation analysis)
  - Multivariate: several variables; requires good data (e.g., regression, logit/probit analysis)
3. Types of Data (measurement level, SPSS)
  - Nominal data: data values represent categories with no intrinsic order (e.g., gender, types of income sources)
  - Ordinal data: data values represent categories with some intrinsic order (e.g., Likert, rank-order scales)
  - Scale data: data values are continuous numeric values on an interval or ratio scale (e.g., age, income, yield)
  - Note: You can transform scale data to categorical data, but categorical data can't be transformed to continuous data. Implications for data collection?

4. Types of Analysis (by data type)
  a) Descriptive Analysis (Fig. 10.3)
    1) Nominal/categorical data
      - Frequency tables describe data distributions with numbers or percents
      - SPSS output reports data categories, numbers of observations, & percents (total, adjusted, cumulative)
      - May also ask SPSS to report data as histograms, horizontal bar charts, pie charts
      - Limit the number of data categories to fewer than 10
      - If you have more than 9, combine categories with few cases into "other"
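
The frequency-table advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SPSS output: the function name `frequency_table`, the `min_count` threshold, and the income-source data are all hypothetical.

```python
from collections import Counter


def frequency_table(values, min_count=0, other_label="other"):
    """Return rows of (category, count, percent, cumulative percent).

    Categories with fewer than `min_count` cases are pooled into
    `other_label`, mirroring the advice to combine sparse categories.
    """
    counts = Counter(values)
    pooled = Counter()
    for category, n in counts.items():
        key = category if n >= min_count else other_label
        pooled[key] += n

    total = sum(pooled.values())
    # Largest categories first; the pooled "other" category is listed last.
    ordered = sorted(
        pooled.items(),
        key=lambda kv: (kv[0] == other_label, -kv[1], str(kv[0])),
    )

    rows = []
    cumulative = 0.0
    for category, n in ordered:
        percent = 100.0 * n / total
        cumulative += percent
        rows.append((category, n, round(percent, 1), round(cumulative, 1)))
    return rows


# Hypothetical survey responses: main income source of 20 households.
income_sources = (
    ["crops"] * 12 + ["livestock"] * 5 + ["wages"] * 2 + ["remittances"] * 1
)
table = frequency_table(income_sources, min_count=3)
for row in table:
    print(row)
```

Pooling is done on a copy of the counts, so the original responses stay available if the break point for "few cases" is revised later.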
    2) Scale/continuous numeric data
      Measures of Central Tendency
        - Mean is the average case (arithmetic average)
          - Not valid for nominal/categorical data
          - Not usually used for ordinal data (i.e., can't assume equal distance between items)
          - Very sensitive to the distribution of scale data
        - Median is the middle case
          - Use if scale data are asymmetric
          - Use for ordinal data
        - Mode is the most common data value
          - The only indicator of central tendency that is valid for nominal/categorical data
        - Note: For normally distributed data, mean = median = mode
      Measures of Dispersion/Spread
        - Minimum is the lowest value
        - Maximum is the highest value
        - Range is the high/low interval
        - Standard deviation (SD) indicates the percent of cases within certain ranges if the data are normally distributed (about 68% within 1 SD of the mean, about 95% within 2 SDs)
      Shape of the Distribution (for scale data) (Fig. 10.5, 10.9)
        - Skewness shows the degree & direction of asymmetry
          - If symmetrical, coefficient = 0
          - If the long tail is to the right (cases bunched at the left), coefficient = positive
          - If the long tail is to the left (cases bunched at the right), coefficient = negative
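
As a rough illustration of the measures on this slide, the sketch below computes them with Python's standard library. The helper name `describe` and the yield figures are hypothetical. The skewness here is the simple moment-based coefficient, which is an assumption of this sketch; SPSS reports a small-sample adjusted version, so its values will differ slightly.

```python
import statistics


def describe(values):
    """Central tendency, spread, and moment-based skewness for scale data."""
    n = len(values)
    mean = statistics.mean(values)
    # Second and third central moments (population form, divided by n).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
        "skewness": skewness,
    }


# Hypothetical plot yields (kg/ha) with one very high value.
yields = [400, 450, 450, 500, 520, 560, 600, 1500]
summary = describe(yields)
for name, value in summary.items():
    print(name, value)
```

The single high yield pulls the mean well above the median and gives a positive skewness coefficient, which is the situation in which the median is the safer summary.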
        - Kurtosis measures the peakedness of the distribution (Figs. 10.6, 10.10)
          - If same as normal distribution, coefficient = 0
          - If very peaked, coefficient = positive
          - If very flat, coefficient = negative
        - Note: If the skewness or kurtosis value is not close to 0
          - Mean isn't an appropriate measure of central tendency
          - Standard deviation isn't an accurate measure of dispersion
          - Problem: there is no clear definition of "not close to 0"

  b) Analysis of the Relationship/Association Between Variables
    - Question: Do pairs of variables move together, or are they independent?
    - Bivariate analysis does not require you to assume/identify a dependent/independent variable
    - Multivariate analysis assesses the relationship between a dependent variable & independent variables
      - Dependent variable: variable being affected
      - Independent variable: variable(s) affecting the dependent variable
    - Correlation does not imply causation
      - Statistics that measure association do not indicate causation
      - Only theory implies causation
    - Choice of the appropriate statistic to assess relationships depends on
      - Type of variables: nominal, ordinal, or scale (continuous)
      - Which variable is independent/dependent

    Considerations in Choosing a Statistical Method

      Independent variable         | Dependent variable:          | Dependent variable:
                                   | nominal or ordinal data      | interval or ratio data
      -----------------------------+------------------------------+-------------------------
      Nominal or ordinal data      | Cross-tabulation             | Paired t test, ANOVA
      (discrete, categorical)      |                              |
      Interval or ratio data       | Discriminant, probit, logit  | Correlation, regression
      (continuous, numeric)        |                              |

    See Andrews, A Guide for Selecting Statistical Techniques for Social Science Analysis, for details (overheads).

C. Strategies for Analyzing Survey Data

1. Review your research objectives, hypotheses, and questionnaire
2. Develop a tentative report outline (analytical plan)
3. Use descriptive statistics to explore your data (e.g., frequencies, mean, median, mode, SD, skewness, kurtosis)
  Use these results to decide
  - What sub-group comparisons are possible/look interesting to explore (e.g., Is there enough variability to justify further analysis?)
  - What associations can you assess with the data?
4. Revise your analytical plan, based on your new knowledge of the characteristics of the data
5. Finally, use bivariate/multivariate statistics to assess relationships/associations
D. Strategies & Considerations in Using Statistics

Begin your analysis with descriptive analysis, then look for associations to explain relationships

1. Describe the Variables (basic analysis)
  a) Nominal/categorical variables
    1) Strategies to consider
      - First run frequencies/percents (Example 10.1)
      - If there are very few cases in a category, combine/recode some categories to "other"
        - Be sure to save the original variables (with original codes) in an archive file, or rename to a new variable before recoding
      - If there are many cases in the category "other", recode some of these cases to specific categories (if possible)
      - Consider recoding continuous data into a few groups (e.g., recode the continuous variable education to 1 = 0-11, 2 = 12, 3 = 13-15, 4 = 16 or more; or Likert scale data (1-5) to 1-2, 3, 4-5)
      - Review the frequency distribution to decide what break points to use for regrouping continuous data to categorical data (e.g., first 1/2 = low, second 1/2 = high; or first 1/3 = low, second 1/3 = medium, third 1/3 = high)
      - After recoding data, go to the variable view and update variable values/information for the new/recoded variables
    2) Statistics for nominal/categorical data
      - Mode is the appropriate statistic for assessing central tendency
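
The education recode mentioned above might look like the following Python sketch. The variable names (`educ`, `r_educ`), the function name, and the household records are hypothetical; the break points are the ones given in the example. The recoded values go into a new variable so the original codes are preserved.

```python
def recode_education(years):
    """Map years of schooling to the four groups from the example above."""
    if years is None:
        return None  # keep missing values missing
    if years < 12:
        return 1  # 0-11 years
    if years < 13:
        return 2  # 12 years
    if years < 16:
        return 3  # 13-15 years
    return 4  # 16 or more years


# Value labels for the new variable, recorded at the time of recoding.
EDUC_LABELS = {1: "0-11 years", 2: "12 years", 3: "13-15 years", 4: "16+ years"}

# Hypothetical household records; "educ" is the original continuous variable.
households = [
    {"hh_id": 1, "educ": 6},
    {"hh_id": 2, "educ": 12},
    {"hh_id": 3, "educ": 14},
    {"hh_id": 4, "educ": 16},
    {"hh_id": 5, "educ": None},
]

for hh in households:
    # Write to a NEW variable; the original "educ" is left untouched.
    hh["r_educ"] = recode_education(hh["educ"])

for hh in households:
    print(hh["hh_id"], hh["educ"], hh["r_educ"], EDUC_LABELS.get(hh["r_educ"]))
```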
  b) Scale (interval/ratio, continuous data) variables
    1) Strategies to consider
      - Run means, mode, median, range, skewness, kurtosis, and standard deviation
      - Then look for outliers; assess the normal distribution assumption
    2) Statistics
      - If data ARE approximately normally distributed: present the mean (mode, median)
      - If data are NOT approximately normally distributed: recode to categorical data and present the distribution spread in a frequency table

2. Looking for Relationships: Statistical Inference
  Def.: Making inferences about the population parameters from estimates of sample statistics (requires random sampling)
  a) Some Concepts
    1) Standard Error of the Estimate
      Background
        - We sample from a population to generate sample statistics to estimate unknown population parameters.
        - Different samples will give different estimates.
        - The theoretical distribution of all possible values of a statistic obtained from a population is the sampling distribution of the statistic.
        - The mean of the sampling distribution is the expected value of the statistic.
        - The standard deviation of the sampling distribution is the standard error (SE).
        - When we estimate the SE of the mean from a single sample:

          $$SE_{\bar{x}} = \frac{SD}{\sqrt{N}}$$
        - SE of the mean (an SPSS descriptive statistics option) indicates how close/far the sample mean is to the population mean
        - For means of interval/ratio data & percentages, report the SE and/or the margin of error (ME), which is a multiple of the SE
          - At 99% CI, ME = 2.58 SE
          - At 95% CI, ME = 1.96 SE (roughly 2 SE)
          - At 90% CI, ME = 1.65 SE
      Sample Size and Data Distribution: A Caution
        - If the sample is large, the sampling distribution of the sample mean is approximately normal, even if the population was not normally distributed.
        - If the sample is small and the population is not normal, the sampling distribution of the mean won't be normal, limiting statistical inference
        - In such cases, you should use non-parametric statistics to analyze the data
        - This is why survey researchers often use the chi-square statistic to analyze survey data
    2) Confidence Interval (CI)
      - Def.: A range around the sample mean, based on the SE (i.e., the 95% CI is the range +/- 2 SEs)
      - SE and CI indicate the reliability of a statistic
  b) Statistical Significance
    - These statistics (measures of association) all show the degree of association & statistical significance (non-significance)
    - Significance indicates the probability that a relationship exists in the sample if it doesn't exist in the population (e.g., a 1% probability that you accept a false hypothesis as true)
    - The alpha/critical level of probability for acceptance is determined by the researcher/sponsor
    - Traditional alpha levels of 99%/95% are conventions, not absolutes (Fisher, agricultural research). Must consider the consequence of accepting a false result as true
      - Example: A traditional variety (TV) yields 500 kg/ha & a modern variety (MV) yields 800 kg/ha, but the difference is only significant at the 80% level. Each variety costs the same price. Would you plant the MV or the TV?
    - It's often more informative to report the level at which your results are significant, rather than simply saying they are non-significant (e.g., "The means are significantly different at the 88% level")
    - Lack of statistical significance may be due to the fact that
      - No relationship exists
      - Non-sampling error was large, so the data are not accurate
      - The sample size is small, so the SE is large
    - Statistical significance does NOT indicate the importance of your result!
      - The importance of a result is a function of the size of the coefficient & the meaning that the variables/relationships imply.
      - Statistical results are either significant or non-significant (NOT "insignificant")
      - A result may be statistically significant, but still insignificant (i.e., very small, and thus not important)
      - Even if the differences in the numerical values are large (e.g., mean yields of 500 kg/ha vs. 1,000 kg/ha), if the relationship is non-significant, the values cannot be distinguished statistically. So, don't emphasize the magnitude of a non-significant difference when reporting your results.
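
The standard error, margin of error, and confidence interval introduced under "Some Concepts" can be sketched as follows. This is a minimal illustration assuming a normal sampling distribution (so the multipliers come from the normal curve, as in the slides); the function names and the yield data are hypothetical, and a t-based multiplier would be more appropriate for very small samples.

```python
import math
import statistics
from statistics import NormalDist


def z_multiplier(confidence):
    """Normal-curve multiplier for a two-sided interval, e.g. 0.95 -> ~1.96."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def mean_with_margin(values, confidence=0.95):
    """Return the mean, its SE (SD / sqrt(N)), the margin of error, and the CI."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD
    se = sd / math.sqrt(n)
    me = z_multiplier(confidence) * se
    return {"mean": mean, "se": se, "me": me, "ci": (mean - me, mean + me)}


# Hypothetical yields (kg/ha) from 10 sampled plots.
plot_yields = [480, 510, 530, 495, 520, 505, 515, 490, 500, 525]
result = mean_with_margin(plot_yields, confidence=0.95)
print(result)

for level in (0.90, 0.95, 0.99):
    print(level, round(z_multiplier(level), 3))
```

Because the SE shrinks with the square root of N, quadrupling the sample size only halves the margin of error.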
  c) Measures of Association Used to Analyze Survey Data
    1) Crosstabulation (chi-square analysis, χ²)
      Objective
        - To test if the distribution of one variable differs significantly for values of the other variable
      Data Requirements
        - Both variables must be categorical (i.e., nominal, ordinal)
        - But you can convert scale data variables to categorical variables and then use chi-square analysis
        - Don't need to assume the data are normally distributed
        - Don't need to identify a dependent/independent variable
        - Most common measure of association for survey variables (Why?)
      A Word of Caution
        - The χ² statistic is invalid if the expected value is less than 5. However, SPSS will still report a χ² value even if it is meaningless!
        - In a cross-tab table, the cell with the smallest expected frequency (not the actual frequency) is the one in the row with the smallest row total & in the column with the smallest column total (Table 10.3)
        - To estimate the expected cell frequency, divide the smallest row total in the cross-tab table by N & multiply this number by the smallest column total. (Evaluate: is it less than 5?)
      Suggestions (Table example) (Example 11.4)
        - The variable you choose as the row/column variable is not critical
        - It's confusing to interpret the results if you request both column & row percents, so request only column percents
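
The expected-frequency check described under "A Word of Caution" can be illustrated with a short Python sketch. The function names and the 2 x 2 table are hypothetical, and the code assumes no row or column of the table is completely empty.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def smallest_expected(table):
    """Shortcut from the slide: smallest row total / N * smallest column total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return min(row_totals) / n * min(col_totals)


def chi_square(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# Hypothetical cross-tab: rows = small/large farms, columns = adopter/non-adopter.
crosstab = [
    [30, 10],
    [20, 40],
]
lowest = smallest_expected(crosstab)
statistic = chi_square(crosstab)
print("smallest expected count:", lowest)
print("chi-square:", round(statistic, 2))
if lowest < 5:
    print("Expected count below 5: recode into fewer groups before testing.")
```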
        - If N is small (under 200?), construct cross-tab tables with 3 or fewer categories/variable. Why?
        - If N is very small (under 100?), use the results in the cross-tab table to estimate the expected frequency. Why?
        - If the expected value is less than 5, recode the data into fewer equal-size groups to increase the expected value
      Statistics
        - SPSS reports the χ² statistic (larger is better) & the probability level (smaller is better) (Example 11.3)
        - In the text of an article, report the direction of the observed relationship & the probability level (in parentheses) [e.g., "χ² analysis indicates a significant (95% level) negative relationship between age & education"]
        - In the table, report cross-tab results, the χ² statistic & the probability level
    2) Analysis of Variance (one-way)
      Objective
        - Determine if the mean values of the dependent variable are significantly different for each category of the independent variable (the t-test is a special case)
      Data Requirements
        - Must identify an independent & a dependent variable
        - Independent variable: categorical data with 2 or more categories (e.g., 2 or more varieties) (Fig. 11.5)
        - Dependent variable: scale (continuous) data (e.g., yield of several varieties)
        - Each case of the dependent variable must be independent of the others
      Caution
        - The spread of data points (i.e., variance) of the dependent variable must be similar within each category of the independent variable, & the data must be normally distributed
      Suggestions
        - Test for homogeneity of variances
        - Don't use ANOVA if variances are very different or sample sizes of groups differ greatly
      Statistics (Example 11.5)
        - F-test evaluates significance (i.e., H0 that all means are equal)
        - Multiple comparisons test (Scheffé) indicates if individual means are different (pairwise comparisons)
        - In the text of an article, report the direction of the relationship, significantly different means & the F-test statistic [e.g., "ANOVA indicates the mean yields of varieties A (845 kg/ha) & B (933 kg/ha) are significantly (95% level) higher than the yield of variety C (534 kg/ha), with an F-value of 6.75"]
        - In tables, report group means, the F-test (probability level for the ANOVA) & the multiple comparison test (Scheffé) results
    3) Correlation Analysis
      Objective
        - Measures the degree that 2 continuous variables move together from one case to another
      Data Requirements
        - Both variables must be scale (continuous) or ordinal data
        - Don't need to identify a dependent/independent variable
      Suggestions
        - Run correlations to explore potential relationships
      Statistics
        - Different types of data require different statistics
          - For interval/ratio scale data, use Pearson's product-moment correlation
          - For ordinal data, use Spearman rank correlation
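
To make the distinction between the two correlation statistics concrete, here is a minimal Python sketch of both. The function names and the fertilizer/yield data are hypothetical. Spearman's coefficient is computed here as Pearson's coefficient applied to average ranks, which is one standard way of handling tied values.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation for interval/ratio data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation for ordinal data (Pearson on the ranks)."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: N-fertilizer rate (kg/ha) and yield (kg/ha) on 5 plots.
fertilizer = [0, 20, 40, 60, 80]
crop_yield = [500, 620, 700, 850, 900]

r = pearson_r(fertilizer, crop_yield)
rho = spearman_rho(fertilizer, crop_yield)
print("Pearson r:", round(r, 3), "r squared:", round(r * r, 3))
print("Spearman rho:", round(rho, 3))
```

Squaring `r` gives the share of variance the two variables have in common.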
        - Correlation coefficient (r) indicates the strength of the relationship & ranges from 0 (no relationship) to +/-1 (perfect relationship) (Example 11.6)
        - Sign indicates the direction of the relationship (Fig. 11.7)
          - Sign positive (+): direct
          - Sign negative (-): inverse
        - Coefficient of determination (r²) indicates the percent of shared variance
        - In the text of an article, report the direction of the relationship (positive/negative), the correlation coefficient (r) & r² [e.g., "Correlation analysis indicated that yield & N-fertilizer rates are positively correlated (r = 0.79), with an r² of 0.62"]
        - In the table, report the correlation coefficients (r), signs, probability levels, and r²
        - May present several variables/correlations in matrix format, which is often included as an appendix
    4) Regression Analysis
      Objective
        - Measures the relationship between continuous independent & dependent variables (Fig. 11.9)
      Data Requirements
        - Must identify 1 dependent variable & 1 or more independent variables
        - Independent & dependent variables are usually scale data
        - But can use dummy independent variables (0,1) in multiple regression
        - Linear models are most common, but can use other functional forms, depending on your assessment of the theoretical relationship (e.g., log, quadratic models)
      Suggestions
        - The scatter plot indicates the data distribution, which must be well distributed over the range of data values (Fig. 11.10)
        - Print out scatter plots of dependent/independent variables (e.g., yield, fertilizer) & assess the scatter plots to find outliers
        - Check for outliers before running a regression & consider dropping cases with extreme/impossible values (e.g., small plots, which are prone to measurement error)
        - Use theory (and possibly scatter plots) to specify the model & functional form, but avoid stepwise procedures (data mining)
          - Theory suggests that yield increases with higher N application & then declines, suggesting a quadratic model
          - But farm-level data seldom include extremely high N rates, justifying a linear model
        - Review the correlation matrix to identify highly correlated (r above 0.90) variables (multicollinearity) in the model. If any variables are highly correlated, drop one or more of these variables (Example 11.6)
        - Missing data for any variable will eliminate that case from the model, which is especially a problem in multiple regression
        - The criterion for deciding if the model is a good fit (R²) for the data is a function of the type of relationship; social analysis often reports models with a low R²
        - Avoid including dominant independent variables in your model (e.g., Production = harvested area, fertilizer, labor, etc.)
        - Can use a standardized coefficient model to estimate the percent contribution of each independent variable
      Statistics (Example 11.7)
        - The constant shows the value of the dependent variable when the independent variable(s) equal(s) zero
        - The regression coefficient indicates the change in the dependent variable that is associated with a 1-unit change in the independent variable
        - Significance of a coefficient is estimated by dividing the coefficient by its SE, and then comparing this value to the t-distribution value
        - R² indicates the strength of the influence of the independent variables on the dependent variable; it ranges from 0 to 1 (i.e., none/complete). Evaluate adjusted R² (R-bar squared), which adjusts for degrees of freedom. Why?
        - The F-value tests the hypothesis that all betas are equal to zero
        - In the text of an article, report the direction of the relationship, the beta coefficient, its significance, R² & the F-value [e.g., "Regression analysis indicated that the nitrogen application rate (0.44) & weeding days (0.22) are significantly associated (95% level) with yield. The model had an R² value of 0.65 & a significant (99%) F-value."]
        - Also, list & discuss non-significant coefficients. Why are they non-significant?
        - In tables, report all variables, coefficients, SEs (in parentheses below each coefficient), significance levels (*** = .01, ** = .05, * = .10), the F-value & adjusted R²
        - Note: Many relationships that are significant in bivariate analysis will be non-significant in a multivariate model. Why?
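
The regression statistics listed above can be traced through a one-variable example. The sketch below is an illustration only: it fits a simple (single independent variable) linear model by ordinary least squares, and the function name and the nitrogen/yield data are hypothetical. A real analysis with several independent variables would use a statistics package.

```python
import math


def simple_ols(x, y):
    """Fit y = constant + slope * x and return the usual summary statistics."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    slope = sxy / sxx
    constant = mean_y - slope * mean_x

    residuals = [b - (constant + slope * a) for a, b in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)       # unexplained variation
    sst = sum((b - mean_y) ** 2 for b in y)    # total variation
    df = n - 2                                 # n minus 2 estimated parameters

    se_slope = math.sqrt(sse / df / sxx)
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / df       # adjusts for degrees of freedom
    return {
        "constant": constant,
        "slope": slope,
        "se_slope": se_slope,
        "t_slope": slope / se_slope,           # coefficient divided by its SE
        "r2": r2,
        "adj_r2": adj_r2,
    }


# Hypothetical data: N application (kg/ha) and yield (kg/ha) on 6 plots.
n_rate = [0, 20, 40, 60, 80, 100]
plot_yield = [510, 590, 720, 760, 880, 930]

fit = simple_ols(n_rate, plot_yield)
for name, value in fit.items():
    print(name, round(value, 3))
```

The `t_slope` value is what gets compared to the t-distribution with `n - 2` degrees of freedom, and `adj_r2` is always a little below `r2` because it penalizes the parameters used.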
    5) Logit & Probit Analysis
      Objective
        - Measures the degree & direction of the relationship between continuous independent variables & a category of a dependent variable
      Data Requirements
        - Dependent variable is categorical (e.g., adopter/non-adopter)
        - Independent variables are scale (continuous) data
      Statistics
        - Number of cases correctly classified, contribution of each independent variable to prediction (coefficients), significance of each independent variable

E. Responsibility for Analysis

Primary responsibility for analysis lies with the researcher(s) who
  - Designed the project
  - Identified the research issues
  - Developed the questionnaires
  - Supervised data collection
& therefore know the analytical needs & limitations of the data
F. Documenting the Project

Purpose
  - Provide a permanent record of the project
  - Provide a reference for your analysis
  - Provide a reference for other users

1. Archive Project Materials & Leave at the Research Location
  - Assemble questionnaires (for future reference), post-coding sheets, etc.
  - Make a copy of the data on CDs
  - Make a copy of the Project Documentation
  - Categorize, label & store all material in a safe place that is protected from heat (sun), magnetic interference, mold, etc.

2. Project Documentation (bound volume)
  - Project Documentation (summary) (e.g., project title, sponsors, geographical coverage, dates, project overview, publications)
  - Description of Survey Methodology (e.g., overview of research issues, survey locations, sampling method/limitations, enumerator selection/training, module design process, survey instruments, data entry)
  - Survey Documentation (for each module) (e.g., purpose, topics covered, sample size, data level, unit of observation, number of rounds, survey areas & dates, time reference for data (season, months), base file name, copies of modules (all languages), names of enumerators & respondents by survey location)
  - SPSS Systems/Data File Summaries (all SPSS files) (e.g., names of all base files (module name), description of data, data limitations, file information printouts, history of base file modifications/transformations including names of new files created)

3. Suggestions for Documenting Modified Systems Files
  Failure to update files/variable descriptions is a major problem
  a) Suggestions for Recoded/Computed Variables
    - Don't recode the original variable. First create a new variable from the data and recode these data
    - Give each recoded/computed variable a name that begins with R/C to indicate it was recoded/computed
    - Immediately create value labels, etc., for all new variables
    - Describe variable transformations in the variable label [e.g., Yield (yield=prod/area)]
  b) Keep a Permanent Record (file) of Data Transformations
    - Paste SPSS commands into the Syntax Editor, then run them from the editor. Save this file!
    - At the end of the first SPSS session, copy the syntax that you want to save/archive into a word processing file, and at the end of each subsequent SPSS session, add the new syntax commands to that file
  c) Periodically Print Out the File Information
    - After making transformations, print out the new file information
  d) Cleaning Up Your Current Work File
    - After transforming a variable, drop the old variable from the current version of the file
    - Be sure to save the original variable in an earlier version of the file
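
The same documentation habits carry over to any analysis tool. As a hedged illustration outside SPSS, the Python sketch below adds a computed variable under a new C-prefixed name, stores a label describing the transformation, and appends an entry to a running transformation log. The function name, record layout, and data are all hypothetical.

```python
def add_computed_variable(records, name, label, formula, log):
    """Add a computed variable to every record and document the change.

    Refuses to overwrite an existing variable, so original data are never lost.
    """
    if any(name in record for record in records):
        raise ValueError(f"variable {name!r} already exists; choose a new name")
    for record in records:
        record[name] = formula(record)
    # Permanent record of what was created and how.
    log.append({"variable": name, "label": label})


# Hypothetical plot-level data: production (kg) and area (ha).
plots = [
    {"plot_id": 1, "prod": 1200.0, "area": 1.5},
    {"plot_id": 2, "prod": 900.0, "area": 2.0},
]
transformation_log = []

add_computed_variable(
    plots,
    name="c_yield",                      # "c" prefix marks a computed variable
    label="Yield (yield=prod/area)",     # transformation described in the label
    formula=lambda rec: rec["prod"] / rec["area"],
    log=transformation_log,
)

for plot in plots:
    print(plot)
for entry in transformation_log:
    print(entry)
```

Saving `transformation_log` alongside the data file gives later users the history of how each derived variable was produced.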