Accuracy assessment methods and challenges

Giles M. Foody
School of Geography, University of Nottingham
giles.foody@nottingham.ac.uk

Background
Need for accuracy assessment established.
Considerable progress: now see use of probability sampling, provision of confidence intervals/SE etc.
BUT challenges remain, as major errors, biases and uncertainties remain.

Challenges include
- Class definition (what is a forest?)
- Definition of change: modification v conversion etc.
- Impacts of spatial mis-registration
- Inter-sensor radiometric calibration
- Variations in sensor properties (spatial resolution etc.)
- Impacts of time of image acquisition
- Required precision of estimation
- Rarity and sampling issues
- etc.
Here focus on issues connected with ground reference data quality and size.

PART 1: Error matrix interpretation
Commonly evaluate accuracy with a basic binary confusion matrix.

Popular measures, e.g.
Sensitivity = Producer's accuracy = $\frac{TP}{TP + FN}$
Prevalence = $\frac{TP + FN}{TP + FN + FP + TN}$
Others (e.g. user's accuracy) may be derived.
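As a minimal sketch of how these measures follow from the four cells of a binary confusion matrix. The function name and the counts are hypothetical and chosen only for illustration.

```python
def binary_accuracy_measures(tp, fn, fp, tn):
    """Return common measures from a binary confusion matrix (cell counts)."""
    total = tp + fn + fp + tn
    return {
        "producers_accuracy": tp / (tp + fn),  # sensitivity
        "users_accuracy": tp / (tp + fp),      # positive predictive value
        "prevalence": (tp + fn) / total,       # share of reference cases in the class
        "overall_accuracy": (tp + tn) / total,
    }

# Hypothetical counts, not taken from the slides
m = binary_accuracy_measures(tp=40, fn=10, fp=5, tn=45)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```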

A simple question?
How accurate is this classification (or estimates of change)?
Is the producer's accuracy = 160/260 ≈ 62%?
Is amount of change = 260/1000 = 26%?

NO, because the matrix might look like: [apparent error matrix]
but is actually: [actual error matrix]

Occurs because the ground data set is imperfect.
Good news: can correct for ground data error.
Note: here assumed conditional independence (trends are more complex and can be in a different direction if this is invalid, and it will be invalid in many studies).

Impact on estimation
Real accuracy (%)               Perceived (%)
Ground data   Remote sensing    RS accuracy   Prevalence
90            80                62            26
95            90                76            23
Systematically underestimate accuracy of remote sensing change detection and overestimate amount of change.
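The direction of these biases can be illustrated with a short calculation. This is a sketch under assumptions that the slides do not state: a true prevalence of change of 20%, equal sensitivity and specificity within each data set, and conditionally independent errors. With those assumptions, the case of 95% accurate ground data and 90% accurate remote sensing gives a perceived producer's accuracy of roughly 76%, in line with the figure quoted above.

```python
def perceived_measures(true_prevalence, ground_acc, rs_acc):
    """Perceived producer's accuracy and prevalence when the ground data
    themselves contain error.

    Assumptions (illustrative only): each data set has equal sensitivity
    and specificity, and the two sets of errors are conditionally
    independent given the true class.
    """
    t = true_prevalence
    # Proportion the ground data label as 'change': true changes found + false alarms
    ground_pos = t * ground_acc + (1 - t) * (1 - ground_acc)
    # Proportion labelled 'change' by both ground data and remote sensing
    both_pos = t * ground_acc * rs_acc + (1 - t) * (1 - ground_acc) * (1 - rs_acc)
    return both_pos / ground_pos, ground_pos

# Assumed true prevalence of 20% (hypothetical)
acc, prev = perceived_measures(0.20, ground_acc=0.95, rs_acc=0.90)
print(f"perceived producer's accuracy = {acc:.1%}, perceived prevalence = {prev:.1%}")
```

The remote sensing is 90% accurate in this scenario, yet it appears much less accurate because disagreements caused by ground data error are charged to the classification.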

Impact of imperfect ground data
Systematic bias, e.g.
- Underestimate producer's accuracy.
- Typically overestimate prevalence (e.g. amount of change).
Magnitude of bias can be very large even if the ground data set is highly accurate.
Can correct/compensate for ground data error.
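One well-known way to compensate for reference data error when estimating prevalence is the Rogan-Gladen estimator. The sketch below uses it purely as an illustration; the slides do not say which correction is intended. It assumes the sensitivity and specificity of the ground data are known, and the input values are hypothetical.

```python
def corrected_prevalence(observed_prevalence, sensitivity, specificity):
    """Rogan-Gladen correction of an observed prevalence for known
    misclassification in the reference (ground) data."""
    denom = sensitivity + specificity - 1
    if denom <= 0:
        raise ValueError("reference data must perform better than random labelling")
    estimate = (observed_prevalence + specificity - 1) / denom
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion

# Hypothetical: 26% observed change, ground data 90% sensitive and 90% specific
theta = corrected_prevalence(0.26, sensitivity=0.90, specificity=0.90)
print(f"corrected prevalence = {theta:.3f}")
```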

PART 2: Comparisons
Often compare (e.g. accuracy over time, change rates between regions).
Based on comparison of proportions.
Must design an accuracy assessment programme to meet its objectives.
One key concern is the size of the testing set.
Too large: any non-zero difference will appear statistically significant.
Too small: programme may fail to detect an important difference.

Sample size determination
Often based on precision to estimate a proportion:
$p \pm h = p \pm z_{\alpha/2}\,(SE)$
$SE = \sqrt{\frac{p(1-p)}{n}}$
$n = \frac{z_{\alpha/2}^{2}\, p(1-p)}{h^{2}}$
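A minimal sketch of the precision-based calculation, using the standard formula n = z² p(1 - p) / h². The planning values (anticipated accuracy of 85%, half-width of 5 percentage points) are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_for_precision(p, h, alpha=0.05):
    """Sample size needed to estimate a proportion p to within +/- h
    at confidence level 1 - alpha (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(z ** 2 * p * (1 - p) / h ** 2)

# Hypothetical planning values
n = sample_size_for_precision(p=0.85, h=0.05)
print(n)
```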

BUT
Aim is often not to estimate accuracy to a given precision but to use it in a comparative analysis:
- comparison against a target
- comparison against another accuracy (e.g. classifier comparison).
Need to consider additional properties.

Comparison
Very common: v. target, or classifier evaluation, e.g.
$z = \frac{\hat{\kappa}_1 - \hat{\kappa}_2}{\sqrt{\hat{\sigma}^2_{\hat{\kappa}_1} + \hat{\sigma}^2_{\hat{\kappa}_2}}}$
BUT often inappropriate & pessimistically biased.

Comparative analysis
Comparative analyses often based on hypothesis testing, e.g.
H0: no difference in accuracy
H1: the accuracy values differ
Two types of error:
Type I: when H1 is incorrectly accepted (declare a difference as being significant when it is not).
Type II: when H0 is incorrectly accepted (fail to detect a meaningful difference that does exist).

Type I error
H1 is incorrectly accepted (declare a difference as being significant when it is not).
Probability of making a Type I error is expressed as the significance level, α.
Commonly set α = 0.05 (i.e. a 5% chance of inferring that a significant difference exists when actually there is no difference).

Type II error
H0 is incorrectly accepted (fail to detect a meaningful difference that does exist).
Probability of making a Type II error is β and is related to the power of the test (1 - β).
Type I errors typically viewed as 4 times more important than Type II, so commonly β = 0.2, or (1 - β) = 0.8.

If (1 - β) = 0.8: 80% chance of declaring a difference that exists as being significant.
Is 0.8 adequate?
Many studies fail to detect a significant difference: did the study have sufficient power?
Tests using small sample sizes are often underpowered.
Difficult to interpret non-significant results (is there really no difference, or did the test just fail to identify it?)

Estimating sample size
To determine sample size need to consider:
- Significance level, α
- Power, (1 - β)
- Effect size: minimum meaningful difference.

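e.g. common remote sensing scenario 1: v target, with continuity correction:
$n' = \left[ \frac{z_{\alpha}\sqrt{P_0(1-P_0)} + z_{\beta}\sqrt{P_1(1-P_1)}}{P_1 - P_0} \right]^2$
$n = \frac{n'}{4}\left( 1 + \sqrt{1 + \frac{2}{n'\,|P_1 - P_0|}} \right)^2$
where $P_0$ is the target accuracy and $P_1$ differs from it by the effect size.
Use acquired data to test for difference using:
$z = \frac{|p - P_0| - \frac{1}{2n}}{\sqrt{P_0 Q_0 / n}}$, with $Q_0 = 1 - P_0$.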
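The comparison against a target can be sketched as follows, using the standard one-sample formula for a proportion with continuity correction. The function names, the choice of a one-sided test by default, and the planning values (target 85%, minimum meaningful difference of 5 percentage points) are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def sample_size_vs_target(p0, p1, alpha=0.05, power=0.80, two_sided=False):
    """Sample size to detect a difference between a target accuracy p0 and
    an alternative p1, with continuity correction."""
    z_a = NormalDist().inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_b = NormalDist().inv_cdf(power)
    delta = abs(p1 - p0)  # effect size
    n_prime = ((z_a * math.sqrt(p0 * (1 - p0))
                + z_b * math.sqrt(p1 * (1 - p1))) / delta) ** 2
    n = n_prime / 4 * (1 + math.sqrt(1 + 2 / (n_prime * delta))) ** 2
    return math.ceil(n)

def z_vs_target(p, n, p0):
    """Continuity-corrected z statistic for an observed accuracy p from n cases."""
    return (abs(p - p0) - 1 / (2 * n)) / math.sqrt(p0 * (1 - p0) / n)

# Hypothetical planning values
n_target = sample_size_vs_target(p0=0.85, p1=0.90)
print(n_target)
```

Halving the effect size roughly quadruples the required sample, which is why the minimum meaningful difference has to be fixed before sampling.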

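e.g. common scenario 2: v another accuracy, with continuity correction:
$n' = \left[ \frac{z_{\alpha}\sqrt{2\bar{P}\bar{Q}} + z_{\beta}\sqrt{P_1 Q_1 + P_2 Q_2}}{P_2 - P_1} \right]^2$
$n = \frac{n'}{4}\left( 1 + \sqrt{1 + \frac{4}{n'\,|P_2 - P_1|}} \right)^2$
where $\bar{P} = (P_1 + P_2)/2$, $Q = 1 - P$, and $n$ is the size of each sample.
Use acquired data to test for difference using:
$z = \frac{|p_1 - p_2| - \frac{1}{2}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}{\sqrt{\bar{p}(1-\bar{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$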
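A corresponding sketch for comparing two independent accuracies, again using the standard two-sample formula with continuity correction. The function name, the one-sided default, and the accuracy values are hypothetical; the result is the size needed for each of the two samples.

```python
import math
from statistics import NormalDist

def sample_size_two_accuracies(p1, p2, alpha=0.05, power=0.80, two_sided=False):
    """Per-sample size to detect a difference between two independent
    accuracies p1 and p2, with continuity correction."""
    z_a = NormalDist().inv_cdf(1 - (alpha / 2 if two_sided else alpha))
    z_b = NormalDist().inv_cdf(power)
    delta = abs(p2 - p1)  # effect size
    p_bar = (p1 + p2) / 2
    q_bar = 1 - p_bar
    n_prime = ((z_a * math.sqrt(2 * p_bar * q_bar)
                + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / delta) ** 2
    n = n_prime / 4 * (1 + math.sqrt(1 + 4 / (n_prime * delta))) ** 2
    return math.ceil(n)

# Hypothetical: two classifications with accuracies of 85% and 90%
n_each = sample_size_two_accuracies(0.85, 0.90)
print(n_each)
```

For the same effect size, comparing two estimated accuracies needs considerably more cases per sample than comparing one accuracy against a fixed target, because both values carry sampling error.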

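Note:
1. Equations may be re-written, e.g. $z_{\beta} = \frac{|P_2 - P_1|\sqrt{n'} - z_{\alpha}\sqrt{2\bar{P}\bar{Q}}}{\sqrt{P_1 Q_1 + P_2 Q_2}}$, where $\bar{P} = (P_1 + P_2)/2$, $Q = 1 - P$ and $n'$ is the sample size before continuity correction.
2. Can also use alternatives for related samples (e.g. McNemar test).
3. Instead of hypothesis testing could use confidence intervals.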
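For related samples (the same testing set used for both classifications), the McNemar test depends only on the cases where the two classifications disagree. A minimal sketch with hypothetical counts, using the normal approximation with an optional continuity correction:

```python
import math
from statistics import NormalDist

def mcnemar_z(f12, f21, continuity=True):
    """McNemar z statistic and two-sided p-value.

    f12: cases correct in classification 1 but wrong in classification 2
    f21: cases wrong in classification 1 but correct in classification 2
    """
    if f12 + f21 == 0:
        raise ValueError("no discordant cases; test is undefined")
    diff = abs(f12 - f21)
    if continuity:
        diff = max(0.0, diff - 1)
    z = diff / math.sqrt(f12 + f21)
    p_value = 2 * (1 - NormalDist().cdf(z))
    return z, p_value

# Hypothetical discordant counts
z, p = mcnemar_z(f12=30, f21=12)
print(f"z = {z:.2f}, p = {p:.4f}")
```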

So what?
Remember, important to use an appropriate size.
Too large: any non-zero difference will appear statistically significant.
Too small: fail to detect an important difference.
Sizes used in remote sensing range from 10s and 100s to 1000s and 10,000+.

Size needed: v target

Size needed: v another accuracy

Conclusions
Error in ground truth can lead to systematic bias: it underestimates accuracy, and is correctable.
Accuracy assessment often has a comparative component: this has implications for sample size (need to specify effect size, α, and β).
Required size may be quite large.
Need to be aware of the danger of using an inappropriate size (too small or too large).

The end