Some basic statistics and curve fitting techniques

Statistics is the discipline concerned with the study of variability, with the study of uncertainty, and with the study of decision-making in the face of uncertainty (Lindsay et al., 2004). Statistics is the science of collecting, organizing, analyzing and interpreting data.
Nominal data fits in categories that are not ordered (e.g. taxa).
Ordinal data fits in categories that are ordered, but the distance between levels has no objective measure (e.g. pain level).
Scale data fits in categories that are ordered, with measured units between levels (e.g. units such as m/s).

Why do we need statistics?
Statistics helps to provide answers to questions such as:
1. What is the concentration of plankton at the dock right now (given past measurements)?
2. Will species x be in the water tomorrow?
We are interested in the likelihood of the answer. Statistics also helps reduce large datasets to their salient characteristics.
The use of statistics to make a point:
1. Statistics never proves a point (it says something about likelihood).
2. If you need fancy statistics to support a point, your point is, at best, weak (Lazar, 1991, personal communication).

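Why do we need statistics?
[Diagram: a population is sampled (realization) to yield a sample. Description gives the parameters of the population and the statistics of the sample; inference runs from the statistics of the sample to the parameters of the population.]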

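Statistical description of data
Statistical moments (1st and 2nd):
Mean: $\bar{x} = \frac{1}{N}\sum_{j=1}^{N} x_j$
Variance: $\mathrm{Var} = \frac{1}{N-1}\sum_{j=1}^{N} (x_j - \bar{x})^2$
Standard deviation: $\sigma = \sqrt{\mathrm{Var}}$
Average deviation: $\mathrm{Adev} = \frac{1}{N}\sum_{j=1}^{N} |x_j - \bar{x}|$
Standard error: $\sigma_{error} = \sigma / \sqrt{N}$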

Standard error: $\sigma_{error} = \sigma / \sqrt{N}$
When is the uncertainty not reduced by additional sampling?
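The moments listed above translate almost line by line into code. The following is a minimal sketch in plain Python; the function name `describe` and the sample values are hypothetical and serve only as illustration.

```python
import math

def describe(x):
    """Return the basic moments of a list of numbers."""
    n = len(x)
    mean = sum(x) / n
    # Variance with the N-1 normalization used in the definition above.
    var = sum((xj - mean) ** 2 for xj in x) / (n - 1)
    std = math.sqrt(var)
    # Average (absolute) deviation, normalized by N.
    adev = sum(abs(xj - mean) for xj in x) / n
    # Standard error of the mean.
    sem = std / math.sqrt(n)
    return {"mean": mean, "var": var, "std": std, "adev": adev, "sem": sem}

# Hypothetical sample, for illustration only.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = describe(sample)
print(stats)
```

Note that the 1/sqrt(N) reduction in the standard error assumes independent samples with purely random error; correlated samples or a shared bias are cases worth keeping in mind for the question above.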

Statistical description of data
Probability distribution:

Non-normal probability distribution:

Statistical description of data
Nonparametric statistics (when the distribution is unknown):
Rank statistics: $x_1, x_2, \ldots, x_N \rightarrow 1, 2, \ldots, N$
Median
Percentile
Deviation estimate
The mode
Issue: robustness, sensitivity to outliers
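To illustrate the robustness issue, the sketch below compares how the mean and the rank-based statistics respond to a single outlier. The data are hypothetical, and the `percentile` helper uses linear interpolation between sorted values, which is only one of several common conventions.

```python
import statistics

def percentile(x, p):
    """p-th percentile (0-100) by linear interpolation between sorted values."""
    s = sorted(x)
    k = (len(s) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
dirty = [1.0, 2.0, 3.0, 4.0, 500.0]  # same data with one outlier

# The mean follows the outlier; the median and interquartile range do not.
mean_shift = statistics.mean(dirty) - statistics.mean(clean)
median_shift = statistics.median(dirty) - statistics.median(clean)
iqr_clean = percentile(clean, 75) - percentile(clean, 25)
iqr_dirty = percentile(dirty, 75) - percentile(dirty, 25)
print(mean_shift, median_shift, iqr_clean, iqr_dirty)
```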

Statistical description of data
Robust: insensitive to small departures from the idealized assumptions for which the estimator is optimized (Press et al., 1992, Numerical Recipes).

Statistical description of data
Examples from COBOP, linking variability in IOPs to substrate (Boss and Zaneveld, 2003, L&O).

What do we care about in research to which statistics can contribute?
Relationships between variables (e.g. do we get blooms when nutrients are plentiful?).
Contrast between conditions (e.g. is diatom vs. dinoflagellate domination associated with fresh water input?).

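Relationship between 2 variables
Linear correlation:
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$
Rank-order correlation, where $R_i$ and $S_i$ are the ranks of $x_i$ and $y_i$:
$$r_s = \frac{\sum_i (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_i (R_i - \bar{R})^2}\,\sqrt{\sum_i (S_i - \bar{S})^2}}$$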
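The two coefficients differ only in whether the formula is applied to the values or to their ranks. A minimal sketch, assuming ties receive the average rank (a common convention that the slide does not specify); the data are hypothetical:

```python
import math

def pearson(x, y):
    """Linear correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(v):
    """Ranks 1..N; tied values receive the average of their ranks."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Rank-order correlation: the linear correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# A monotonic but nonlinear relationship (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [math.exp(v) for v in x]
r = pearson(x, y)
rs = spearman(x, y)
print(r, rs)
```

For this monotonic relationship the rank-order coefficient is 1 while the linear coefficient is smaller, since the relationship is not a straight line.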

Relationship between 2 variables
[Figure: datasets with the same mean, standard deviation, and r = 0.816 (Wilks, 2011).]

Regressions (models): y = f(x)
Dependent and independent variables:
Absorption spectra.
Time series of scattering.
What about chlorophyll vs. size?

Regressions of type I and type II
Uncertainties in y only: $y(x) = ax + b$
$$\chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - a x_i - b}{\sigma_i} \right)^2$$
Minimize $\chi^2$ by taking its derivatives with respect to a and b and setting them equal to zero.
What if we have errors in both x and y? $y(x) = ax + b$
$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - a x_i - b)^2}{\sigma_{y_i}^2 + a^2 \sigma_{x_i}^2}, \qquad \mathrm{Var}(y_i - a x_i - b) = \sigma_{y_i}^2 + a^2 \sigma_{x_i}^2$$
Minimize $\chi^2$ by taking its derivatives with respect to a and b and setting them equal to zero.
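For the case of uncertainties in y only, setting the two derivatives to zero gives a pair of linear equations in a and b with a closed-form solution. The sketch below implements that solution with a as the slope and b as the intercept; the function name `fit_line_type1` and the data are hypothetical. The case with errors in both x and y is nonlinear in a and is not covered here.

```python
def fit_line_type1(x, y, sigma):
    """Weighted least-squares line y = a*x + b, uncertainties in y only."""
    w = [1.0 / s ** 2 for s in sigma]  # weights 1/sigma_i^2
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx ** 2
    a = (S * Sxy - Sx * Sy) / delta    # slope
    b = (Sxx * Sy - Sx * Sxy) / delta  # intercept
    chi2 = sum(((yi - a * xi - b) / si) ** 2
               for xi, yi, si in zip(x, y, sigma))
    return a, b, chi2

# Hypothetical measurements with equal uncertainties in y.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
sigma = [0.2] * 5
a, b, chi2 = fit_line_type1(x, y, sigma)
print(a, b, chi2)
```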

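The coefficient of determination: $R^2 = 1 - \mathrm{MSE}/\mathrm{Var}(y)$.
MSE = mean square error = the average of the squared model error; dividing it by the variance of y gives the fraction of variance the model leaves unexplained.
What variance does it explain? Can it reveal cause and effect? How is it affected by dynamic range?
R is the correlation coefficient.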
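The dynamic-range question can be explored numerically. In the sketch below the same noise is added to the same underlying relationship (y = x) sampled over a narrow and a wide range of x. The data are hypothetical, and both MSE and Var(y) are normalized by N here (an assumption, so that the ratio is consistent).

```python
def r_squared(x, y):
    """R^2 = 1 - MSE/Var(y) for an ordinary least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx    # slope
    b = my - a * mx  # intercept
    mse = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y)) / n
    var_y = sum((yi - my) ** 2 for yi in y) / n
    return 1.0 - mse / var_y

# Identical noise, identical true relationship, different dynamic range.
noise = [0.3, -0.3, 0.3, -0.3, 0.3, -0.3]
x_narrow = [0.2 * k for k in range(6)]
x_wide = [2.0 * k for k in range(6)]
y_narrow = [xi + e for xi, e in zip(x_narrow, noise)]
y_wide = [xi + e for xi, e in zip(x_wide, noise)]

r2_narrow = r_squared(x_narrow, y_narrow)
r2_wide = r_squared(x_wide, y_wide)
print(r2_narrow, r2_wide)
```

With the model error unchanged, the wider range yields a much higher R^2, because Var(y) grows while the MSE does not.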

Regressions of type I and type II
Classic type II approach (Ricker, 1973): the slope of the type II regression is the geometric mean of the slope of y vs. x and the inverse of the slope of x vs. y.
$$y(x) = ax + b, \qquad x(y) = cy + d$$
$$a_{II} = \pm\sqrt{a/c} = \pm\frac{\sigma_y}{\sigma_x}, \qquad \text{with the sign given by } \mathrm{sgn}\left\{\sum_i x_i y_i\right\}$$
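A minimal sketch of the geometric-mean slope, built from the two ordinary regressions. The data are hypothetical. Passing the line through the point of means to obtain an intercept is an assumption added here (the slide gives only the slope), and the sign is taken from the slope of y vs. x, which has the sign of the covariance.

```python
import math
import statistics

def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def type2_line(x, y):
    """Geometric-mean (type II) slope and an intercept through the means."""
    a = ols_slope(x, y)  # slope of y vs. x
    c = ols_slope(y, x)  # slope of x vs. y
    # a/c is positive whenever the covariance is non-zero; a zero
    # covariance would make this division undefined.
    a2 = math.copysign(math.sqrt(a / c), a)
    b2 = statistics.mean(y) - a2 * statistics.mean(x)
    return a2, b2

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.3, 3.6, 6.4, 7.5, 10.2]
a_type1 = ols_slope(x, y)
a_type2, b_type2 = type2_line(x, y)
print(a_type1, a_type2, b_type2)
```

The magnitude of the type II slope equals the ratio of the standard deviations, so it is never smaller than that of the type I slope.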

Filtering noisy signals. Smoothing of data
What is noise?
Instrumental (electronic) noise.
Environmental noise.
"One person's noise may be another person's signal."
Matlab: filtfilt
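The idea behind forward-backward (zero-phase) filtering can be sketched in a few lines: run a low-pass filter over the signal, then run it again over the reversed output, so that the lag introduced by the first pass is cancelled by the second. The sketch below uses a simple first-order low-pass as a stand-in. It is not Matlab's filtfilt, which accepts arbitrary filter coefficients and treats the edges more carefully. The function names, the smoothing constant `alpha`, and the test signal are hypothetical.

```python
def lowpass(x, alpha):
    """First-order low-pass (exponential smoothing), 0 < alpha <= 1."""
    y = [x[0]]
    for v in x[1:]:
        y.append(y[-1] + alpha * (v - y[-1]))
    return y

def forward_backward(x, alpha):
    """Apply the low-pass forward, then backward, to cancel the phase lag."""
    fwd = lowpass(x, alpha)
    bwd = lowpass(fwd[::-1], alpha)
    return bwd[::-1]

# Hypothetical signal: a constant level of 1 with alternating noise.
noisy = [1.0 + 0.5 * (-1) ** i for i in range(50)]
smooth = forward_backward(noisy, 0.2)
print(min(smooth[10:40]), max(smooth[10:40]))
```

Away from the edges the smoothed values sit close to the underlying level; the first and last few points still carry a start-up transient.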

Method of fluctuation
Lab aggregation experiment: sample volume, measurement time (Briggs et al., 2013).

Modeling of data
Condense/summarize data by fitting it to a model that depends on adjustable parameters.
Example, CDM spectra:
$$a_g(\lambda) = \tilde{a}_g \exp(-s(\lambda - \lambda_0))$$
Particulate attenuation spectra:
$$c_p(\lambda) = \tilde{c}_p \left( \frac{\lambda}{\lambda_0} \right)^{-\gamma}$$

Modeling of data
Example: CDM spectra, $a_g(\lambda) = \tilde{a}_g \exp(-s(\lambda - \lambda_0))$.
Merit function:
$$\chi^2 = \sum_{i=1}^{9} \left[ a_g(\lambda_i) - \tilde{a}_g \exp(-s(\lambda_i - \lambda_0)) \right]^2 \Rightarrow \tilde{a}_g, s$$
For non-linear models, there is no guarantee of a single minimum. Need to provide an initial guess.
Matlab: fminsearch
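As an illustration of minimizing this merit function, the sketch below uses a different technique from fminsearch (which is a Nelder-Mead simplex search). For a fixed slope s the best amplitude follows from a linear least-squares formula, so only s needs a one-dimensional golden-section search. The bracket `[s_lo, s_hi]` plays the role of the initial guess. The nine wavelengths, the reference wavelength, and the synthetic "measurements" are assumptions chosen for illustration.

```python
import math

def best_amplitude(s, lam, a_meas, lam0):
    """For a fixed slope s, the amplitude that minimizes chi^2 (linear)."""
    e = [math.exp(-s * (l - lam0)) for l in lam]
    return sum(ai * ei for ai, ei in zip(a_meas, e)) / sum(ei * ei for ei in e)

def chi2(amp, s, lam, a_meas, lam0):
    """Unweighted sum of squared differences between data and model."""
    return sum((ai - amp * math.exp(-s * (l - lam0))) ** 2
               for ai, l in zip(a_meas, lam))

def fit_cdm(lam, a_meas, lam0, s_lo, s_hi, tol=1e-9):
    """Golden-section search over s within the bracket [s_lo, s_hi]."""
    def cost(s):
        return chi2(best_amplitude(s, lam, a_meas, lam0), s, lam, a_meas, lam0)
    g = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = s_lo, s_hi
    while hi - lo > tol:
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
        if cost(c) < cost(d):
            hi = d  # minimum lies in [lo, d]
        else:
            lo = c  # minimum lies in [c, hi]
    s = 0.5 * (lo + hi)
    return best_amplitude(s, lam, a_meas, lam0), s

# Hypothetical wavelengths (nm) and noise-free synthetic data.
lam = [412.0, 440.0, 488.0, 510.0, 532.0, 555.0, 650.0, 676.0, 715.0]
lam0 = 440.0
a_meas = [0.5 * math.exp(-0.015 * (l - lam0)) for l in lam]

amp_fit, s_fit = fit_cdm(lam, a_meas, lam0, s_lo=0.001, s_hi=0.05)
print(amp_fit, s_fit)
```

A golden-section search only finds the minimum inside its bracket and assumes a single minimum there, which is one concrete form of the caveat above about initial guesses.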

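Modeling of data
Let's assume that we have a model $y = y(\lambda; \mathbf{a})$.
A more robust merit function, with $y_i$ the measurement at $\lambda_i$:
$$\tilde{\chi} = \sum_{i=1}^{N} \left| \frac{y_i - y(\lambda_i; \mathbf{a})}{\sigma_i} \right|$$
Problem: the derivative is not continuous.
Can be used to fit lines.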
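For a straight line this merit function can be minimized without derivatives. The sketch below is one possible approach, assuming equal uncertainties on all points: for a given slope a, the best intercept is the median of the residuals y_i - a*x_i, and the slope is then found by bisection on a sign-based sum. The function names, the slope bracket, and the data (an exact line with one outlier) are hypothetical.

```python
import statistics

def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def l1_line(x, y, a_lo, a_hi, iters=60):
    """Fit y = a*x + b by minimizing sum |y_i - a*x_i - b|."""
    def intercept(a):
        # For a fixed slope, the median residual minimizes the sum.
        return statistics.median([yi - a * xi for xi, yi in zip(x, y)])

    def g(a):
        # Sign-based sum; positive when the slope is too small.
        b = intercept(a)
        return sum(xi * sign(yi - a * xi - b) for xi, yi in zip(x, y))

    for _ in range(iters):
        mid = 0.5 * (a_lo + a_hi)
        if g(mid) > 0:
            a_lo = mid
        else:
            a_hi = mid
    a = 0.5 * (a_lo + a_hi)
    return a, intercept(a)

# Hypothetical data: y = 2x + 1 with a single outlier at x = 4.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 3.0, 5.0, 7.0, 30.0, 11.0, 13.0]
a_l1, b_l1 = l1_line(x, y, a_lo=0.0, a_hi=5.0)
print(a_l1, b_l1)
```

In this example the fit recovers the underlying line despite the outlier, whereas a least-squares fit would be pulled toward it.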

Statistical description of data (Press et al., 1992)

Monte-Carlo/Bootstrap methods
Need to establish confidence intervals in:
1. Fitting-model parameters (e.g. CDM fit).
2. Model output (e.g. Hydrolight).
[Diagram: in -> out]

Bootstrap
When there is an uncertainty (or possible error) associated with the input, vary the inputs with random errors and observe the effect on the output:
in_1 -> out_1
in_2 -> out_2
in_3 -> out_3
...
in_N -> out_N

Bootstrap
Example: how to assign uncertainties in the derived spectral slope of CDOM.
Merit function:
$$\chi^2 = \sum_{i=1}^{9} \left[ a_g(\lambda_i) \pm \Delta_i - \tilde{a}_g \exp(-s(\lambda_i - \lambda_0)) \right]^2$$
Randomly add uncertainties ($\Delta_i$) to each measurement, each time performing the fit (e.g. using randn.m in Matlab, RAND in Excel). Then do the stats for the different s.
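The procedure above can be sketched as a loop: perturb each measurement with a random error, refit, and collect the slopes. In this sketch `random.gauss` plays the role of randn. The fit uses a one-dimensional golden-section search over s, with the amplitude solved linearly at each step (a stand-in for whatever fitting routine is actually used). The wavelengths, the stand-in measurements, the uncertainty `delta`, and the number of replicates are all assumptions.

```python
import math
import random
import statistics

def fit_slope(lam, a_meas, lam0, s_lo=0.001, s_hi=0.05, tol=1e-7):
    """Least-squares spectral slope s of a_meas ~ amp*exp(-s*(lam - lam0))."""
    def cost(s):
        e = [math.exp(-s * (l - lam0)) for l in lam]
        # Best amplitude for this s (linear least squares).
        amp = sum(ai * ei for ai, ei in zip(a_meas, e)) / sum(ei * ei for ei in e)
        return sum((ai - amp * ei) ** 2 for ai, ei in zip(a_meas, e))
    g = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = s_lo, s_hi
    while hi - lo > tol:
        c = hi - g * (hi - lo)
        d = lo + g * (hi - lo)
        if cost(c) < cost(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)

random.seed(0)  # fixed seed so the example is repeatable

# Hypothetical wavelengths (nm) and stand-in "measurements".
lam = [412.0, 440.0, 488.0, 510.0, 532.0, 555.0, 650.0, 676.0, 715.0]
lam0 = 440.0
a_obs = [0.5 * math.exp(-0.015 * (l - lam0)) for l in lam]
delta = 0.01  # assumed 1-sigma uncertainty of each measurement

slopes = []
for _ in range(200):
    # Add a random error to every measurement, then refit.
    perturbed = [ai + random.gauss(0.0, delta) for ai in a_obs]
    slopes.append(fit_slope(lam, perturbed, lam0))

s_mean = statistics.mean(slopes)
s_std = statistics.stdev(slopes)
print(s_mean, s_std)
```

The spread of the fitted slopes (`s_std`) is the quantity of interest: it provides an estimate of the uncertainty in s given the assumed measurement uncertainty.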

Summary
Use statistics logically. If you don't know the underlying distribution, use non-parametric stats.
Statistics does not prove anything but can give you a sense of the likelihood of a hypothesis (about relationships).
I strongly encourage you to study hypothesis tests and Bayesian methods. Beware that they are often misused.