BASICS ON DISTRIBUTIONS

Histograms
Consider an experiment in which different outcomes are possible (e.g. dice tossing). The probabilities of all the outcomes can be represented in a histogram.

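Distributions
Probabilities are described with distributions.
Discrete variables: data are valued on a discrete set: $I = \{x_1, \dots, x_n\}$
Probability distribution: $f(x_i)$ = probability for the value $x_i$
Cumulative distribution: $F(x_i)$ = probability for a value $\le x_i$
Normalization: $\sum_{x_i \in I} f(x_i) = 1$, $F(x_{\mathrm{HIGHEST}}) = 1$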

Examples
Consider a coin with a tail probability equal to p.
1) Flip the coin three times. Plot the probability distribution for the number of tails.
2) Flip the coin until the first tail appears. Plot the probability distribution for the number of flips.
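A minimal sketch of the two exercises, assuming independent flips. It tabulates the probabilities instead of plotting them; the function names and the truncation parameter `max_flips` are illustrative choices, not part of the original slides.

```python
from math import comb

def tails_in_three_flips(p):
    """P(k tails in 3 flips) for k = 0..3, with tail probability p."""
    return [comb(3, k) * p**k * (1 - p)**(3 - k) for k in range(4)]

def flips_until_first_tail(p, max_flips=10):
    """P(first tail appears on flip n) for n = 1..max_flips (truncated)."""
    return [(1 - p)**(n - 1) * p for n in range(1, max_flips + 1)]

p = 0.5  # fair coin, chosen only for illustration
print(tails_in_three_flips(p))       # [0.125, 0.375, 0.375, 0.125]
print(flips_until_first_tail(p, 4))  # [0.5, 0.25, 0.125, 0.0625]
```

The first list sums to 1 exactly; the second sums to 1 only in the limit of infinitely many flips.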

Empirical Distributions
Experimental data are represented with normalized histograms.
Discrete variables: data are valued on a discrete set: $I = \{x_1, \dots, x_n\}$
Probability distribution: $f(x_i) = N(x = x_i)/N_{tot}$
Cumulative distribution: $F(x_i) = N(x \le x_i)/N_{tot}$
Normalization: $\sum_{x_i \in I} f(x_i) = 1$
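As an illustration, the normalized histogram and the cumulative distribution can be computed from raw counts. This is a sketch; the die rolls below are made-up data and `empirical_distribution` is a hypothetical helper name.

```python
from collections import Counter

def empirical_distribution(data):
    """Return (f, F): relative frequency and cumulative frequency per value."""
    counts = Counter(data)
    n_tot = len(data)
    f, F, running = {}, {}, 0
    for x in sorted(counts):
        f[x] = counts[x] / n_tot   # f(x_i) = N(x = x_i) / N_tot
        running += counts[x]
        F[x] = running / n_tot     # F(x_i) = N(x <= x_i) / N_tot
    return f, F

rolls = [1, 3, 3, 6, 2, 3, 6, 1]  # hypothetical die rolls
f, F = empirical_distribution(rolls)
print(f[3], F[2], F[6])  # 0.375 0.375 1.0
```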

Characterization: Mode
The mode is the value that occurs most frequently in a probability distribution. For empirical distributions, the mode is the value that occurs most frequently in a data set. More than one mode can be present.

Characterization: Median
A median is the numeric value separating the higher half of a sample, a population, or a probability distribution from the lower half. The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. If there is an even number of observations, then there is no single middle value; the median is then usually defined to be the mean of the two middle values.


Averages (The Median)
The median is the middle value of a set of data once the data has been ordered.
Example. Robert hit 11 balls at Grimsby driving range. The recorded distances of his drives, measured in yards, are given below. Find the median distance for his drives.
Data: 85, 125, 130, 65, 100, 70, 75, 50, 140, 95, 70
Ordered data: 50, 65, 70, 70, 75, 85, 95, 100, 125, 130, 140
Single middle value: median drive = 85 yards

Averages (The Median): the median algorithm
Sort the data.
If the number of data is odd: take the middle value.
Else: take the average of the two central values.
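The algorithm above can be sketched in a few lines. The input lists are small illustrative examples, not data from the slides.

```python
def median(data):
    """Median by sorting: middle value if n is odd, mean of the two central values otherwise."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))     # 2
print(median([4, 1, 3, 2]))  # 2.5
```

Python's standard library offers the same behaviour through `statistics.median`.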

Measuring the Spread with the Median

Finding the median, quartiles and inter-quartile range.
Example: Find the median and quartiles for the data below.
Data: 12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10
Ordered data: 4, 4, 5, 6, 8, 8, 8, 9, 9, 9, 10, 12
Lower quartile $Q_1$ = 5½ (median of the lower half)
Median $Q_2$ = 8
Upper quartile $Q_3$ = 9 (median of the upper half)
Inter-quartile range = 9 - 5½ = 3½
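A sketch of the "median of each half" convention used in this example, assuming the data set is 12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10 and at least two values are given. Other quartile conventions exist (interpolating methods can return slightly different values), so `quartiles` below is one illustrative choice, not the only definition.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(data):
    """Return (Q1, Q2, Q3); Q1 and Q3 are medians of the lower and upper halves.
    For an odd number of values the middle value belongs to neither half."""
    ordered = sorted(data)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered), median(ordered[-half:])

data = [12, 6, 4, 9, 8, 4, 9, 8, 5, 9, 8, 10]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3, q3 - q1)  # 5.5 8.0 9.0 3.5
```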

Expected Value
For a given function $g(x)$ and a probability distribution $f(x)$, the expected value of a discrete variable is computed as:
$E[g(x)] = \sum_{x_i \in I} g(x_i) f(x_i)$

Properties of Expected Values
1) $E[X+Y] = E[X] + E[Y]$
2) $E[X+c] = E[X] + c$
3) $E[aX+bY] = aE[X] + bE[Y]$
For example, if $E[X] = 5$ and $E[Y] = 6$, then
$E[X+5] = 5+5 = 10$
$E[2X+5] = 2 \cdot 5 + 5 = 15$
$E[3X+2Y] = 3 \cdot 5 + 2 \cdot 6 = 27$
Prove 1), 2) and 3).

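Characterization: Mean
Mean: $\mu = E[x] = \sum_{x_i \in I} x_i f(x_i)$
For empirical distributions: $E[x] = \sum_{x_i \in I} x_i \frac{N(x = x_i)}{N_{TOT}} = \frac{1}{N_{TOT}} \sum_{j \in \mathrm{data}} x_j$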

Mean and Variance
Variance: $\sigma^2 = E[(x - E[x])^2] = E[x^2] - (E[x])^2 = \mathrm{Var}(x)$

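Properties of Expected Values and Variances
1. $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$, if X and Y are independent.
2. $\mathrm{Var}(aX) = a^2 \mathrm{Var}(X)$
3. $\mathrm{Var}(X+c) = \mathrm{Var}(X)$
4. $\mathrm{Var}(aX+bY) = a^2 \mathrm{Var}(X) + b^2 \mathrm{Var}(Y)$, if X and Y are independent.
Prove 4.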

Variance and Moments of a Random Variable
Definition: The k-th moment of a random variable X is $E[X^k]$.
Definition: The variance of a random variable X is defined as $\mathrm{Var}[X] = E[(X - E[X])^2] = E[X^2] - (E[X])^2$. The standard deviation of a random variable X is $\sigma[X] = \sqrt{\mathrm{Var}[X]}$.

Definition: The covariance of two random variables X and Y is $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$.
Theorem: For any two random variables X and Y, $\mathrm{Var}[X+Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}(X, Y)$.

Proof: directly from $\mathrm{Var}[X] = E[(X - E[X])^2]$:
$\mathrm{Var}[X+Y] = E[(X + Y - E[X+Y])^2]$
$= E[(X + Y - E[X] - E[Y])^2]$
$= E[(X - E[X])^2 + (Y - E[Y])^2 + 2(X - E[X])(Y - E[Y])]$
$= E[(X - E[X])^2] + E[(Y - E[Y])^2] + 2E[(X - E[X])(Y - E[Y])]$
$= \mathrm{Var}[X] + \mathrm{Var}[Y] + 2\,\mathrm{Cov}(X, Y)$
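The identity for the variance of a sum can be checked exactly on a small example. The joint distribution below is hypothetical, chosen so that X and Y are dependent; exact fractions avoid rounding issues.

```python
from fractions import Fraction as Fr

# Hypothetical joint distribution Pr(X = x, Y = y); the probabilities sum to 1.
joint = {(0, 0): Fr(1, 2), (0, 1): Fr(1, 6), (1, 0): Fr(1, 12), (1, 1): Fr(1, 4)}

def expect(g):
    """E[g(X, Y)] under the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())

ex, ey = expect(lambda x, y: x), expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - ex) ** 2)
var_y = expect(lambda x, y: (y - ey) ** 2)
cov = expect(lambda x, y: (x - ex) * (y - ey))
var_sum = expect(lambda x, y: (x + y - ex - ey) ** 2)

print(cov)                                     # 1/9, so X and Y are not independent
print(var_sum == var_x + var_y + 2 * cov)      # True
```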

Theorem: If X and Y are two independent random variables, then $E[X \cdot Y] = E[X] \cdot E[Y]$.

Proof: directly from $E[X] = \sum_i i \Pr(X = i)$:
$E[X \cdot Y] = \sum_i \sum_j (i \cdot j) \Pr((X = i) \cap (Y = j))$
$= \sum_i \sum_j (i \cdot j) \Pr(X = i) \Pr(Y = j)$
$= \left(\sum_i i \Pr(X = i)\right) \left(\sum_j j \Pr(Y = j)\right)$
$= E[X] \cdot E[Y]$

Corollary: If X and Y are independent random variables, then $\mathrm{Cov}(X, Y) = 0$ and $\mathrm{Var}[X+Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.

Theorem: Let $X_1, X_2, \dots, X_n$ be mutually independent random variables; then $\mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathrm{Var}[X_i]$.

Binomial distribution
Compute the probability of obtaining exactly 3 tails tossing 5 coins. These are all the possibilities:
TTTTT
HTTTT THTTT TTHTT TTTHT TTTTH
HHTTT HTHTT HTTHT HTTTH THHTT THTHT THTTH TTHHT TTHTH TTTHH
TTHHH THTHH THHTH THHHT HTTHH HTHTH HTHHT HHTTH HHTHT HHHTT
THHHH HTHHH HHTHH HHHTH HHHHT
HHHHH
Ten of the 32 equally likely configurations contain exactly 3 tails: $f(3) = 10/32$.
And what if the probabilities of head and tail are different?

Binomial distribution
How can we count without enumerating?
Total number of configurations: 5 events, with two possibilities each: $2^5 = 32$.
Number of configurations with two heads (that is, three tails):
1) choose 3 of the 5 positions: $5 \cdot (5-1) \cdot (5-2)$ possibilities;
2) the order of choice is irrelevant: there are 6 possible permutations of 3 objects (3!).
$5 \cdot 4 \cdot 3 / 6 = 5!/(3! \cdot 2!) = 10$
And what if the probabilities of head and tail are different?
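The counting argument can be cross-checked against a brute-force enumeration. This is only a sanity check, using Python's standard `itertools.product` and `math.comb`.

```python
from itertools import product
from math import comb, factorial

# Enumerate all 2^5 sequences of heads (H) and tails (T).
outcomes = ["".join(seq) for seq in product("HT", repeat=5)]
three_tails = [o for o in outcomes if o.count("T") == 3]

print(len(outcomes), len(three_tails))                             # 32 10
print(comb(5, 3), factorial(5) // (factorial(3) * factorial(2)))   # 10 10
```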

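Binomial distribution
Probability of having k positive events out of n independent experiments, where the probability of a positive event is p:
$f(k) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}$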

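Binomial distribution (mean)
$E[k] = \sum_{k=0}^{n} k f(k) = \sum_{k=1}^{n} k \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}$
$= np \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!(n-k)!} p^{k-1} (1-p)^{n-k}$
Define $j = k-1$:
$= np \sum_{j=0}^{n-1} \frac{(n-1)!}{j!(n-1-j)!} p^{j} (1-p)^{n-1-j} = np$
The last sum equals 1 because it is the normalization of a binomial distribution with $n-1$ experiments.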

Binomial distribution (variance)
$\mathrm{Var}(k) = E[k^2] - (E[k])^2 = np(1-p)$

Example: Variance of a Binomial Random Variable
The variance of a binomial random variable X with parameters n and p can be determined directly by computing $E[X^2]$:
$E[X^2] = \sum_{j=0}^{n} \binom{n}{j} p^j (1-p)^{n-j} j^2 = n(n-1)p^2 + np$
$\mathrm{Var}[X] = E[X^2] - (E[X])^2 = n(n-1)p^2 + np - n^2p^2 = np - np^2 = np(1-p)$
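The closed forms for the binomial mean and variance can be verified by summing over the distribution with exact fractions. The values n = 5 and p = 1/3 are arbitrary illustrative choices.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """Probability of k positive events out of n independent experiments."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 5, Fraction(1, 3)  # arbitrary example parameters
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
second_moment = sum(k**2 * binomial_pmf(k, n, p) for k in range(n + 1))
var = second_moment - mean**2

print(mean, var)  # 5/3 10/9
```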

Evaluation of parameters from empirical data
Generally speaking, a parametric model M aims to reproduce a set of known data.
Real data (D); model M with parameters T; modelled data.
How to find the best parameters?

Maximum likelihood
$T^* = \arg\max_T P(D \mid T, M) = \arg\max_T \log P(D \mid T, M)$
D = data, M = model, T = model parameters

Evaluation of p
Given n independent experiments, with k positive events, how to evaluate the best p?
MAXIMUM LIKELIHOOD definition: the best p is the p that maximises the total probability of the experiment results.
Example: Which is the p that maximises the outcome of 00000000000? Which is the statistical error?

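Evaluation of p
$f(k; p) = \binom{n}{k} p^k (1-p)^{n-k}$
$\left.\frac{df}{dp}\right|_{p_{ML}} = \binom{n}{k} \left[ k\,p_{ML}^{k-1} (1-p_{ML})^{n-k} - (n-k)\,p_{ML}^{k} (1-p_{ML})^{n-k-1} \right] = 0$
$k(1-p_{ML}) - (n-k)\,p_{ML} = 0 \;\Rightarrow\; p_{ML} = \frac{k}{n}$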
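A numerical sanity check of the result $p_{ML} = k/n$: a brute-force grid search over p, not the analytical derivation. The counts n = 20 and k = 6 are hypothetical.

```python
from math import comb

def likelihood(p, k, n):
    """Binomial probability of k positive events in n experiments."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, k = 20, 6  # hypothetical experiment: 6 positive events out of 20
grid = [i / 1000 for i in range(1, 1000)]  # candidate values of p in (0, 1)
p_best = max(grid, key=lambda p: likelihood(p, k, n))

print(p_best, k / n)  # 0.3 0.3
```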

Poisson distribution
The Poisson distribution is a discrete probability distribution that expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate (λ) and independently of the time since the last event. (The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.)

Poisson distribution
$f(k) = \frac{\lambda^k e^{-\lambda}}{k!}$
$E[k] = \lambda$
$\mathrm{Var}(k) = \lambda$
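The statement that both the mean and the variance equal λ can be checked numerically by truncating the infinite sums. The rate λ = 3 and the cutoff of 100 terms are arbitrary choices for this sketch.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """f(k) = lam^k * exp(-lam) / k!"""
    return lam**k * exp(-lam) / factorial(k)

lam = 3.0         # arbitrary example rate
ks = range(100)   # truncation: terms beyond k = 100 are negligible for lam = 3
total = sum(poisson_pmf(k, lam) for k in ks)
mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print(round(total, 9), round(mean, 9), round(var, 9))  # 1.0 3.0 3.0
```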

Continuous Distributions
Data are described with distributions.
Continuous variables: data are valued on a continuous interval: $x \in I$
Cumulative probability: $F(x) = N(X \le x)/N_{tot}$
$\mathrm{Prob}(A < x \le B) = F(B) - F(A) = \int_A^B f(x)\,dx$
Probability density function (limit): $f(x) = \frac{dF(x)}{dx}$
Normalization: $\int_I f(x)\,dx = 1$

Mean and Variance
Mean: $\mu = E[x] = \int_I x f(x)\,dx$
Variance: $\sigma^2 = \mathrm{Var}(x) = E[(x-\mu)^2] = \int_I (x-\mu)^2 f(x)\,dx$

The Uniform Probability Distribution
Uniform probability density function:
$f(x) = 1/(b-a)$ for $a \le x \le b$
$f(x) = 0$ elsewhere
where a = smallest value the variable can assume, b = largest value the variable can assume.
The probability of the continuous random variable assuming a specific value is 0: $P(x = x_1) = 0$.

Example: Buffet
Customers are charged for the amount of salad they take. Sampling suggests that the amount of salad taken is uniformly distributed between 5 ounces and 15 ounces.
Probability density function:
$f(x) = 1/10$ for $5 \le x \le 15$
$f(x) = 0$ elsewhere
where x = salad plate filling weight.

Example: Buffet
What is the probability that a customer will take between 12 and 15 ounces of salad?
$P(12 \le x \le 15) = (1/10)(3) = 0.3$
(Figure: $f(x) = 1/10$ plotted against salad weight in oz.)

The Uniform Probability Distribution
$P(8 \le x \le 12) = ?$
$P(8 \le x \le 12) = (1/10)(12-8) = 0.4$

The Uniform Probability Distribution
$P(0 \le x \le 12) = ?$
The density is zero below 5 ounces, so $P(0 \le x \le 12) = P(5 \le x \le 12) = (1/10)(12-5) = 0.7$
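The three buffet probabilities can be reproduced with a small helper that clips the requested interval to the support [a, b]. The function name `uniform_prob` and its default arguments (5 and 15 ounces) are illustrative.

```python
def uniform_prob(lo, hi, a=5.0, b=15.0):
    """P(lo <= x <= hi) for x uniform on [a, b]; the interval is clipped to [a, b]."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) / (b - a)

print(uniform_prob(12, 15))  # 0.3
print(uniform_prob(8, 12))   # 0.4
print(uniform_prob(0, 12))   # 0.7
```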

The Uniform Probability Distribution
Uniform probability density function:
$f(x) = 1/(b-a)$ for $a \le x \le b$
$f(x) = 0$ elsewhere
Expected value of x: $E(x) = (a+b)/2$
Variance of x: $\mathrm{Var}(x) = (b-a)^2/12$
where a = smallest value the variable can assume, b = largest value the variable can assume.
Prove it!

Normal distribution
The normal distribution is considered the most basic continuous probability distribution. Specifically, by the central limit theorem, under certain conditions the sum of a number of random variables with finite means and variances approaches a normal distribution as the number of variables increases. For this reason, the normal distribution is commonly encountered in practice, and is used throughout statistics, natural sciences, and social sciences as a simple model for complex phenomena. For example, the observational error in an experiment is usually assumed to follow a normal distribution, and the propagation of uncertainty is computed using this assumption.

Normal distribution
$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
μ: mean
σ²: variance

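Normal distribution: continuous
$f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
μ: mean, σ²: variance
$E[x] = \int_{-\infty}^{+\infty} x f(x; \mu, \sigma)\,dx = \mu$
$E[(x-\mu)^2] = \int_{-\infty}^{+\infty} (x-\mu)^2 f(x; \mu, \sigma)\,dx = \sigma^2$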

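Gaussian distribution in a D-dimensional space
$N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$
where μ is a D-valued vector (means) and Σ is a D×D symmetric matrix (covariance matrix), with determinant |Σ|.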

Estimating Mean and Variance from sampled data
Consider a set of n independent and identically distributed data, following a normal distribution with mean μ and variance σ²: $x_1, \dots, x_n = X$
How can we estimate mean and variance?
Example: X=0.;0.8;0.4;0.4;0.05;0.6;0.5;0.9,0.49

Estimating Mean and Variance from sampled data
MAXIMUM LIKELIHOOD: estimate the parameters so as to maximize the probability of the sampled values:
$(\mu_{ML}, \sigma_{ML}) = \arg\max_{\mu, \sigma} f(x_1, \dots, x_n \mid \mu, \sigma)$

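ML Mean and Variance
The probability for the sampled data is
$f(X \mid \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$
It has to be maximised over the variables μ and σ.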

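ML Mean and Variance
Since the logarithm is a monotonic function:
$(\mu_{ML}, \sigma_{ML}) = \arg\max_{\mu, \sigma} f(X \mid \mu, \sigma) = \arg\max_{\mu, \sigma} \ln f(X \mid \mu, \sigma)$
$\ln f(X \mid \mu, \sigma) = -n \ln \sigma - \frac{n}{2} \ln(2\pi) - \sum_{i=1}^{n} \frac{(x_i-\mu)^2}{2\sigma^2}$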

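ML Mean and Variance
$\left.\frac{\partial \ln f}{\partial \mu}\right|_{\mu_{ML}, \sigma_{ML}} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \frac{x_i - \mu_{ML}}{\sigma_{ML}^2} = 0$
$\left.\frac{\partial \ln f}{\partial \sigma}\right|_{\mu_{ML}, \sigma_{ML}} = 0 \;\Rightarrow\; -\frac{n}{\sigma_{ML}} + \sum_{i=1}^{n} \frac{(x_i - \mu_{ML})^2}{\sigma_{ML}^3} = 0$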

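ML Mean and Variance
$\mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i$
$\sigma_{ML}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_{ML})^2$
Are these the expressions that you usually apply?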

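$\mu_{ML}$ is an unbiased estimate
Consider a set of n data generated from a normal distribution $N(\mu, \sigma^2)$. The estimate $\mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is an unbiased estimate of μ, since
$E[\mu_{ML}] = \frac{1}{n} \sum_{i=1}^{n} E[x_i] = \frac{1}{n} \sum_{i=1}^{n} \int x_i \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\,dx_i = \frac{1}{n}\, n \mu = \mu$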

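$\sigma_{ML}^2$ is a biased estimate
With $M = \mu_{ML}$, the estimate $\sigma_{ML}^2 = \frac{1}{n} \sum_{j=1}^{n} (x_j - M)^2$ is a biased estimate of σ², since
$E[\sigma_{ML}^2] = E\left[\frac{1}{n} \sum_j x_j^2 - M^2\right] = E[x^2] - \frac{1}{n^2} \sum_j \sum_k E[x_j x_k]$
$= (\sigma^2 + \mu^2) - \frac{1}{n^2} \left[ n(\sigma^2 + \mu^2) + n(n-1)\mu^2 \right] = \frac{n-1}{n} \sigma^2$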

$\sigma_{ML}$ underestimates σ
(Figure: green, real distribution; red, sampled distributions.)
$\sigma_{ML}$ is evaluated with respect to the sample mean.

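Unbiased estimation of μ and σ
$M = \mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i$
$S^2 = \frac{n}{n-1} \sigma_{ML}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - M)^2$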
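The two variance estimators differ only in the denominator. The sketch below computes the sample mean, the maximum-likelihood variance (divide by n) and the unbiased variance (divide by n - 1) on a hypothetical sample; the function name is an illustrative choice.

```python
def mean_and_variances(xs):
    """Return (M, sigma2_ML, S2) for a sample xs with at least two values."""
    n = len(xs)
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)  # sum of squared deviations from M
    return m, ss / n, ss / (n - 1)

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical data
m, var_ml, s2 = mean_and_variances(sample)
print(m, var_ml)   # 5.0 4.0
print(s2)          # 32/7, slightly larger than the ML estimate
```

In Python's standard library, `statistics.pvariance` divides by n and `statistics.variance` divides by n - 1.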

Distribution of the sample mean
Consider different sets of sampled data $X_k$ (e.g. different levels of expression of genes in different populations of individuals).
A sample mean for each set can be defined: $M_k = \frac{1}{n_k} \sum_{i=1}^{n_k} x_i^{(k)}$
How are they distributed?

Distribution of the sample means
$M = \frac{1}{n} \sum_{i=1}^{n} x_i$
M is a linear combination of independent identically distributed normal variables: it is normally distributed.

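Distribution of the sample means
$E[M] = \frac{1}{n} \sum_{i=1}^{n} E[x_i] = \mu$
$\mathrm{var}(M) = E[(M - E[M])^2] = E[M^2] - (E[M])^2 = \frac{1}{n^2} \sum_j \sum_k E[x_j x_k] - \mu^2 = \frac{\sigma^2}{n}$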


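Testing the mean and the variance of a sample against a known distribution
Given a random variable X that you suppose to be normally distributed, with mean μ and variance σ² known a priori, suppose you have a sample $X_1, X_2, \dots, X_n$.
Compute the sample mean $M = \frac{1}{n} \sum_{i=1}^{n} X_i$ and the sample (unbiased) variance $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - M)^2$.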

Does M significantly differ from μ?
Null hypothesis $H_0$: M = μ
Alternative hypothesis $H_a$: M ≠ μ
If $H_0$ is true, then M (the sample mean) is normally distributed with mean μ and variance σ²/n. Then
$Z = \frac{M - \mu}{\sigma/\sqrt{n}}$
is normally distributed, with mean 0 and variance 1.

Two-tailed Z-test
Given the sampled absolute value of Z, we can compute the probability of obtaining a value equal or higher (in absolute value) on the normal distribution. If that probability (P-value) is lower than the desired significance (e.g. 0.05), we can reject $H_0$.
Significance: maximum allowed probability of rejecting a true null hypothesis.
(Figure: standard normal density with two tails, each of area P-value/2, below -Z-score and above +Z-score.)

Is M significantly higher than μ?
Null hypothesis $H_0$: M ≤ μ
Alternative hypothesis $H_a$: M > μ
If $H_0$ is true, M ≤ μ. The extreme case for our testing is M = μ: in that case, again,
$Z = \frac{M - \mu}{\sigma/\sqrt{n}}$
is normally distributed, with mean 0 and variance 1.

One-tailed Z-test
Given the Z-score computed from the sample, we consider only the positive values of the Z-score and compute the probability of obtaining a value equal or higher on the normal distribution. If that probability is lower than the desired significance (e.g. 0.05), we can reject $H_0$.
(Figure: standard normal density with a single tail of area P-value above the Z-score.)
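Both P-values can be computed without a table from the complementary error function, since $P(Z \ge z) = \frac{1}{2}\,\mathrm{erfc}(z/\sqrt{2})$ for a standard normal Z. This is a sketch; the function names and the numbers in the usage example (sample mean 10.5, μ = 10, σ = 1, n = 16) are hypothetical.

```python
from math import erfc, sqrt

def z_score(sample_mean, mu, sigma, n):
    """Z = (M - mu) / (sigma / sqrt(n)), with sigma known a priori."""
    return (sample_mean - mu) / (sigma / sqrt(n))

def p_one_tailed(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * erfc(z / sqrt(2))

def p_two_tailed(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return erfc(abs(z) / sqrt(2))

z = z_score(sample_mean=10.5, mu=10.0, sigma=1.0, n=16)  # hypothetical sample
print(z)                          # 2.0
print(round(p_two_tailed(z), 4))  # 0.0455
print(round(p_one_tailed(z), 4))  # 0.0228
```

With these hypothetical numbers, the two-tailed test rejects $H_0$ at significance 0.05 but not at 0.01.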

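(Table: standard normal table giving the area between 0 and z; rows are indexed by the first decimal digit of z, columns by the second decimal digit.)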

Example
If Z is equal to , can we reject the null hypothesis at significance 5%? And at significance %?

Distribution of the sample deviations
If $X_i$ are n independent, normally distributed random variables with mean μ and variance σ², then the random variable
$Q = \sum_{i=1}^{n} \frac{(X_i - M)^2}{\sigma^2}$
is distributed according to the chi-square distribution with n-1 degrees of freedom. This is usually written as $Q \sim \chi^2_{n-1}$.
The chi-square distribution has one parameter: (n-1), a positive integer that specifies the number of degrees of freedom (i.e. the number of independent deviations $X_i - M$).

Chi-square distribution (k degrees of freedom)
Mean = k
Variance = 2k

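Distribution of the sample deviations
Since $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - M)^2 = \frac{\sigma^2}{n-1} Q$, the unbiased sample variance, rescaled as $(n-1)S^2/\sigma^2$, is distributed according to the chi-square distribution with k = n-1 degrees of freedom. It follows that
$E[S^2] = \frac{\sigma^2}{n-1} E[Q] = \frac{\sigma^2}{n-1} (n-1) = \sigma^2$
so $S^2$ is unbiased.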

Distribution of the sample deviations: proof for 1 degree of freedom
Let the random variable Y be defined as $Y = X^2$, where X has a normal distribution with mean 0 and variance 1. We can compute the cumulative function of Y.

Distribution of the sample deviations: proof for 1 degree of freedom (case y > 0)
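The equations of this proof are not reproduced here; a standard version of the argument is sketched below under the stated assumption $X \sim N(0, 1)$. The symbols $\Phi$ and $\varphi$ (standard normal cumulative function and density) are notation introduced for this sketch.

```latex
% Cumulative function of Y = X^2 for y > 0 (F_Y(y) = 0 for y <= 0)
F_Y(y) = P(X^2 \le y) = P(-\sqrt{y} \le X \le \sqrt{y}) = 2\,\Phi(\sqrt{y}) - 1

% Differentiating with respect to y gives the density
f_Y(y) = \frac{dF_Y}{dy} = 2\,\varphi(\sqrt{y})\,\frac{1}{2\sqrt{y}}
       = \frac{1}{\sqrt{2\pi y}}\, e^{-y/2}, \qquad y > 0
```

This is the chi-square density with 1 degree of freedom.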

Student's t-distribution
ν: degrees of freedom

Normal vs t-distributions
(Figure: the normal density compared with t-distributions for different numbers of degrees of freedom K.)

Extreme value distribution
The Gumbel distribution is used to model the distribution of the maximum (or the minimum) of a number of samples of various distributions. For example, we would use it to represent the distribution of the maximum level of a river in a particular year if we had the list of maximum values for the past ten years. It is useful in predicting the chance that an extreme earthquake, flood or other natural disaster will occur.


BLAST
In accordance with the Gumbel EVD, the probability p of observing a score S equal to or greater than x is given by the equation

BLAST: E-value
The expect score E of a database match is the number of times that an unrelated database sequence would obtain a score S higher than x by chance. The expectation E obtained in a search for a database of D sequences is given by