Statistical analysis using matlab. HY 439 Presented by: George Fortetsanakis

Statstcal analyss usng matlab HY 439 Presented by: George Fortetsanaks

Roadmap Probablty dstrbutons Statstcal estmaton Fttng data to probablty dstrbutons

Contnuous dstrbutons Contnuous random varable X takes values n subset of real numbers D R X corresponds to measurement of some property, e.g., length, weght Not possble to talk about the probablty of X takng a specfc value P( X 0 Instead talk about probablty of X lyng n a gven nterval P( 1 X 2 P( X [ 1, 2] P( X P( X [, ]

Probablty densty functon (pdf Contnuous functon p( defned for each D Probablty of X lyng n nterval I D computed by ntegral: Eamples: P ( X l p( d l Important property: P( 1 X 2 P( X [ 1, 2 ] 2 1 p( d P ( X P( X [, ] p( d D P( X D p( d 1

Cumulatve dstrbuton functon (cdf For each D defnes the probablty Important propertes: Complementary cumulatve dstrbuton functon (ccdf ( X P d p X P X P F ( ], [ ( ( ( 0 ( F 1 ( F ( ( ( 1 2 2 1 F F X P ( 1 ( 1 ( ( F X P X P G

Eponental dstrbuton Probablty densty functon Cumulatve dstrbuton functon Memoryless property: P( T t T P( T t

Posson process Random process that descrbes the tmestamps of varous events Telephone call arrvals Packet arrvals on a router Tme between two consecutve arrvals follows eponental dstrbuton Arrval 1 Arrval 2 Arrval 3 Arrval 4 Arrval 5 Arrval 6 Arrval 7 t 1 t 2 t 3 t 4 t 5 t 6 Tme ntervals t 1, t 2, t 3, are drawn from eponental dstrbuton

Roadmap Probablty dstrbutons Statstcal estmaton Fttng data to probablty dstrbutons

Basc statstcs Suppose a set of measurements = [ 1 2 n ] ^ Estmaton of mean value: 1 (matlab m=mean(; 1 Estmaton of standard devaton: (matlab s=std(; n n ^ n ^ n 1 2

Estmate pdf Suppose dataset = [ 1 2 k ] Can we estmate the pdf that values n follow?

Estmate pdf Suppose dataset = [ 1 2 k ] Can we estmate the pdf that values n follow? Produce hstogram

Step 1 Dvde samplng space nto a number of bns -4-2 0 2 4 Measure the number of samples n each bn 3 samples 5 samples 6 samples 2 samples -4-2 0 2 4

P( Frequency Step 2 6 5 3 2-4 -2 0 2 4 E = total area under hstogram plot = 2*3 + 2*5 + 2*6 +2*2 = 32 Normalze y as by dvdng by E 6/32 5/32 3/32 2/32-4 -2 0 2 4

Matlab code functon produce_hstogram(, bns % nput parameters % X =[ 1 ; 2 ; n ]: a column vector contanng the data 1, 2,, n. % bns = [b 1 ; b 2 ; b k ]: A vector that Dvdes the samplng space n bns % centered around the ponts b1, b2,, bk. end fgure; % Create a new fgure [f y] = hst(, bns; % Assgn your data ponts to the correspondng bns bar(y, f/trapz(y,f, 1; % Plot the hstogram label(''; % Name as ylabel('p('; % Name as y

10000 samples 1000 samples Hstogram eamples Bn spacng 0.1 Bn spacng 0.05

Emprcal cdf How can we estmate the cdf that values n follow? Use matlab functon ecdf( Emprcal cdf estmated wth 300 samples from normal dstrbuton

Percentles Values of varable below whch a certan percentage of observatons fall 80th percentle s the value, below whch 80 % of observatons fall. 80 th percentle

Estmate percentles Percentles n matlab: p = prctle(, y; y takes values n nterval [0 100] 80 th percentle: p = prctle(, 80; Medan: the 50 th percentle med = prctle(, 50; or med = medan(; Why s medan dfferent than the mean? Suppose dataset = [1 100 100]: mean = 201/3=67, medan = 100

Roadmap Elements of probablty theory Probablty dstrbutons Statstcal estmaton Fttng data to probablty dstrbutons

Problem defnton Dataset D={ 1, 2,, k } collected from an eperment Famles of dstrbutons: Gaussan: θ Eponental: θ Generalzed pareto:, S θ { P1 ( θ1, P2 ( θ2,..., PN ( θν,, } Whch famly of dstrbutons better descrbes the dataset D?

Step 1: Mamum lkelhood estmaton For each famly determne parameter that better fts the data Mamze lkelhood of obtanng the data wth respect to k j j j k j k p p p D p 1 1 2 1 ( ln arg ma ( arg ma,...,, ( arg ma ( arg ma θ θ θ θ * θ θ θ θ θ * θ θ Lkelhood functon Due to ndependence of samples

Eample: eponental dstrbuton Probablty densty functon Defne the log-lkelhood functon Set dervatve equal to 0 to fnd mamum k k k k k e l 1 1 1 1 ln( ln( ln( ( k k k k d dl 1 * 1 0 0 (

Reform queston After MLE: nstead of famles we have specfc dstrbutons P( * * 1 θ * 1, P2 ( θ2,..., PN ( θ Whch dstrbuton better descrbes the data? Choose most approprate dstrbuton based on: Q-Q plots Kullback Lebler dvergence

Method of Q-Q plots Checks how well a probablty dstrbuton P ( * θ descrbes the data Algorthm 1. Draw random datasets Υ 0, Υ 1, Υ 2,, Υ Μ from dstrbutonp ( 2. Compute percentles of these datasets at predefned set of ponts 3. Compute percentles of epermental dataset D at the same ponts 4. Plot percentles of Y 0 aganst percentles of each of Y 1, Y 2,.., Y M 5. Plot percentles of Y 0 aganst percentles of dataset D * θ If plot of step 5 s n the area defned by plots n step 4 the dstrbuton descrbes the data well

Plot percentles of Y0 vs. percentles of Y1

Plot percentles of Y0 vs. percentles of Y2

Plot percentles of Y0 vs. percentles of Y100

Construct envelope

Plot percentles of Y 0 vs. percentles of D Good fttng: The blue curve of orgnal percentles les n the envelope

Plot percentles of Y 0 vs. percentles of D Bad fttng: The blue curve of orgnal percentles les outsde the envelope

Method of Kullback Lebler dvergence Non-symmetrc metrc of dfference between dstrbutons P and Q Dscrete dstrbutons D KL ( P Q N p( log p( 1 q( Contnuous dstrbutons p( D KL ( P Q p( log q( d

Algorthm 1. Dscretze the emprcal pdf of the Dataset D 3 samples 5 samples 6 samples 2 samples 3/16 5/16 6/16 2/16-4 -2 0 2 4-3 -1 1 3 2. Dscretze all dstrbutons P( * * 1 θ * 1, P2 ( θ2,..., PN ( θ 3. Compute KL dvergence of theoretcal dstrbutons wth dataset D 4. Choose the dstrbuton wth the lowest KL dvergence

Onlne materal http://www.csd.uoc.gr/~hy439/schedule09.html Tutorals Statstcs

Cross correlaton corr(, y: estmates the cross correlaton between two tme seres and y R y ( m E[ nm yn] E[ n ynm] The larger the absolute value of the cross correlaton the larger the correlaton of the two varables Whte nose Output of IIR flter No correlaton Some correlaton