Statistical analysis using matlab. HY 439 Presented by: George Fortetsanakis

Similar documents
Statistics Spring MIT Department of Nuclear Engineering

U-Pb Geochronology Practical: Background

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Negative Binomial Regression

Chapter 1. Probability

Composite Hypotheses testing

Statistical Inference. 2.3 Summary Statistics Measures of Center and Spread. parameters ( population characteristics )

Some basic statistics and curve fitting techniques

CIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M

Chapter 13: Multiple Regression

Gaussian Mixture Models

Goodness of fit and Wilks theorem

IRO0140 Advanced space time-frequency signal processing

MIMA Group. Chapter 2 Bayesian Decision Theory. School of Computer Science and Technology, Shandong University. Xin-Shun SDU

A random variable is a function which associates a real number to each element of the sample space

Probability Theory. The nth coefficient of the Taylor series of f(k), expanded around k = 0, gives the nth moment of x as ( ik) n n!

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Hydrological statistics. Hydrological statistics and extremes

Probability and Random Variable Primer

Definition. Measures of Dispersion. Measures of Dispersion. Definition. The Range. Measures of Dispersion 3/24/2014

Maximum Likelihood Estimation

Notes prepared by Prof Mrs) M.J. Gholba Class M.Sc Part(I) Information Technology

PHYS 450 Spring semester Lecture 02: Dealing with Experimental Uncertainties. Ron Reifenberger Birck Nanotechnology Center Purdue University

Expectation Maximization Mixture Models HMMs

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

Maximum Likelihood Estimation of Binary Dependent Variables Models: Probit and Logit. 1. General Formulation of Binary Dependent Variables Models

Suites of Tests. DIEHARD TESTS (Marsaglia, 1985) See

Engineering Risk Benefit Analysis

Probability Theory (revisited)

Introduction to Random Variables

Comparison of Regression Lines

Randomness and Computation

1. Inference on Regression Parameters a. Finding Mean, s.d and covariance amongst estimates. 2. Confidence Intervals and Working Hotelling Bands

Generative classification models

Lecture Notes on Linear Regression

NUMERICAL DIFFERENTIATION

The Multiple Classical Linear Regression Model (CLRM): Specification and Assumptions. 1. Introduction

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

Statistics Chapter 4

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Fall 2012 Analysis of Experimental Measurements B. Eisenstein/rev. S. Errede. . For P such independent random variables (aka degrees of freedom): 1 =

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

j) = 1 (note sigma notation) ii. Continuous random variable (e.g. Normal distribution) 1. density function: f ( x) 0 and f ( x) dx = 1

Laboratory 1c: Method of Least Squares

PhysicsAndMathsTutor.com

8 : Learning in Fully Observed Markov Networks. 1 Why We Need to Learn Undirected Graphical Models. 2 Structural Learning for Completely Observed MRF

Convergence of random processes

Modeling and Simulation NETW 707

CS-433: Simulation and Modeling Modeling and Probability Review

Independent Component Analysis

The EM Algorithm (Dempster, Laird, Rubin 1977) The missing data or incomplete data setting: ODL(φ;Y ) = [Y;φ] = [Y X,φ][X φ] = X

ROBUST AND EFFICIENT ESTIMATION OF THE MODE OF CONTINUOUS DATA: THE MODE AS A VIABLE MEASURE OF CENTRAL TENDENCY

Appendix B: Resampling Algorithms

Module 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Homework Assignment 3 Due in class, Thursday October 15

RELIABILITY ASSESSMENT

CS 2750 Machine Learning. Lecture 5. Density estimation. CS 2750 Machine Learning. Announcements

Lecture 3: Probability Distributions

Chapter 7 Channel Capacity and Coding

STAT 511 FINAL EXAM NAME Spring 2001

Stat260: Bayesian Modeling and Inference Lecture Date: February 22, Reference Priors

Distributions /06. G.Serazzi 05/06 Dimensionamento degli Impianti Informatici distrib - 1

Chapter 11: Simple Linear Regression and Correlation

Conjugacy and the Exponential Family

Laboratory 3: Method of Least Squares

SDMML HT MSc Problem Sheet 4

Quantifying Uncertainty

The Geometry of Logit and Probit

Here is the rationale: If X and y have a strong positive relationship to one another, then ( x x) will tend to be positive when ( y y)

An Application of Fuzzy Hypotheses Testing in Radar Detection

A REVIEW OF ERROR ANALYSIS

Retrieval Models: Language models

Limited Dependent Variables

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)

Generalized Linear Methods

Simulated Power of the Discrete Cramér-von Mises Goodness-of-Fit Tests

Lecture 3: Shannon s Theorem

Chapter 3 Describing Data Using Numerical Measures

Dimension Reduction and Visualization of the Histogram Data

Lecture 12: Discrete Laplacian

Statistical inference for generalized Pareto distribution based on progressive Type-II censored data with random removals

Chapter 9: Statistical Inference and the Relationship between Two Variables

A Robust Method for Calculating the Correlation Coefficient

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Maximum Likelihood Estimation (MLE)

Lossy Compression. Compromise accuracy of reconstruction for increased compression.

EM and Structure Learning

princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora

Economics 130. Lecture 4 Simple Linear Regression Continued

Exam. Econometrics - Exam 1

4.1. Lecture 4: Fitting distributions: goodness of fit. Goodness of fit: the underlying principle

Introduction to Regression

Outline. Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning. Our Model and Data. Outline

Feb 14: Spatial analysis of data fields

3.1 ML and Empirical Distribution

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition)

9. Binary Dependent Variables

A be a probability space. A random vector

ANSWERS CHAPTER 9. TIO 9.2: If the values are the same, the difference is 0, therefore the null hypothesis cannot be rejected.

Classification as a Regression Problem

Transcription:

Statstcal analyss usng matlab HY 439 Presented by: George Fortetsanaks

Roadmap Probablty dstrbutons Statstcal estmaton Fttng data to probablty dstrbutons

Contnuous dstrbutons Contnuous random varable X takes values n subset of real numbers D R X corresponds to measurement of some property, e.g., length, weght Not possble to talk about the probablty of X takng a specfc value P( X 0 Instead talk about probablty of X lyng n a gven nterval P( 1 X 2 P( X [ 1, 2] P( X P( X [, ]

Probablty densty functon (pdf Contnuous functon p( defned for each D Probablty of X lyng n nterval I D computed by ntegral: Eamples: P ( X l p( d l Important property: P( 1 X 2 P( X [ 1, 2 ] 2 1 p( d P ( X P( X [, ] p( d D P( X D p( d 1

Cumulatve dstrbuton functon (cdf For each D defnes the probablty Important propertes: Complementary cumulatve dstrbuton functon (ccdf ( X P d p X P X P F ( ], [ ( ( ( 0 ( F 1 ( F ( ( ( 1 2 2 1 F F X P ( 1 ( 1 ( ( F X P X P G

Eponental dstrbuton Probablty densty functon Cumulatve dstrbuton functon Memoryless property: P( T t T P( T t

Posson process Random process that descrbes the tmestamps of varous events Telephone call arrvals Packet arrvals on a router Tme between two consecutve arrvals follows eponental dstrbuton Arrval 1 Arrval 2 Arrval 3 Arrval 4 Arrval 5 Arrval 6 Arrval 7 t 1 t 2 t 3 t 4 t 5 t 6 Tme ntervals t 1, t 2, t 3, are drawn from eponental dstrbuton

Roadmap Probablty dstrbutons Statstcal estmaton Fttng data to probablty dstrbutons

Basc statstcs Suppose a set of measurements = [ 1 2 n ] ^ Estmaton of mean value: 1 (matlab m=mean(; 1 Estmaton of standard devaton: (matlab s=std(; n n ^ n ^ n 1 2

Estmate pdf Suppose dataset = [ 1 2 k ] Can we estmate the pdf that values n follow?

Estmate pdf Suppose dataset = [ 1 2 k ] Can we estmate the pdf that values n follow? Produce hstogram

Step 1 Dvde samplng space nto a number of bns -4-2 0 2 4 Measure the number of samples n each bn 3 samples 5 samples 6 samples 2 samples -4-2 0 2 4

P( Frequency Step 2 6 5 3 2-4 -2 0 2 4 E = total area under hstogram plot = 2*3 + 2*5 + 2*6 +2*2 = 32 Normalze y as by dvdng by E 6/32 5/32 3/32 2/32-4 -2 0 2 4

Matlab code functon produce_hstogram(, bns % nput parameters % X =[ 1 ; 2 ; n ]: a column vector contanng the data 1, 2,, n. % bns = [b 1 ; b 2 ; b k ]: A vector that Dvdes the samplng space n bns % centered around the ponts b1, b2,, bk. end fgure; % Create a new fgure [f y] = hst(, bns; % Assgn your data ponts to the correspondng bns bar(y, f/trapz(y,f, 1; % Plot the hstogram label(''; % Name as ylabel('p('; % Name as y

10000 samples 1000 samples Hstogram eamples Bn spacng 0.1 Bn spacng 0.05

Emprcal cdf How can we estmate the cdf that values n follow? Use matlab functon ecdf( Emprcal cdf estmated wth 300 samples from normal dstrbuton

Percentles Values of varable below whch a certan percentage of observatons fall 80th percentle s the value, below whch 80 % of observatons fall. 80 th percentle

Estmate percentles Percentles n matlab: p = prctle(, y; y takes values n nterval [0 100] 80 th percentle: p = prctle(, 80; Medan: the 50 th percentle med = prctle(, 50; or med = medan(; Why s medan dfferent than the mean? Suppose dataset = [1 100 100]: mean = 201/3=67, medan = 100

Roadmap Elements of probablty theory Probablty dstrbutons Statstcal estmaton Fttng data to probablty dstrbutons

Problem defnton Dataset D={ 1, 2,, k } collected from an eperment Famles of dstrbutons: Gaussan: θ Eponental: θ Generalzed pareto:, S θ { P1 ( θ1, P2 ( θ2,..., PN ( θν,, } Whch famly of dstrbutons better descrbes the dataset D?

Step 1: Mamum lkelhood estmaton For each famly determne parameter that better fts the data Mamze lkelhood of obtanng the data wth respect to k j j j k j k p p p D p 1 1 2 1 ( ln arg ma ( arg ma,...,, ( arg ma ( arg ma θ θ θ θ * θ θ θ θ θ * θ θ Lkelhood functon Due to ndependence of samples

Eample: eponental dstrbuton Probablty densty functon Defne the log-lkelhood functon Set dervatve equal to 0 to fnd mamum k k k k k e l 1 1 1 1 ln( ln( ln( ( k k k k d dl 1 * 1 0 0 (

Reform queston After MLE: nstead of famles we have specfc dstrbutons P( * * 1 θ * 1, P2 ( θ2,..., PN ( θ Whch dstrbuton better descrbes the data? Choose most approprate dstrbuton based on: Q-Q plots Kullback Lebler dvergence

Method of Q-Q plots Checks how well a probablty dstrbuton P ( * θ descrbes the data Algorthm 1. Draw random datasets Υ 0, Υ 1, Υ 2,, Υ Μ from dstrbutonp ( 2. Compute percentles of these datasets at predefned set of ponts 3. Compute percentles of epermental dataset D at the same ponts 4. Plot percentles of Y 0 aganst percentles of each of Y 1, Y 2,.., Y M 5. Plot percentles of Y 0 aganst percentles of dataset D * θ If plot of step 5 s n the area defned by plots n step 4 the dstrbuton descrbes the data well

Plot percentles of Y0 vs. percentles of Y1

Plot percentles of Y0 vs. percentles of Y2

Plot percentles of Y0 vs. percentles of Y100

Construct envelope

Plot percentles of Y 0 vs. percentles of D Good fttng: The blue curve of orgnal percentles les n the envelope

Plot percentles of Y 0 vs. percentles of D Bad fttng: The blue curve of orgnal percentles les outsde the envelope

Method of Kullback Lebler dvergence Non-symmetrc metrc of dfference between dstrbutons P and Q Dscrete dstrbutons D KL ( P Q N p( log p( 1 q( Contnuous dstrbutons p( D KL ( P Q p( log q( d

Algorthm 1. Dscretze the emprcal pdf of the Dataset D 3 samples 5 samples 6 samples 2 samples 3/16 5/16 6/16 2/16-4 -2 0 2 4-3 -1 1 3 2. Dscretze all dstrbutons P( * * 1 θ * 1, P2 ( θ2,..., PN ( θ 3. Compute KL dvergence of theoretcal dstrbutons wth dataset D 4. Choose the dstrbuton wth the lowest KL dvergence

Onlne materal http://www.csd.uoc.gr/~hy439/schedule09.html Tutorals Statstcs

Cross correlaton corr(, y: estmates the cross correlaton between two tme seres and y R y ( m E[ nm yn] E[ n ynm] The larger the absolute value of the cross correlaton the larger the correlaton of the two varables Whte nose Output of IIR flter No correlaton Some correlaton