EE/CpE 345. Modeling and Simulation. Fall Class 10 November 18, 2002


EE/CpE 345 Modeling and Simulation, Class 10, November 18, 2002

Input Modeling

[Block diagram: Inputs(t) drive the Actual System to produce Outputs(t); the same Inputs(t), with parameters to be determined, drive the Simulated System to produce its Outputs(t).]

The input data is the driving force for the simulation: the behavior of the simulation, and all the results and conclusions that can be reached, depend on appropriate inputs.

Developing a Model of Input Data

1. Collect data from the real system. Is there time to collect enough data? Are there other sources available to obtain relevant information?
2. Identify a probability distribution to model the observed data. Start with a histogram of the data to enable visualization. Is anything known about the process?
3. Choose parameters for a specific instance of the distribution family. Valid data is especially important for this step.
4. Evaluate goodness of fit (χ² and Kolmogorov-Smirnov tests).

Repeat these steps as needed.

Data Collection

Data collection is one of the most important, and often the hardest, tasks in solving a real simulation problem. Data is often either scarce or overly abundant, and GIGO (Garbage In = Garbage Out) often applies: the simulation abstracts the real data, hiding its inadequacies.

Suggestions to improve data collection:
- PLAN! Do some trial runs to see if there are any special circumstances that will have to be captured.
- Analyze/summarize data during collection; this might highlight a problem with the data being collected.
- Look for homogeneity, with a plan to combine similar data sets.
- Watch for data censoring: is an observation of a process complete, or are long procedures truncated artificially?
- Use a scatter plot to see relationships between variables; other senses help as well.
- Look for correlation in the data.
- Distinguish input data (independent variables) from performance data (dependent variables).

Identifying the Distribution

Methods to identify candidate distributions:
- Histograms
- Selecting the family of distributions
- Quantile-quantile plots

Histograms

- Bins group individual observations over the range of observed values.
- Bar height is proportional to the number of observations in each bin.
- Number of bins ≈ sqrt(number of observations).

Histograms - Picking the Number of Bins

In the Mathcad example, N := 1000 samples are drawn from a Normal distribution (µ := 2, σ := 2.5) with rnorm, and histograms are built with three bin counts: M1 := 5, M2 := floor(sqrt(N)) and M3 := 100. The three resulting plots show how the bin count affects the visualized shape: too few bins hide the shape, too many produce a ragged histogram, and the square-root rule is a reasonable compromise. (All Mathcad examples are in the file Class10.mcd.)
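
The same experiment can be sketched outside Mathcad. A rough Python equivalent, assuming NumPy and Matplotlib (the seed and figure layout are arbitrary choices, not part of the original example):

    import numpy as np
    import matplotlib.pyplot as plt

    # Parameters from the Mathcad example: mu = 2, sigma = 2.5, N = 1000 samples
    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=2.5, size=1000)

    # Three bin counts: too few (5), the square-root rule (~31), and too many (100)
    bin_counts = [5, int(np.floor(np.sqrt(data.size))), 100]

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, bins in zip(axes, bin_counts):
        ax.hist(data, bins=bins)
        ax.set_title(f"{bins} bins")
    plt.show()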

Meaning of the Histogram for Input Modeling

In the Mathcad example, pnorm (the Normal c.d.f.) is evaluated at the bin edges; successive differences convert the c.d.f. into cell probabilities, which are then multiplied by N to give expected counts that can be overlaid on the histogram.

Scaled by the total number of points, the histogram approximates the p.d.f. of the input distribution. Use the histogram to visualize the p.d.f. of the observed data to enable selection of a known distribution function.
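
A minimal Python sketch of the same idea, assuming NumPy, SciPy, and Matplotlib: a density-scaled histogram of the observed data is overlaid with a candidate p.d.f. for visual comparison (the true parameters are used here for simplicity; in practice they would be estimated).

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=2.5, size=1000)

    # density=True scales bar heights so the histogram integrates to 1,
    # i.e. it approximates the p.d.f. rather than raw counts
    counts, edges, _ = plt.hist(data, bins=32, density=True, alpha=0.5)

    # Overlay the candidate p.d.f. for visual comparison
    x = np.linspace(edges[0], edges[-1], 200)
    plt.plot(x, stats.norm.pdf(x, loc=2.0, scale=2.5))
    plt.show()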

Selecting the Family of the Distribution

There are a large number of probability distributions, generated from observations of the real world, that have been proposed as theoretical models. What is known about the physical characteristics of the input process?
- Is it naturally discrete or continuous valued?
- Are the observable values inherently bounded, or is there no natural bound?

Can you infer a distribution from what you know about the process that generates the input values? For example (see the sketch below):
- A Normal (Gaussian) process arises from the sum of a large number of independent random variables.
- An Erlang process is the sum of several exponential processes.
- A lognormal process arises from the product of several component processes.
- A Poisson process models the number of independent events that occur in a bounded period of time or region of space.
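
A small numerical sketch of two of these constructions (Python with NumPy; the parameter values are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    # Sum of k independent Exponential(rate) variables is Erlang(k, rate)
    k, rate, n = 4, 0.5, 100_000
    erlang_like = rng.exponential(scale=1.0 / rate, size=(n, k)).sum(axis=1)
    print(erlang_like.mean(), k / rate)          # both approximately 8

    # Product of many positive factors is approximately lognormal,
    # since the log of the product is a sum of many independent terms
    factors = rng.uniform(0.9, 1.1, size=(n, 50))
    lognormal_like = factors.prod(axis=1)
    print(np.log(lognormal_like).mean(), np.log(lognormal_like).std())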

Quantile-Quantile Plots

In the Mathcad example, N := 30 samples R are drawn from a Normal distribution with rnorm (sample mean 1.604, sample variance 5.458). Given a set of input data R:
1. Calculate the sample mean and variance (estimates of µ and σ²).
2. Sort R (called S in the Mathcad file).
3. Generate a set of γ_j := (j − 1/2)/N, evenly distributed between 0 and 1.
4. For the assumed distribution (using the calculated µ and σ²), calculate the inverse of the c.d.f. at each γ_j (qnorm in Mathcad).
5. Plot the sorted data against the calculated values.

If the assumed distribution matches, the plot should be approximately a straight line with slope = 1 and intercept = 0.
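
A Python sketch of the same procedure, assuming NumPy, SciPy, and Matplotlib; the sample here is generated rather than measured, and the sample mean and standard deviation stand in for the Mathcad mean() and var() calls:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=2.5, size=30)

    sorted_data = np.sort(data)
    n = data.size
    gamma = (np.arange(1, n + 1) - 0.5) / n          # (j - 1/2)/n, j = 1..n

    # Inverse c.d.f. of the assumed distribution, using estimated parameters
    theoretical = stats.norm.ppf(gamma, loc=data.mean(), scale=data.std(ddof=1))

    plt.scatter(theoretical, sorted_data)
    lims = [theoretical.min(), theoretical.max()]
    plt.plot(lims, lims)                             # reference line: slope 1, intercept 0
    plt.xlabel("theoretical quantiles")
    plt.ylabel("sorted data")
    plt.show()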

Q-Q Plots with Incorrect Assumptions

Using the wrong distribution: in the Mathcad example, N := 1000 exponentially distributed random numbers (rexp) are matched against a Normal distribution. The resulting q-q plot is not linear, as expected. If you have Mathcad or the Mathcad viewer, try experimenting with other distributions and parameters.

Parameter Estimation

Mean and variance from $n$ raw samples:
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$, \qquad $S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n-1}$

For discrete data grouped into $k$ frequency-distribution classes, with frequency $f_j$ and value $X_j$ in class $j$:
$\bar{X} = \frac{1}{n}\sum_{j=1}^{k} f_j X_j$, \qquad $S^2 = \frac{\sum_{j=1}^{k} f_j X_j^2 - n\bar{X}^2}{n-1}$

Parameter Estimation (continued)

For continuous data grouped into $c$ frequency classes when the raw data is not available, with $m_j$ the midpoint of class $j$ and $f_j$ its frequency:
$\bar{X} \approx \frac{1}{n}\sum_{j=1}^{c} f_j m_j$, \qquad $S^2 \approx \frac{\sum_{j=1}^{c} f_j m_j^2 - n\bar{X}^2}{n-1}$
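
A minimal sketch of the grouped-data formulas in Python (NumPy assumed; the midpoints and frequencies below are made-up illustrative values):

    import numpy as np

    # Frequency-class data: midpoints m_j and counts f_j (illustrative values)
    midpoints = np.array([2.5, 7.5, 12.5, 17.5])
    freqs = np.array([10, 24, 12, 4])
    n = freqs.sum()

    # Grouped-data mean and variance, as in the formulas above
    xbar = (freqs * midpoints).sum() / n
    s2 = ((freqs * midpoints**2).sum() - n * xbar**2) / (n - 1)
    print(xbar, s2)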

Parameter Estimation: Suggested Estimators

Distribution — Parameter(s) — Suggested estimator(s)
Poisson — $\alpha$ — $\hat{\alpha} = \bar{X}$
Exponential — $\lambda$ — $\hat{\lambda} = 1/\bar{X}$
Normal — $\mu$, $\sigma^2$ — $\hat{\mu} = \bar{X}$, $\hat{\sigma}^2 = S^2$

The suggested estimators are maximum-likelihood estimators based on raw data. The estimates are not expected to equal the true parameters, even if the distribution family were known, because of small sample size and noise/randomness in the measurements.

Parameter Estimation: Mathcad Example

With µ := 3.5, σ := 2, and λ := 0.1, fifty samples are drawn from a Normal distribution (rnorm) and fifty from an exponential distribution (rexp). The sample means and variances are then computed; these should equal the true parameters, but differ because of sampling variability.

Distinguish the parameter of the distribution, $\alpha$, from the estimator or statistic, $\hat{\alpha}$.
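
A rough Python analogue of this Mathcad check, using the suggested estimators from the table above (NumPy assumed; λ = 0.1 is the value reconstructed from the garbled slide, and the seed is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    # True parameters (as in the Mathcad example): mu = 3.5, sigma = 2, lambda = 0.1
    normal_sample = rng.normal(3.5, 2.0, size=50)
    exp_sample = rng.exponential(scale=1.0 / 0.1, size=50)   # scale = 1/lambda

    # Suggested estimators from the table above
    mu_hat = normal_sample.mean()
    sigma2_hat = normal_sample.var(ddof=1)
    lambda_hat = 1.0 / exp_sample.mean()

    print(mu_hat, sigma2_hat, lambda_hat)   # close to, but not equal to, 3.5, 4, 0.1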

Goodness of Fit

Hypothesis testing was introduced 2 weeks ago to test random-number distributions: the Kolmogorov-Smirnov test and the χ² test.

Goodness-of-fit tests should guide the choice of a distribution, not establish it: there is often no perfect answer with real-world data.

Sample size can have a significant effect on the results:
- With a small number of data points, few or no candidate distributions will be rejected.
- With a large number of data points, almost all candidate distributions will be rejected.

Goodness-of-Fit Tests

χ² test:
- Compares the histogram of the data to the candidate density function.
- Valid for large sample sizes.
- Assumes the parameters are estimated by maximum likelihood.

χ² test with equal probabilities:
- If the distribution is assumed to be continuous, the class intervals should be chosen with equal probability rather than equal width. See Example 9.4 (next slide).

Kolmogorov-Smirnov test:
- With the χ² test, the grouping of the data may influence the accept/reject decision; the K-S test is based on examining a q-q plot, so no grouping is needed.
- Especially useful when no parameters have been estimated from the data.

Example 9.4 - χ² Test for an Exponential Distribution

In the Mathcad example, n := 50 observations X are generated from an exponential distribution (rexp, λ := 0.1), and the χ² test with k := 8 equal-probability classes is applied:
- The rate is estimated as $\hat{\lambda} = 1/\bar{X}$.
- The class endpoints are $a_i = -\ln(1 - i\,p)/\hat{\lambda}$ with $p = 1/k = 0.125$, so each class has expected count $E_i = n\,p = 6.25$.
- The observed counts $O_i$ come from a histogram of the sorted data, and $\chi_0^2 = \sum_i (O_i - E_i)^2 / E_i$.

Here the observed counts are 4, 1, 9, 13, 7, 8, 4, 4, giving $\chi_0^2 = 15.92$. With $k - 1 - 1 = 6$ degrees of freedom (one parameter estimated from the data), this exceeds the tabulated value at a significance level of 0.05 (12.59) but not at 0.01 (16.81): whether we reject the exponential model depends on how tight a significance level we demand.
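
A Python sketch of the same equal-probability χ² test, assuming NumPy and SciPy; the data are regenerated here, so the observed counts and χ₀² will differ from the Mathcad example's values:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, k = 50, 8
    data = rng.exponential(scale=10.0, size=n)        # lambda = 0.1 -> scale = 10

    lam_hat = 1.0 / data.mean()                       # estimated rate, 1/Xbar

    # Equal-probability interval endpoints: a_i = -ln(1 - i*p)/lambda_hat
    p = 1.0 / k
    edges = np.concatenate(([0.0], -np.log(1 - np.arange(1, k) * p) / lam_hat, [np.inf]))

    observed, _ = np.histogram(data, bins=edges)
    expected = np.full(k, n * p)                      # n/k = 6.25 in each class

    chi_sq = ((observed - expected) ** 2 / expected).sum()
    dof = k - 1 - 1                                   # one parameter estimated from the data
    print(chi_sq, stats.chi2.ppf(0.95, dof))          # compare to the 5% critical value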

p-Values and Best Fits

In the prior example, with the particular data examined, the hypothesis that the data came from an exponential distribution would have been rejected at a significance level of 0.05, but not at a significance level of 0.01.

How should you choose a significance level? 0.01, 0.05, and 0.10 are commonly used. The significance level is the probability of falsely rejecting H₀.

Many software packages compute a p-value:
- The p-value is the significance level at which the test just rejects H₀.
- The p-value can be viewed as a measure of fit: a larger p-value indicates a better fit.

One possible approach to choosing a distribution (sketched below): test every distribution available and choose the one with the largest p-value.
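
One way to sketch the "largest p-value wins" idea in Python, assuming SciPy; the candidate list is illustrative, not exhaustive:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.exponential(scale=10.0, size=200)

    # Fit several candidate families and compare K-S p-values (larger = better fit)
    candidates = {"expon": stats.expon, "gamma": stats.gamma, "norm": stats.norm}
    for name, dist in candidates.items():
        params = dist.fit(data)                        # maximum-likelihood fit
        _, p_value = stats.kstest(data, name, args=params)
        print(f"{name:6s} p-value = {p_value:.3f}")

Strictly speaking, the K-S p-value is only approximate when the parameters are estimated from the same data, but it still serves as a relative measure of fit, which is how it is used here.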

Selecting Input Models without Data

What about when there is no input data available to model, e.g., for a preliminary study when no time or funds are available to gather data?

Creating data where none exists:
- Base it on published performance data.
- Get expert opinion for the same or similar systems; this might help to bound or otherwise characterize the inputs.
- Use physical limitations to bound the problem (e.g., the maximum possible car arrival rate at an intersection is related to the minimum car spacing and the maximum velocity).

The nature of the process: use the descriptions of the various distributions to pick the one most closely related to the underlying input process. Uniform, triangular, and beta distributions are commonly used when nothing else is known (a minimal sketch follows).
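
For example, expert opinion of the form "at least a, usually about b, never more than c" maps directly onto a triangular distribution. A minimal sketch (Python/NumPy; the numbers are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Expert opinion only: service time is "at least 2, usually about 5, never above 12" minutes
    low, mode, high = 2.0, 5.0, 12.0
    service_times = rng.triangular(low, mode, high, size=1000)
    print(service_times.min(), service_times.mean(), service_times.max())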

Multivariate and Time-Series Input Models

The previous discussion dealt with independent processes. What if multiple inputs to the simulation are related to each other?
- Multivariate input models: a fixed, finite number of related random variables.
- Time-series input models: a sequence of related random variables.

There are two measures of linear dependence between random variables: covariance and correlation.

Multivariate and Time-Series Input Models (continued)

Let $X_1$ and $X_2$ be two random variables with $\mu_i = E(X_i)$ and $\sigma_i^2 = \mathrm{Var}(X_i)$. Covariance and correlation measure how well the relationship between $X_1$ and $X_2$ is described by the linear model
$(X_1 - \mu_1) = \beta\,(X_2 - \mu_2) + \varepsilon$,
where $\varepsilon$ is a zero-mean random variable independent of $X_2$.

Covariance: $\mathrm{cov}(X_1, X_2) = E[(X_1 - \mu_1)(X_2 - \mu_2)] = E(X_1 X_2) - \mu_1 \mu_2$
Correlation: $\rho = \mathrm{corr}(X_1, X_2) = \dfrac{\mathrm{cov}(X_1, X_2)}{\sigma_1 \sigma_2}$

Observations about cov() and corr(): covariance can take any real value, while correlation is standardized to lie in $[-1, 1]$; a value of 0 indicates no linear dependence.
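
A short numerical sketch of these definitions (Python/NumPy; the linear relationship and noise level are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Two related inputs: x2 depends linearly on x1 plus independent noise
    x1 = rng.normal(10.0, 2.0, size=1000)
    x2 = 0.5 * (x1 - x1.mean()) + rng.normal(0.0, 1.0, size=1000)

    cov = np.cov(x1, x2)[0, 1]                     # sample cov(X1, X2)
    corr = np.corrcoef(x1, x2)[0, 1]               # cov / (sigma1 * sigma2), in [-1, 1]
    print(cov, corr)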

Multivariate and Time-Series Input Models: Time Series

If $X_1, X_2, \dots$ are identically distributed (but possibly dependent), the sequence is referred to as a time series. $\mathrm{cov}(X_t, X_{t+h})$ and $\mathrm{corr}(X_t, X_{t+h})$ are the lag-$h$ covariance and correlation. If $\mathrm{cov}(X_t, X_{t+h})$ depends on $h$ and not on $t$, the time series is covariance-stationary, and its lag-$h$ correlation can be written
$\rho_h = \mathrm{corr}(X_t, X_{t+h})$.
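
A sketch of estimating lag-h correlations from a covariance-stationary series (Python/NumPy). The AR(1) process below is an illustrative choice, not from the slides; its true lag-h correlation is φ^h, which the estimates can be checked against:

    import numpy as np

    rng = np.random.default_rng(0)

    # AR(1) series: covariance-stationary, true lag-h correlation = phi**h
    phi, n = 0.8, 10_000
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()

    def lag_corr(series, h):
        # Sample lag-h correlation corr(X_t, X_{t+h})
        return np.corrcoef(series[:-h], series[h:])[0, 1]

    for h in (1, 2, 3):
        print(h, lag_corr(x, h), phi ** h)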

Homework 10: Ch. 9, p. 36, exercises 8, 9, 3