Statistical Analysis of Engineering Data The Bare Bones Edition. Precision, Bias, Accuracy, Measures of Precision, Propagation of Error


Statistical Analysis of Engineering Data The Bare Bones Edition (I)
Precision, Bias, Accuracy, Measures of Precision, Propagation of Error

PRIOR TO DATA ACQUISITION ONE SHOULD CONSIDER:
1. The accuracy and/or precision required in your data and calculations should be consistent with the projected use of the result.
2. It is desirable to complete the investigation and obtain the required accuracy and/or precision with a minimum of time and expense.

TYPICAL QUESTIONS TO ASK:
1. What types of errors are likely to enter a given measurement or calculation?
2. To what extent do the errors in the data influence the error in a result calculated from the data? (propagation of error)
3. What accuracy and/or precision is it economically feasible to obtain?
4. Is absolute accuracy and/or precision needed, or merely trends or differences?

Types of uncertainties:

1. Random or indeterminate errors. The word Precision is used to describe random error. Precision is best quantified by the sample* variance (S²) or sample* standard deviation (S) of values obtained by repeating trials several times.

Example: n (= 5) repeated trials from the draining tank experiment yield the following data:

  Trial #   Drain time (s)
    1            25
    2            27
    3            24
    4            25
    5            30

sample mean or average:
  x_m = (1/n)(x_1 + x_2 + ... + x_n)                                    Eqn. 1
      = (1/5)(25 + 27 + 24 + 25 + 30) = 26.2 s

sample standard deviation:
  S = { [(x_1 - x_m)² + (x_2 - x_m)² + ...] / (n - 1) }^(1/2)           Eqn. 2
    = { (1/4)[(25-26.2)² + (27-26.2)² + (24-26.2)² + (25-26.2)² + (30-26.2)²] }^(1/2) = 2.4 s

* A sample represents a finite selection of items from an entire population. For instance, in the draining tank example, the entire population consists of infinitely many replicate determinations of draining time. Samples provide an image of the population. One of the severest limitations of a sample, as a representation of the population, is its size. However, the size of a sample is not the only factor affecting its merit as an image of the population.
It must be noted that x_m is the sample estimate of the population mean (μ) and S² is the sample estimate of the population variance (σ²).
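The sample statistics above are easy to verify in software. A minimal Python sketch (the variable names are mine, not from the handout):

```python
import statistics

# Drain times (s) from the five replicate trials above
times = [25, 27, 24, 25, 30]

n = len(times)
x_m = sum(times) / n           # sample mean, Eqn. 1
S = statistics.stdev(times)    # sample standard deviation, Eqn. 2 (n - 1 in the denominator)

print(round(x_m, 1))  # 26.2
print(round(S, 1))    # 2.4
```

Note that `statistics.stdev` uses the n - 1 (sample) denominator of Eqn. 2; `statistics.pstdev` would give the population form instead.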

2. Fixed or systematic errors - a consistent or repeated error. Systematic error is called Bias. Bias is determined by comparing the sample mean to the reference value of the measured property. Reference values can be (a) real, though unknown, true values; (b) assigned values, arrived at by agreement among experts; or (c) the result of a hypothetical experiment which is approximated by a sequence of actual experiments. Only type (b) reference values lend themselves to a mathematical description of bias, i.e. bias = sample mean - assigned value.

Total Accuracy** is the total absence of bias.
** Sometimes accuracy is defined as the total absence of error (both bias and random error).

STATISTICS deals only with precision, not with bias (i.e. accuracy).

Possible reasons for Poor Precision: fluctuations in the surroundings, fluctuations in the system, sloppy sampling or recording by a human, data taken by different people, fluctuations in the measuring device, the magnitude of the measured value, etc.

Possible reasons for Bias: improper calibration of the measuring device, a malfunction in the measuring device, consistent human error in sampling or measuring, etc.

The precision in the measurement of dependent and independent variables can only be assessed if replicate measurements are made at a single experimental condition (i.e. you must be able to compute a mean and a standard deviation of a measured value within a single experimental condition). Precision can be reported many different ways.

1) Standard deviation provides a measure of precision by indicating how far, on average, replicate observations are from their mean. The coefficient of variation is a convenient way to report precision based on the standard deviation:

  Coeff of variation = (std dev / mean) * 100

2) Confidence Intervals also provide a measure of our precision (and are related to the std dev). A confidence interval is the range of values between which the population mean can be expected to lie with a given certainty; it provides a probability statement about the likelihood of correctness. The two-sided 95% Confidence Interval (based on the Student-t distribution) is given by:

  x_m - t_0.025 * S/√n  ≤  μ  ≤  x_m + t_0.025 * S/√n                   Eqn. 3

where t_0.025 is the distribution factor for the Student-t distribution (obtainable from tables provided in most statistics textbooks).

Note: for a normally distributed variable, ~95% of all possible observations lie within two standard deviations to either side of the mean.

Note: Most spreadsheet and graphing software will compute sample statistics, including confidence intervals, when given data and associated information.
CHEG 237W - Excel calculation of sample statistics and confidence intervals
Draining Tank Experiment - Replicate Data for drain time of water from a tank
Run 1 - pipe length = 10", diameter = 0.25"

  Trial #   Time (s) Required to Drain 6" of fluid
    1            25
    2            27
    3            24
    4            25
    5            30

  sample size               5
  Average                   26.2         Excel command: =AVERAGE(list of numbers)
  Stand Dev                 2.4          Excel command: =STDEV(list of numbers)
  95% Confidence Interval   26.2 ± 2.1   Excel command: =CONFIDENCE(alpha, stdev, sample size)
                                         where alpha = (100 - confidence level desired)/100,
                                         i.e. alpha = 0.05 for 95% confidence

3) Error Bars. In Excel, the value of error bars on data points can be based on: 1) fixed values you define; 2) a % of each data point; 3) a specific number of std devs from the mean of the plotted values; 4) the std error of the plotted values; 5) custom values you define (i.e. you could put confidence values in here).
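Eqn. 3 can also be evaluated directly. The sketch below uses the t-table value t_0.025 = 2.776 for n - 1 = 4 degrees of freedom; note that Excel's legacy CONFIDENCE function is based on the normal distribution (z ≈ 1.96), which is why the sheet above reports the somewhat narrower ±2.1.

```python
import math
import statistics

times = [25, 27, 24, 25, 30]
n = len(times)
x_m = statistics.mean(times)
S = statistics.stdev(times)

t_025 = 2.776  # Student-t factor for a 95% two-sided CI, 4 degrees of freedom (from a t-table)
half_width = t_025 * S / math.sqrt(n)  # t_0.025 * S / sqrt(n), as in Eqn. 3

print(round(x_m - half_width, 1), round(x_m + half_width, 1))  # 23.2 29.2
```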

Propagation of error refers to uncertainty in a computed quantity (u) resulting from uncertainties in the primary measurements (x, y, z) on which it is based.

1. If u is a linear function of the primary measurements (x, y, z), i.e. u = ax + by + cz, and x, y, and z are statistically independent* measurements, the variance in u is given by:

  σ²(u) = σ²(ax + by + cz) = a² σ²(x) + b² σ²(y) + c² σ²(z)             Eqn. 4

* Statistical independence means that the error in the measurement of x has no effect upon the error made in the measurement of y and z, etc.

Example: The iron content of a copper-based alloy, determined by a back-titration method, is calculated according to the formula %Fe = 0.01117 (D - F), where D = ml of 0.01 N potassium dichromate used in the analysis and F = ml of 0.01 N ferrous-ammonium sulfate required for titration. Suppose D = 21.0 ml and F = 0.8 ml. Then %Fe = 0.01117(21.0 - 0.8) = 0.226. If the standard deviation of the titrations with either D or F is 0.1 ml, then the variance in the calculated %Fe is

  σ²(%Fe) = (0.01117)² (0.1)² + (0.01117)² (0.1)² = 2.4954x10⁻⁶
  σ(%Fe) = (2.4954x10⁻⁶)^(1/2) = 0.00158

  Coefficient of variation = [σ(%Fe) / %Fe] * 100 = (0.00158 / 0.226) * 100 = 0.70%

2. If u is a non-linear function of the primary measurements, for example u = xyz, and the measurements are statistically independent, a method of linearization based on a Taylor expansion can be used. Suppose u = f(x, y, z); then

  σ²(u) = (∂f/∂x)² σ²(x) + (∂f/∂y)² σ²(y) + (∂f/∂z)² σ²(z)              Eqn. 5

Note: this method is valid when σ(x) is of the order of 10% of x or smaller.

Example: Electric power is calculated using the formula P = V*I, where V = voltage = 100 V ± 2 V and I = current = 10 A ± 0.2 A.

  σ²(P) = (10)² (2)² + (100)² (0.2)² = 800
  σ(P) = 28.3

We would report the power as P = 1000 W ± 28.3 W. The coefficient of variation is 2.8%.
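The power example follows directly from Eqn. 5 with P = V·I, so ∂P/∂V = I and ∂P/∂I = V. A short sketch:

```python
import math

# P = V*I; linearized propagation (Eqn. 5):
# σ²(P) = (∂P/∂V)² σ²(V) + (∂P/∂I)² σ²(I) = I² σ²(V) + V² σ²(I)
V, sigma_V = 100.0, 2.0   # volts
I, sigma_I = 10.0, 0.2    # amperes

var_P = I**2 * sigma_V**2 + V**2 * sigma_I**2
sigma_P = math.sqrt(var_P)

print(round(var_P, 1))                    # 800.0
print(round(sigma_P, 1))                  # 28.3
print(round(sigma_P / (V * I) * 100, 1))  # 2.8  (coefficient of variation, %)
```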

II. Hypothesis Testing (comparing two averages) & Analysis of Variance (ANOVA) (comparing more than two averages)

Hypothesis testing is the formal procedure for determining statistical significance when comparing two methods, two operating conditions, two materials, etc. In other words, hypothesis testing tells us whether a change in the level of the independent variable resulted in a significant change in the dependent variable. Hypothesis tests and confidence intervals are closely related. The procedure is:

1. Formulate hypotheses
2. Select an appropriate distribution function for the test statistic
3. Specify the level of significance
4. Collect and analyze data and calculate an estimate of the test statistic
5. Define the region of rejection for the test statistic
6. Accept or reject hypotheses

Hypotheses (when used for comparing two populations) are statements that indicate the absence or presence of differences. Hypotheses consist of pairs of statements. The first hypothesis (the null hypothesis) states that the means of the two populations are equal:

  The two means are equal:         x_m1 = x_m2

The second hypothesis (the alternative hypothesis) can be stated various ways:

  The two means are different:     x_m1 ≠ x_m2
  Mean 1 is less than mean 2:      x_m1 < x_m2
  Mean 1 is greater than mean 2:   x_m1 > x_m2

A distribution function is a mathematical model describing the probability that a random variable X lies within the interval x1 to x2. The location, scale and shape of the function are usually characterized by one to three parameters (mean, variance, skewness). Uniform, normal, lognormal, Student-t, and chi-squared are some examples of continuous distribution functions. The test statistic is the distribution function that you choose to represent your particular testing requirements. When comparing random samples drawn from two normally distributed populations, the Student-t distribution is used:

  Student-t test statistic:  t = (x_m1 - x_m2) / [S_p (1/n1 + 1/n2)^0.5]     Eqn. 6

where the pooled variance S_p² is given by

  S_p² = [(n1 - 1) S1² + (n2 - 1) S2²] / (n1 + n2 - 2)                       Eqn. 7

S² is the sample variance, n is the sample size, and the subscripts 1 and 2 represent populations 1 and 2.

The level of significance (α) represents the probability of rejecting the null hypothesis when indeed it is true. Selection of the level of significance should be based on an evaluation of the consequences of making an incorrect conclusion (in your case, a false conclusion is not life threatening - or is it?!). Typical values for α are 0.05 or 0.01.

The region of rejection consists of those values of the test statistic that would be unlikely if the null hypothesis were true. For a hypothesis test of two means, the region of rejection is a function of the degrees of freedom (n1 + n2 - 2), the level of significance (α), and the statement of the alternative hypothesis:

  If the alternative hypothesis is:   then reject the null hypothesis if:
  x_m1 ≠ x_m2                         t < -t_α/2 or t > t_α/2
  x_m1 < x_m2                         t < -t_α
  x_m1 > x_m2                         t > t_α

t is the value calculated using Eqns. 6 & 7. t_α and t_α/2 are found in the Student-t table. Accept or reject the null hypothesis by comparing the sample estimate of the t statistic, calculated from experimental data, with the tabulated value of t_α or t_α/2.

Example: CHEG 237W - Excel calculation of sample statistics and hypothesis testing
Draining Tank Experiment - Data for drain time of water from tanks with different length outlet pipes

  Run 1 - pipe length = 10"      Run 2 - pipe length = 1"
  Trial #   Time (s)             Trial #   Time (s)
    1          25                  1          35
    2          27                  2          36
    3          24                  3          35
    4          25                  4          37
    5          30                  5          36

  sample size             5               5
  Average                 26.2            35.80
  Stand Dev               2.4             0.84
  Confidence Interval     26.2 ± 2.1      35.8 ± 0.7

Error bars represent the confidence intervals calculated above.

Hypothesis Test of drain times from 1" and 10" length pipes:

  Null hypothesis: the means of the two runs are equal
  Alternative hypothesis: the mean of the 10" pipe is less than the mean of the 1" pipe

  calculated t = (26.2 - 35.8) / [S_p (1/5 + 1/5)^0.5] = -8.48
  where S_p² = [(5-1)(2.39)² + (5-1)(0.84)²] / (5 + 5 - 2) = 3.2

  t from the t-table (for 5+5-2 = 8 degrees of freedom and level of significance α = 0.1):
  t_α = 1.397

  Test: reject the null hypothesis if t < -t_α.
  -8.48 is less than -1.397, so we can be 90% certain that the null hypothesis is false, i.e. the Run 1 drain time is significantly different from the Run 2 drain time.
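The same test can be sketched in Python using the unrounded standard deviations (so t comes out at -8.49 rather than the -8.48 above, which was computed from std devs rounded to 2.39 and 0.84):

```python
import math
import statistics

run1 = [25, 27, 24, 25, 30]   # 10" pipe
run2 = [35, 36, 35, 37, 36]   # 1" pipe

n1, n2 = len(run1), len(run2)
m1, m2 = statistics.mean(run1), statistics.mean(run2)
s1, s2 = statistics.stdev(run1), statistics.stdev(run2)

# Pooled variance (Eqn. 7) and Student-t statistic (Eqn. 6)
Sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t = (m1 - m2) / (math.sqrt(Sp2) * math.sqrt(1 / n1 + 1 / n2))

print(round(Sp2, 1))  # 3.2
print(round(t, 2))    # -8.49

# One-sided test at alpha = 0.1, 8 degrees of freedom: t < -1.397, so reject the null hypothesis
```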

It is often necessary to compare more than two means. The analysis of variance (ANOVA) methodology is used in these cases. Essentially, this analysis determines whether the discrepancies between the means are greater than could reasonably be expected from the variation within the samples. The following is an example from Introductory Statistics by Weiss for performing one-way ANOVA (comparing N means of a single dependent variable that result from varying N levels of a single independent variable, i.e. comparing 3 draining times resulting from 3 different outlet pipe lengths).
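The one-way ANOVA F statistic compares the between-group mean square to the within-group mean square. A sketch using the two pipe-length runs above plus an invented third data set (the 5" times are hypothetical, added only to make the three-group comparison concrete):

```python
import statistics

# One-way ANOVA by hand: between-group vs within-group mean squares.
groups = {
    '10"': [25, 27, 24, 25, 30],
    '1"':  [35, 36, 35, 37, 36],
    '5"':  [30, 31, 29, 32, 30],   # hypothetical data, for illustration only
}

k = len(groups)
all_vals = [v for g in groups.values() for v in g]
n_total = len(all_vals)
grand_mean = statistics.mean(all_vals)

ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                 for g in groups.values())
ss_within = sum((v - statistics.mean(g)) ** 2
                for g in groups.values() for v in g)

ms_between = ss_between / (k - 1)        # treatment mean square
ms_within = ss_within / (n_total - k)    # error mean square
F = ms_between / ms_within

print(round(F, 1))  # 45.1
# Compare F with the tabulated F value for (k-1, n_total-k) degrees of freedom
# at the chosen level of significance; reject the null hypothesis if F exceeds it.
```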

III. Data Regression and Correlation

After data are taken and sample statistics computed (mean, standard deviation, confidence intervals, test of significance), the data must be analyzed. You need to make sense of the data, explain the physical phenomena that are occurring, and show how your data conform or do not conform to theory or other experimental work. Before data regression and correlation, you should anticipate the types of behavior you expect to observe between independent and dependent variables. For instance, a popular empirical equation used to relate pressure drop in pipes to fluid and pipe properties is ΔP/ρ = (4fL/D)(V²/2); therefore one might expect the measured pressure drop to be proportional to V² for constant density, diameter, length, and friction factor.

Data Regression refers to the process of statistically deriving a mathematical relationship between a dependent variable and measured values of independent variables, i.e. deriving an empirical model. Linear regression using the least-squares criterion is an example of model development from experimental data. The utility of any model for making predictions (be it a regression model or a model based on physical laws or theory) can be assessed using correlation methods, hypothesis tests, or prediction intervals, which are similar to confidence intervals.

Data Correlation provides information about the strength and type of relationship between two or more variables, or the strength of the relationship between observed values and predicted values, i.e. the degree to which one or more variables can be used to predict the values of another variable.

1) The coefficient of determination, r², is the (sum of the squared errors between the predicted values and the mean) / (sum of the squared errors between the observed values and the mean). It is the squared value of r, the correlation coefficient.
2) Hypothesis testing (performed on the slope, or on r or r²)
3) Prediction intervals

A few examples of data regression and correlation are given below.

(1) Fitting a line to experimental data (calculating slope and intercept)
Example: calibrating a rotameter (rotameter reading vs flowrate)
Solution Method: linear regression (method of least squares) with determination of variance, confidence intervals**, and residual plots - plots of [measured values of the dependent variable minus predicted values of the dependent variable] vs the independent variable. Residual plots should fall roughly in a horizontal band centered and symmetric about the x-axis. A normal probability plot of residuals should be roughly linear.

(2) Fitting curves to experimental data** (calculating polynomial coefficients and/or model parameters)
Example: developing an expression for the heat capacity of a gas as a function of temperature
Solution Method: polynomial and/or multiple linear and nonlinear regression with determination of variance, confidence intervals, and residual plots

(3) Estimating parameters in existing models from raw data** (finding the best-fit coefficients for existing models)

Example: finding Antoine constants [log p = A + B/(T+C)] from vapor pressure vs temperature data
Solution Method: nonlinear regression of a general algebraic expression with determination of variance, confidence intervals, and residual analysis

(4) Comparing models derived from experimental data to existing models**
Example: correlation of heat transfer data using dimensionless groups, followed by comparison to existing empirical models
Solution Method: linear and/or non-linear regression with linearization and transformation functions, determination of variance, confidence intervals, and residual analysis

(5) Comparing the effects of various independent variables on the dependent variable (determining the relative significance of independent variables)
Example: comparing the effects of pipe diameter, pipe length, and viscosity on draining time
Solution Method: analysis of variance (ANOVA), equivalent to regression analysis, with determination of variance, confidence intervals, and residual analysis

** Polymath example attached
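A least-squares line fit of the kind used in example (1) can be written out from the normal equations. The calibration points below are invented for illustration (they lie exactly on flowrate = 2·reading + 1, so the residuals are zero and r² = 1):

```python
# Hypothetical rotameter-calibration data: reading (independent) vs flowrate (dependent)
readings  = [1.0, 2.0, 3.0, 4.0, 5.0]
flowrates = [3.0, 5.0, 7.0, 9.0, 11.0]

n = len(readings)
sx, sy = sum(readings), sum(flowrates)
sxx = sum(x * x for x in readings)
sxy = sum(x * y for x, y in zip(readings, flowrates))

# Least-squares slope and intercept (normal equations)
slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
intercept = (sy - slope * sx) / n

# Residuals (measured minus predicted) and coefficient of determination r²
predicted = [slope * x + intercept for x in readings]
residuals = [y - p for y, p in zip(flowrates, predicted)]
y_mean = sy / n
ss_res = sum(r * r for r in residuals)
ss_tot = sum((y - y_mean) ** 2 for y in flowrates)
r2 = 1 - ss_res / ss_tot

print(slope, intercept, round(r2, 3))  # 2.0 1.0 1.0
```

With real data the residuals would be nonzero, and plotting them against the readings gives the residual plot described above.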