Alan Bundy University of Edinburgh (Slides courtesy of Paul Cohen)

Empirical Methods

Outline Lesson 1: Evaluation begins with claims. Lesson 2: Exploratory data analysis means looking beneath results for reasons. Lesson 3: Run pilot experiments. Lesson 4: Control sample variance, rather than increase sample size. Lesson 5: Check that the result is significant.

Lesson 1: Evaluation begins with claims The most important, most immediate and most neglected part of evaluation plans. What you measure depends on what you want to know, on what you claim. Claims:
- X is bigger/faster/stronger than Y
- X varies linearly with Y in the range we care about
- X and Y agree on most test items
- It doesn't matter who uses the system (no effects of subjects)
- My algorithm scales better than yours (e.g., the relationship between size and runtime depends on the algorithm)
Non-claim: "I built it and it runs fine on some test data."

Case Study: Comparing two algorithms Scheduling jobs on a ring network of processors; jobs spawned as binary trees. KOSO: keep one job, send one to my left or right neighbour, chosen arbitrarily. KOSO*: keep one job, send one to my least heavily loaded neighbour. Theoretical analysis went only so far; for unbalanced trees and other conditions it was necessary to test KOSO and KOSO* empirically. An Empirical Study of Dynamic Scheduling on Rings of Processors, Gregory, Gao, Rosenberg & Cohen, Proc. of 8th IEEE Symp. on Parallel & Distributed Processing, 1996.

Evaluation begins with claims Hypothesis (or claim): KOSO takes longer than KOSO* because KOSO* balances loads better. The "because" phrase indicates a hypothesis about why it works. This is a better hypothesis than the "beauty contest" demonstration that KOSO* beats KOSO. Experiment design. Independent variables: KOSO v KOSO*, no. of processors, no. of jobs, probability that a job will spawn. Dependent variable: time to complete jobs.

Useful Terms Independent variable: A variable that indicates something you manipulate in an experiment, or some supposedly causal factor that you can't manipulate, such as gender (also called a factor). Dependent variable: A variable that indicates to greater or lesser degree the causal effects of the factors represented by the independent variables. (Diagram: causal factors F1, F2; independent variables X1, X2; dependent variable Y.)

Initial Results Mean time to complete jobs: KOSO: 2825 (the "dumb" algorithm); KOSO*: 2935 (the "load balancing" algorithm). KOSO is actually 4% faster than KOSO*! This difference is not statistically significant (more about this later). What happened?

Lesson 2: Exploratory data analysis means looking beneath results for reasons (Plots: time series of queue length at processor i for KOSO and KOSO*.) Unless processors starve (red arrow) there is no advantage to good load balancing (i.e., KOSO* is no better than KOSO).

Useful Terms Time series: One or more dependent variables measured at consecutive time points. (Plot: time series of queue length at processor "red" under KOSO.)

Lesson 2: Exploratory data analysis means looking beneath results for reasons KOSO* is statistically no faster than KOSO. Why? (Plots: frequency distributions of time to complete jobs, in thousands, for KOSO and KOSO*.) Outliers dominate the means, so the test isn't significant.

Useful Terms Frequency distribution: The frequencies with which the values in a distribution occur (e.g., the frequencies of all the values of "age" in the room). Outlier: Extreme, low-frequency values. Mean: The average. Means are very sensitive to outliers. (Plot: a frequency distribution with its extreme, infrequent values marked as outliers.)

More exploratory data analysis Mean time to complete jobs: KOSO: 2825; KOSO*: 2935. Median time to complete jobs: KOSO: 498.5; KOSO*: 447.0. Looking at means (which are dominated by outliers), KOSO* is 4% slower; but looking at medians (which are robust against outliers), it is 11% faster.

Useful Terms Median: The value which splits a sorted distribution in half; the 50th quantile of the distribution. Quantile: A "cut point" q that divides the distribution into pieces of size q/100 and 1 - (q/100). Examples: the 50th quantile cuts the distribution in half; the 25th quantile cuts off the lower quartile; the 75th quantile cuts off the upper quartile.
1 2 3 7 7 8 14 15 17 21 22 (Mean: 10.6, Median: 8)
1 2 3 7 7 8 14 15 17 21 22 1000 (Mean: 93.1, Median: 11)
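The slide's numbers can be reproduced with Python's standard library; a minimal sketch using the `statistics` module:

```python
# Illustrates the slide's example: a single outlier drags the mean
# far from the "typical" value, while the median barely moves.
from statistics import mean, median

data = [1, 2, 3, 7, 7, 8, 14, 15, 17, 21, 22]
print(round(mean(data), 1), median(data))                  # 10.6 8

with_outlier = data + [1000]
print(round(mean(with_outlier), 1), median(with_outlier))  # 93.1 11.0
```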

How are we doing? Hypothesis (or claim): KOSO takes longer than KOSO* because KOSO* balances loads better. Mean KOSO is shorter than mean KOSO*, median KOSO is longer than median KOSO*, and there is no evidence that load balancing helps, because there is almost no processor starvation in this experiment. Now what? (Plots: queue length at processor i over time for KOSO and KOSO*.)

Lesson 3: Always run pilot experiments A pilot experiment is designed less to test the hypothesis than to test the experimental apparatus, to see whether it can test the hypothesis. Our independent variables were not set in a way that produced processor starvation, so we couldn't test the hypothesis that KOSO* is better than KOSO because it balances loads better. Use pilot experiments to adjust independent and dependent measures, to see whether the protocol works, and to provide preliminary data on which to try out your statistical analysis; in short, to test the experiment design.

Next steps in the KOSO / KOSO* saga (Plots: queue length at processor i over time for KOSO and KOSO*.) It looks like KOSO* does balance loads better (less variance in the queue length), but without processor starvation there is no effect on run-time. Cohen ran another experiment, varying the number of processors in the ring: 3, 9, 10 and 20. Once again, there was no significant difference in run-time. Why? Problem variance dominates algorithm variance.

Causes of Variance (Plots: hypothetical runtime distributions for KOSO and KOSO*.) With constant runtimes, the variance in runtime would be due only to the difference between the algorithms. If runtimes were variable due to one cause, say job spawning, the algorithms would still be easy to distinguish. But runtimes are variable due to several causes, i.e. the probability of job spawning, the number of processors and the number of jobs, so the variance in runtime is quite high and the algorithms are difficult to distinguish.

Lesson 4: Control sample variance Suppose you are interested in which algorithm runs faster on a batch of problems, but the run time depends more on the problems than on the algorithms: the mean difference looks small relative to the variability of run time. (Plot: run times for Algorithm 1 and Algorithm 2.) You don't care very much about the problems, so you'd like to transform run time to "correct" the influence of the problem. This is one kind of variance-reducing transform.

What causes run times to vary so much? Run time depends on the number of processors and on the number of jobs (size), and the relationships between these and run time are different for KOSO and KOSO*. (Plots: run time vs. problem size for rings of 3, 6, 10 and 20 processors; green: KOSO, red: KOSO*.)

What causes run times to vary so much? (Plot: run time vs. number of processors for KOSO and KOSO*.) Run time decreases with the number of processors, and KOSO* appears to use them better, but the variance is still very high (wide confidence intervals). Can we transform run time with some function of the number of processors and the problem size?

Transforming run time Assume each task takes unit time. Let S be the number of tasks to be done, N the number of processors to do them, and T the time required to do them all (run time). So k_i = S_i / N_i is the best possible run time on trial i, i.e., perfect use of parallelism, and T_i / k_i measures deviation from perfection. The transform we want is R_i = (T_i * N_i) / S_i: run time restated to be independent of problem size and number of processors.
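The transform is a one-liner; a sketch below, where the function name and the sample numbers are my own, chosen for illustration:

```python
# Sketch of the slide's variance-reducing transform: R = (T * N) / S.
# R == 1.0 means perfect use of parallelism; larger R means overhead.
def efficiency_ratio(run_time: float, processors: int, tasks: int) -> float:
    best_possible = tasks / processors   # k = S / N, assuming unit-time tasks
    return run_time / best_possible      # T / k, equivalently (T * N) / S

# Hypothetical trials: 100 unit-time tasks on a 10-processor ring.
print(efficiency_ratio(run_time=10.0, processors=10, tasks=100))  # 1.0, perfect
print(efficiency_ratio(run_time=16.1, processors=10, tasks=100))  # 1.61
```

Because R is dimensionless, trials with different sizes and ring configurations become directly comparable, which is exactly what the next slide's tables rely on.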

Lesson 5: Check that the result is significant
        Mean   Median
KOSO    1.61   1.18
KOSO*   1.40   1.03
By the median, KOSO* is almost perfectly efficient (R close to 1). (Plot: CDF(R, KOSO) and CDF(R, KOSO*) against the number of trials.)

Useful terms Cumulative Distribution Function: A "running sum" of all the quantities in the distribution: 7 2 5 3 => 7 9 14 17. (Plot: CDF(R, KOSO) and CDF(R, KOSO*) against the number of trials.)
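The slide's running-sum example can be checked with the standard library's `itertools.accumulate`:

```python
# Running sum: 7 2 5 3 => 7 9 14 17
from itertools import accumulate

print(list(accumulate([7, 2, 5, 3])))  # [7, 9, 14, 17]
```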

A statistically significant difference!
        Mean   Standard deviation
KOSO    1.61   0.78
KOSO*   1.40   0.7
Two-sample t test: t is the difference between the means divided by the estimated standard error of that difference; the p value is the probability of this result if the difference between the means were truly zero.

The two-sample t test
        Mean   Standard deviation
KOSO    1.61   0.78
KOSO*   1.40   0.7
t = (mean_KOSO - mean_KOSO*) / s.e., where the standard error of the difference is estimated as s.e. = sqrt(s_KOSO^2 / n_KOSO + s_KOSO*^2 / n_KOSO*).

The logic of statistical hypothesis testing 1. Assume KOSO = KOSO*. 2. Run an experiment to find the sample statistics R_KOSO = 1.61, R_KOSO* = 1.40, and their difference δ = 0.21. 3. Find the distribution of δ under the assumption KOSO = KOSO*. 4. Use this distribution to find the probability p of δ = 0.21 if KOSO = KOSO*. 5. If the probability is very low (it is: p < .02), reject KOSO = KOSO*. 6. p < .02 is your residual uncertainty that KOSO might equal KOSO*.
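The arithmetic can be checked from the summary statistics alone. A caveat: the per-algorithm sample size is not stated on the slides, so n = 150 below is an assumption (read off the trials axis of the CDF plot), and the p value uses a normal approximation to the t distribution; treat this as a sketch rather than the original analysis.

```python
# Two-sample t statistic from summary statistics (Welch-style standard error).
# ASSUMPTION: n = 150 trials per algorithm; not stated on the slide.
import math

mean_koso, sd_koso = 1.61, 0.78
mean_koso_star, sd_koso_star = 1.40, 0.70
n = 150  # hypothetical sample size per algorithm

se = math.sqrt(sd_koso ** 2 / n + sd_koso_star ** 2 / n)
t = (mean_koso - mean_koso_star) / se

# Two-sided p value, normal approximation (fine for large n).
p = 2 * (1 - 0.5 * (1 + math.erf(t / math.sqrt(2))))
print(round(t, 2), round(p, 3))  # t is about 2.45, and p < .02 as on the slide
```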

Useful terms 1. Assume KOSO = KOSO*. This is called the null hypothesis (H_0), and typically it is the inverse of the alternative hypothesis (H_1), which is what you want to show. 2. Run an experiment to get the sample statistics R_KOSO = 1.61, R_KOSO* = 1.40, and their difference δ = 0.21. 3. Find the distribution of δ under the assumption KOSO = KOSO*. This is called the sampling distribution of the statistic under the null hypothesis. 4. Use this distribution to find the probability of δ = 0.21 given H_0. 5. If the probability is very low, reject KOSO = KOSO*. This is called rejecting the null hypothesis. 6. p is your residual uncertainty. This p value is the probability of incorrectly rejecting H_0.

Useful terms 3. Find the distribution of δ (the difference between the means) under the assumption KOSO = KOSO*. 4. Use this distribution to find the probability of δ = 0.21 given H_0. The distribution in step 3 is the sampling distribution of the statistic; its standard deviation is called the standard error. Statistical tests transform statistics like δ into standard error (s.e.) units, and it's easy to find the region of a distribution bounded by k standard error units. E.g., about 2.5% of the normal (Gaussian) distribution lies above 1.96 s.e. units.
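That tail area is quick to verify with the standard library (a sketch; `math.erf` gives the normal CDF):

```python
# Upper-tail area of the standard normal beyond 1.96 standard errors.
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

print(round(1 - normal_cdf(1.96), 4))  # 0.025
```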

Conclusion Clarify your claim before you start. Make sure your experiment is capable of evaluating your claim (run pilots). Explore beneath the results to understand what is going on. Design your statistical analysis to control unwanted variance. Support your claim by showing that the null hypothesis is very unlikely.