CTL.SC0x Supply Chain Analytics


Key Concepts Document V1.1

This document contains the Key Concepts documents for week 6, lessons 1 and 2 within the SC0x course. These are meant to complement, not replace, the lesson videos and slides. They are intended to be references for you to use going forward and are based on the assumption that you have learned the concepts and completed the practice problems.

The first draft was created by Dr. Alexis Bateman in the Fall of 2016. This is a draft of the material, so please post any suggestions, corrections, or recommendations to the Discussion Forum under the topic thread "Key Concept Documents Improvements."

Thanks,
Chris Caplice, Eva Ponce and the SC0x Teaching Community
Fall 2016 v1

v1.1 Fall 2016. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Week 6: Building Predictive Models

Learning Objectives
- Understand how to work with multiple variables.
- Be aware of data limitations with size and representation of population.
- Identify how to test a hypothesis.
- Review and apply the steps in the practice of regression.
- Be able to analyze regression and recognize issues.

Summary of Lesson
In this lesson we expanded our tool set of predictive models to include ordinary least squares regression. The lesson equips us with the tools to build, run and interpret a regression model. In the first lesson, we are introduced to working with multiple variables and their interaction. This includes correlation and covariance, which measure how two variables change together. As we review how to work with multiple variables, it is important to keep in mind that the data sets supply chain managers deal with are largely samples, not a population. This means that the subset of data must be representative of the population. The latter part of the lesson introduces hypothesis testing, which allows us to answer inferential questions about the data.

The second part of the week tackles linear regression. Regression is a very important practice for supply chain professionals because it allows us to take multiple random variables and find relationships between them. In some ways, regression becomes more of an art than a science. There are four main steps to regression: choosing which independent variables to include, collecting data, running the regression, and analyzing the output (the most important step).

Key Concepts

Multiple Random Variables
Most situations in practice involve the use and interaction of multiple random variables (RVs) or some combination of random variables. We need to be able to measure the relationship between these RVs as well as understand how they interact.

Covariance and Correlation
Covariance and correlation measure a certain kind of dependence between variables. If random variables are positively correlated, higher than average values of X are likely to occur with higher than average values of Y. For negatively correlated random variables, higher than average values of X are likely to occur with lower than average values of Y. It is important to remember, as the old but necessary saying goes: correlation does not equal causality. This means that you are finding a mathematical relationship, not a causal one.

Covariance:
$$\mathrm{COV}(X,Y) = \sum_{i=1}^{n} P(X = x_i, Y = y_i)\,\big[(x_i - \mu_X)(y_i - \mu_Y)\big]$$
When each of the n observed pairs is equally likely, this reduces to
$$\mathrm{COV}(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y)}{n}$$

Correlation Coefficient: used to standardize the covariance so that it is easier to interpret. It is a measure between -1 and +1 that indicates the degree and direction of the relationship between two random variables or sets of data.
$$\mathrm{CORR}(X,Y) = \frac{\mathrm{COV}(X,Y)}{\sigma_X \sigma_Y}$$

Spreadsheet Functions
Function      Microsoft Excel          Google Sheets            LibreOffice Calc
Covariance    =COVAR(array,array)      =COVAR(array,array)      =COVAR(array;array)
Correlation   =CORREL(array,array)     =CORREL(array,array)     =CORREL(array;array)

Linear Function of Random Variables
A linear relationship exists between X and Y when a one unit change in X causes Y to change by a fixed amount, regardless of how large or small X is. Formally, this is: $Y = aX + b$.

The summary statistics of a linear function of a random variable are:
- Expected value: $E[Y] = \mu_Y = a\mu_X + b$
- Variance: $\mathrm{VAR}[Y] = \sigma_Y^2 = a^2 \sigma_X^2$
- Standard deviation: $\sigma_Y = |a|\,\sigma_X$

Sums of Random Variables
If X and Y are random variables where $W = aX + bY$, then the summary statistics are:
- Expected value: $E[W] = a\mu_X + b\mu_Y$
- Variance: $\mathrm{VAR}[W] = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\mathrm{COV}(X,Y) = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\sigma_X\sigma_Y\,\mathrm{CORR}(X,Y)$
- Standard deviation: $\sigma_W = \sqrt{\mathrm{VAR}[W]}$

If X and Y are independent, COV(X,Y) = 0 and the last term of the variance drops out. These relations hold for any distribution of X and Y. However, if X and Y are ~N, then W is ~N as well!
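The covariance, correlation and sum-of-random-variables formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the demand and price figures are hypothetical, and the covariance divides by n, matching the equally-likely-pairs form of the formula.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Covariance of paired observations, each pair weighted 1/n."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Correlation coefficient: covariance standardized by both standard deviations."""
    sd_x = sqrt(covariance(xs, xs))
    sd_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sd_x * sd_y)


def weighted_sum_stats(a, b, mu_x, mu_y, sd_x, sd_y, corr_xy):
    """Mean, variance and standard deviation of W = aX + bY."""
    expected = a * mu_x + b * mu_y
    variance = (a ** 2 * sd_x ** 2 + b ** 2 * sd_y ** 2
                + 2 * a * b * sd_x * sd_y * corr_xy)
    return expected, variance, sqrt(variance)


# Hypothetical paired observations (not from the course material).
demand = [10, 12, 14, 16, 18]
price = [9, 8, 7, 6, 5]

print("COV :", covariance(demand, price))
print("CORR:", correlation(demand, price))
# Two uncorrelated demands: means 100 and 50, standard deviations 10 and 5.
print("W   :", weighted_sum_stats(1, 1, 100, 50, 10, 5, 0.0))
```

In this made-up data set price falls in lockstep with demand, so the correlation sits at the lower bound of -1.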

Central Limit Theorem
The central limit theorem states that the sampling distribution of the mean of independent random variables will be normal or nearly normal, if the sample size is large enough. "Large enough" depends on a few factors: one is the accuracy needed (more accuracy requires more sample points), another is the shape of the underlying population. Many statisticians suggest that 30, sometimes 40, is considered large enough. This is important because it does not matter what distribution the random variable follows.

It can be interpreted as follows:
- $X_1, \ldots, X_n$ are iid with mean = $\mu$ and standard deviation = $\sigma$
  o The sum of the n random variables is $S_n = \sum X_i$
  o The mean of the n random variables is $\bar{X} = S_n / n$
- Then, if n is large (say > 30)
  o $S_n$ is Normally distributed with mean = $n\mu$ and standard deviation $\sigma\sqrt{n}$
  o $\bar{X}$ is Normally distributed with mean = $\mu$ and standard deviation $\sigma/\sqrt{n}$

Inference Testing

Sampling
We need to know something about the sample to make inferences about the population. An inference is a conclusion reached on the basis of evidence and reasoning. To make inferences we need to ask testable questions, such as whether the data fit a specific distribution or whether two variables are correlated. To understand these questions and more, we need to understand sampling of a population. If sampling is done correctly, the sample mean should be an estimator of the population mean, and likewise for the other corresponding parameters.

- Population: the entire set of units of observation.
- Sample: a subset of the population.
- Parameter: describes the distribution of a random variable.
- Random Sample: a sample selected from the population so that each item is equally likely to be chosen.

Confidence Intervals
Confidence intervals are used to describe the uncertainty associated with a sample estimate of a population parameter.
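The uncertainty in a sample mean comes from the sampling distribution described by the Central Limit Theorem above. The small simulation below is a sketch under assumptions of my own choosing: the underlying population is Uniform(0, 1), the sample size is 36, and the function name and seed are arbitrary. It compares the observed spread of many sample means with the theoretical value σ/√n.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def simulate_sample_means(n, replications, seed=0):
    """Draw `replications` samples of size n from Uniform(0, 1); return their means."""
    rng = random.Random(seed)
    return [mean([rng.random() for _ in range(n)]) for _ in range(replications)]


n = 36
sample_means = simulate_sample_means(n, replications=2000)

mu = 0.5                 # mean of Uniform(0, 1)
sigma = sqrt(1 / 12)     # standard deviation of Uniform(0, 1)
theoretical_se = sigma / sqrt(n)
observed_se = pstdev(sample_means)

print("mean of sample means:", round(mean(sample_means), 4), "vs mu =", mu)
print("spread of sample means:", round(observed_se, 4),
      "vs sigma/sqrt(n) =", round(theoretical_se, 4))
```

Although the uniform population is flat rather than bell shaped, the sample means cluster around μ with a spread close to σ/√n.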

Calculating Confidence Intervals

When n > 30
We can assume: $\bar{X} \sim N(\mu, \sigma/\sqrt{n})$

The level $\beta$ of a confidence interval gives the probability that the interval produced includes the true value of the parameter. The interval is
$$\left[\bar{x} - \frac{z s}{\sqrt{n}},\ \bar{x} + \frac{z s}{\sqrt{n}}\right]$$
where z is the z score corresponding to the area $\beta$ around the mean: z = 1.65 for $\beta$ = .90, z = 1.96 for $\beta$ = .95, z = 2.81 for $\beta$ = .995.

For spreadsheets use: z = NORM.S.INV((1+β)/2)

When n ≤ 30
Then we need to use the t distribution, which is bell shaped and symmetric around 0. Its mean is 0, but its standard deviation is $\sqrt{k/(k-2)}$, where k is the degrees of freedom and, generally, k = n - 1. The interval is
$$\left[\bar{x} - \frac{c s}{\sqrt{n}},\ \bar{x} + \frac{c s}{\sqrt{n}}\right]$$
where c is the t statistic corresponding to the area $\beta$ around the mean. The value of c is a function of $\beta$ and k.
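As a rough sketch of the large-sample (n > 30) interval described above, the following uses Python's standard library in place of NORM.S.INV. The function name and the 36 transit-time values are hypothetical. The t-based interval is not shown because the standard library has no t distribution.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_confidence_interval(sample, beta):
    """Large-sample confidence interval for the mean at confidence level beta."""
    n = len(sample)
    z = NormalDist().inv_cdf((1 + beta) / 2)  # same role as NORM.S.INV((1+beta)/2)
    x_bar = mean(sample)
    half_width = z * stdev(sample) / sqrt(n)  # stdev() is the sample standard deviation s
    return x_bar - half_width, x_bar + half_width


# Hypothetical sample of 36 transit times (hours).
transit_times = list(range(83, 119))

low, high = z_confidence_interval(transit_times, beta=0.95)
print("95% interval:", round(low, 2), "to", round(high, 2))
```

Calling the function with a higher `beta` on the same sample returns a wider interval.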

For spreadsheets, use: c = T.INV.2T(1 - β, k)

There are some important insights for confidence intervals around the mean. There are tradeoffs between interval width (L), sample size (n) and confidence level (β):
- When n is fixed, using a higher confidence level β leads to a wider interval, L.
- When the confidence level β is fixed, increasing the sample size n leads to a smaller interval, L.
- When both n and the confidence level are fixed, we can obtain a tighter interval, L, by reducing the variability (i.e. smaller s and σ).

When interpreting confidence intervals, a few things to keep in mind:
- Repeatedly taking samples and finding confidence intervals leads to different intervals each time,
- But 100β% of the resulting intervals would contain the true mean.
- To construct a 100β% confidence interval that is within (+/-) L of μ, the required sample size is:
$$n = \frac{z^2 s^2}{L^2}$$

Hypothesis Testing
Hypothesis testing is a method for making a choice between two mutually exclusive and collectively exhaustive alternatives. In this practice, we make two hypotheses and only one can be true: the Null Hypothesis (H0) and the Alternative Hypothesis (H1). We test, at a specified significance level, to see if we can reject the Null Hypothesis or accept the Null Hypothesis (or more correctly, "do not reject").

Two types of mistakes in hypothesis testing:
- Type I: Reject the Null Hypothesis when in fact it is true (Alpha)
- Type II: Accept the Null Hypothesis when in fact it is false (Beta)
We focus on Type I errors when setting the significance level (.05, .01).

Three possible hypotheses or outcomes to a test:
- Unknown distribution is the same as the known distribution (always H0)
- Unknown distribution is higher than the known distribution
- Unknown distribution is lower than the known distribution

Chi square test
The chi square test can be used to measure the goodness of fit and determine whether the data is distributed normally. To use a chi square test, you typically create a set of c buckets (categories), count the expected and observed (actual) values in each category, calculate the chi square statistic and find the p value. If the p value is less than the level of significance, you then reject the null hypothesis.
$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}, \qquad df = c - 1$$
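The chi square statistic and its degrees of freedom can be computed directly from the bucket counts. The sketch below uses made-up counts for four buckets and a hypothetical function name. It stops short of the p value, which needs a chi square distribution function such as the spreadsheet functions or a statistics package.

```python
def chi_square_statistic(observed, expected):
    """Return the chi square statistic and degrees of freedom (c - 1) for c buckets."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of buckets")
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    degrees_of_freedom = len(observed) - 1
    return statistic, degrees_of_freedom


# Hypothetical counts: 100 orders across four buckets, expected to be evenly spread.
observed_counts = [18, 22, 29, 31]
expected_counts = [25, 25, 25, 25]

chi2, df = chi_square_statistic(observed_counts, expected_counts)
print("chi square =", round(chi2, 2), "with df =", df)
```

For these counts the statistic is about 4.4 with 3 degrees of freedom. That is below the 0.05 critical value of 7.815 for 3 degrees of freedom, so the null hypothesis would not be rejected at that level.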

Spreadsheet Functions (return the p value for the Chi Square Test):
Microsoft Excel      =CHISQ.TEST(observed_values, expected_values)
Google Sheets        =CHITEST(observed_values, expected_values)
LibreOffice Calc     =CHISQ.TEST(observed_values; expected_values)

Ordinary Least Squares Linear Regression
Regression is a statistical method that allows users to summarize and study relationships between a dependent (Y) variable and one or more independent (X) variables. The dependent variable Y is a function of the independent variables X. It is important to keep in mind that variables have different scales (nominal/ordinal/ratio). For linear regression, the dependent variable is always ratio scaled. The independent variables can be combinations of the different number types.

Linear Regression Model
The data $(x_i, y_i)$ are the observed pairs from which we try to estimate the $\beta$ coefficients to find the best fit. The error term, $\varepsilon$, is the unaccounted or unexplained portion.

Linear Model:
$$y_i = \beta_0 + \beta_1 x_i$$
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad \text{for } i = 1, 2, \ldots, n$$

Residuals
Because a linear regression model is not always appropriate for the data, you should assess the appropriateness of the model by examining the residuals. The difference between the observed value of the dependent variable and the predicted value is called the residual.
$$\hat{y}_i = b_0 + b_1 x_i \quad \text{for } i = 1, 2, \ldots, n$$
$$e_i = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i \quad \text{for } i = 1, 2, \ldots, n$$

Ordinary Least Squares (OLS) Regression
Ordinary least squares is a method for estimating the unknown parameters in a linear regression model. It finds the values of the coefficients ($b_0$ and $b_1$) that minimize the sum of the squares of the errors:
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$
The minimizing coefficients are:
$$b_0 = \bar{y} - b_1 \bar{x}$$
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
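The closed-form expressions for b0 and b1 above translate directly into code. The following is a minimal sketch for a single independent variable. The function name and the distance and cost figures are invented for illustration.

```python
def fit_simple_ols(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data: shipment distance (miles) and cost (dollars).
distance = [100, 200, 300, 400, 500]
cost = [250, 420, 610, 790, 980]

b0, b1 = fit_simple_ols(distance, cost)
predicted = [b0 + b1 * x for x in distance]
residuals = [y - y_hat for y, y_hat in zip(cost, predicted)]

print("b0 =", round(b0, 2), " b1 =", round(b1, 4))
print("residuals:", [round(e, 2) for e in residuals])
```

With an intercept in the model, the OLS residuals always sum to zero (up to rounding). This is a quick sanity check on any fit.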

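Multiple Variables
These relationships translate also to multiple variables.
$$Y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + \varepsilon_i \quad \text{for } i = 1, 2, \ldots, n$$
$$E(Y \mid x_1, x_2, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$$
$$\mathrm{StdDev}(Y \mid x_1, x_2, \ldots, x_k) = \sigma$$
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - \ldots - b_k x_{ki})^2$$

Validating a Model
All statistical software packages will provide statistics for evaluation (names and format will vary by package). The model output typically includes: model statistics (regression statistics or summary of fit), analysis of variance (ANOVA), and parameter statistics (coefficient statistics).

Overall Fit
Overall fit = how much of the variation in the dependent variable (y) can we explain?

Total variation of the dependent variable (CPL in the lesson): find the dispersion around the mean. This is the Total Sum of Squares:
$$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
Make an estimate of Y for each x. The remaining error is the Error or Residual Sum of Squares:
$$e_i = y_i - \hat{y}_i, \qquad RSS = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
The model explains a percentage of the total variation of the dependent variable. This is the Coefficient of Determination or Goodness of Fit ($R^2$):
$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$
Adjusted $R^2$ corrects for additional variables:
$$\mathrm{adj}R^2 = 1 - \frac{RSS}{TSS} \cdot \frac{n - 1}{n - k - 1}$$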
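The overall-fit measures above depend only on the observed values, the fitted values and the number of independent variables k. The sketch below computes them for a handful of hypothetical observed and fitted values. The function name and all numbers are assumptions for illustration.

```python
def goodness_of_fit(ys, y_hats, k):
    """Return (TSS, RSS, R^2, adjusted R^2) for a model with k independent variables."""
    n = len(ys)
    y_bar = sum(ys) / n
    tss = sum((y - y_bar) ** 2 for y in ys)                 # total sum of squares
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))  # residual sum of squares
    r_squared = 1 - rss / tss
    adj_r_squared = 1 - (rss / tss) * (n - 1) / (n - k - 1)
    return tss, rss, r_squared, adj_r_squared


# Hypothetical observed values and fitted values from a one-variable model (k = 1).
observed = [250, 420, 610, 790, 980]
fitted = [244, 427, 610, 793, 976]

tss, rss, r2, adj_r2 = goodness_of_fit(observed, fitted, k=1)
print("TSS =", tss, " RSS =", rss)
print("R^2 =", round(r2, 5), " adjusted R^2 =", round(adj_r2, 5))
```

Whenever RSS is greater than zero and k is at least 1, the adjusted R² is lower than R². This is the penalty for each additional variable.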

Individual Coefficients
Each independent variable (and $b_0$) will have:
- An estimate of the coefficient ($b_i$)
- A standard error ($s_{b_i}$), computed from:
  o $s_e$ = standard error of the model, $s_e = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$
  o $s_x$ = standard deviation of the independent variable
  o n = number of observations
- The t statistic, $t = b_i / s_{b_i}$
  o k = number of independent variables
  o $b_i$ = estimate or coefficient of the independent variable
- Corresponding p value
- Testing the Slope
  o We want to see if there is a linear relationship, i.e. we want to see if the slope ($b_1$) is something other than zero. So: H0: $b_1 = 0$ and H1: $b_1 \neq 0$
  o Confidence intervals estimate an interval for the slope parameter.

Multi-Collinearity, Autocorrelation and Heteroscedasticity
Multi-collinearity is when two or more variables in a multiple regression model are highly correlated. The model might have a high $R^2$ but the explanatory variables might fail the t test. It can also produce strange results for the correlated variables.

Autocorrelation is a characteristic of data in which the correlation between the values of the same variables is based on related objects. It is typically a time series issue.

Heteroscedasticity is when the variability of a variable is unequal across the range of values of a second variable that predicts it. Regression assumes that observations have the same variance. To spot the tell-tale signs, examine scatter plots and look for fan-shaped distributions.