CTL.SC0x Supply Chain Analytics
Key Concepts Document V1.1

This document contains the Key Concepts documents for week 6, lessons 1 and 2 within the SC0x course. These are meant to complement, not replace, the lesson videos and slides. They are intended to be references for you to use going forward and are based on the assumption that you have learned the concepts and completed the practice problems.

The first draft was created by Dr. Alexis Bateman in the Fall of 2016.

This is a draft of the material, so please post any suggestions, corrections, or recommendations to the Discussion Forum under the topic thread "Key Concept Documents Improvements".

Thanks,
Chris Caplice, Eva Ponce and the SC0x Teaching Community
Fall 2016 v1
v1.1 Fall 2016

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Week 6: Building Predictive Models

Learning Objectives
• Understand how to work with multiple variables.
• Be aware of data limitations with size and representation of population.
• Identify how to test a hypothesis.
• Review and apply the steps in the practice of regression.
• Be able to analyze regression and recognize issues.

Summary of Lesson
In this lesson we expanded our tool set of predictive models to include ordinary least squares regression. The lesson equips us with the tools to build, run and interpret a regression model. In the first lesson, we are introduced to working with multiple variables and their interaction. This includes correlation and covariance, which measure how two variables change together. As we review how to work with multiple variables, it is important to keep in mind that the data sets supply chain managers deal with are largely samples, not a population. This means that the subset of data must be representative of the population. The later part of the lesson introduces hypothesis testing, which allows us to test inferences about the data.

The second part of the week tackles linear regression. Regression is a very important practice for supply chain professionals because it allows us to take multiple random variables and find relationships between them. In some ways, regression becomes more of an art than a science. There are four main steps to regression: choosing which independent variables to include, collecting data, running the regression, and analyzing the output (the most important step).

Key Concepts

Multiple Random Variables
Most situations in practice involve the use and interaction of multiple random variables or some combination of random variables. We need to be able to measure the relationship between these RVs as well as understand how they interact.

Covariance and Correlation
Covariance and correlation measure a certain kind of dependence between variables. If random variables are positively correlated, higher than average values of X are likely to occur with higher than average values of Y.
For negatively correlated random variables, higher than average values of X are likely to occur with lower than average values of Y. It is important to remember, as the old but necessary saying goes, that correlation does not equal causality. This means that you are finding a mathematical relationship, not a causal one.

Covariance:

$$\mathrm{Cov}(X,Y) = \sum_{i=1}^{n} P(X = x_i, Y = y_i)\,\big[(x_i - \mu_X)(y_i - \mu_Y)\big] = \frac{\sum_{i=1}^{n} (x_i - \mu_X)(y_i - \mu_Y)}{n}$$

The second form applies when each of the n observed pairs is equally likely.

Correlation Coefficient: used to standardize the covariance so that it is easier to interpret. It is a measure between -1 and +1 that indicates the degree and direction of the relationship between two random variables or sets of data.

$$\mathrm{CORR}(X,Y) = \frac{\mathrm{COV}(X,Y)}{\sigma_X \sigma_Y}$$

Spreadsheet Functions

Function | Microsoft Excel | Google Sheets | LibreOffice Calc
Covariance | =COVAR(array,array) | =COVAR(array,array) | =COVAR(array;array)
Correlation | =CORREL(array,array) | =CORREL(array,array) | =CORREL(array;array)

Linear Function of Random Variables
A linear relationship exists between X and Y when a one unit change in X causes Y to change by a fixed amount, regardless of how large or small X is. Formally, this is: $Y = aX + b$.

The summary statistics of a linear function of a random variable are:
• Expected value: $E[Y] = \mu_Y = a\mu_X + b$
• Variance: $\mathrm{VAR}[Y] = \sigma_Y^2 = a^2 \sigma_X^2$
• Standard deviation: $\sigma_Y = |a|\,\sigma_X$

Sums of Random Variables
If X and Y are random variables and $W = aX + bY$, then the summary statistics are:
• Expected value: $E[W] = a\mu_X + b\mu_Y$
• Variance: $\mathrm{VAR}[W] = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\mathrm{COV}(X,Y) = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\sigma_X\sigma_Y\,\mathrm{CORR}(X,Y)$
• Standard deviation: $\sigma_W = \sqrt{\mathrm{VAR}[W]}$

If X and Y are independent, COV(X,Y) = 0 and the last term of the variance drops out. These relations hold for any distribution of X and Y. However, if X and Y are independent and Normally distributed, then W is Normally distributed as well.
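The covariance, correlation, and sum-of-random-variables formulas above can be checked numerically. The following is a minimal sketch using only the Python standard library; the two demand series and the weights a = 2 and b = 3 are hypothetical values chosen for illustration, and the function names are not from the course material. It uses the population form (dividing by n), which matches the equal-probability covariance formula above.

```python
import math


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance standardized by both standard deviations."""
    sx = math.sqrt(covariance(xs, xs))  # Cov(X, X) is the variance of X
    sy = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (sx * sy)


def linear_combination_stats(xs, ys, a, b):
    """Mean and variance of W = aX + bY from the summary-statistic formulas."""
    mean_w = a * mean(xs) + b * mean(ys)
    var_w = (a ** 2 * covariance(xs, xs)
             + b ** 2 * covariance(ys, ys)
             + 2 * a * b * covariance(xs, ys))
    return mean_w, var_w


# Hypothetical weekly demand for two products (illustration only).
demand_a = [10.0, 12.0, 9.0, 14.0, 15.0]
demand_b = [20.0, 25.0, 19.0, 27.0, 29.0]

print("Cov :", covariance(demand_a, demand_b))
print("Corr:", correlation(demand_a, demand_b))
print("E[W], VAR[W] for W = 2X + 3Y:",
      linear_combination_stats(demand_a, demand_b, 2.0, 3.0))
```

Computing W = 2X + 3Y observation by observation and taking its mean and variance directly gives the same two numbers, which is a quick way to convince yourself that the covariance term belongs in the variance formula.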
Central Limit Theorem
The central limit theorem states that the sampling distribution of the mean of independent random variables will be normal or nearly normal if the sample size is large enough. How large is "large enough" depends on a few factors: one is the accuracy required (more accuracy requires more sample points), another is the shape of the underlying population. Many statisticians suggest that 30, sometimes 40, is considered large enough. This is important because it does not matter what distribution the random variable follows.

It can be interpreted as follows:
• $X_1, \ldots, X_n$ are iid with mean = $\mu$ and standard deviation = $\sigma$
  o The sum of the n random variables is $S_n = \sum X_i$
  o The mean of the n random variables is $\bar{X} = S_n / n$
• Then, if n is large (say > 30)
  o $S_n$ is Normally distributed with mean = $n\mu$ and standard deviation $\sigma\sqrt{n}$
  o $\bar{X}$ is Normally distributed with mean = $\mu$ and standard deviation $\sigma/\sqrt{n}$

Inference Testing

Sampling
We need to know something about the sample to make inferences about the population. An inference is a conclusion reached on the basis of evidence and reasoning. To make inferences we need to ask testable questions, such as whether the data fits a specific distribution or whether two variables are correlated. To answer these questions and more, we need to understand sampling of a population. If sampling is done correctly, the sample mean should be an estimator of the population mean, and the same holds for the other sample statistics and their corresponding parameters.

• Population: the entire set of units of observation.
• Sample: a subset of the population.
• Parameter: describes the distribution of a random variable.
• Random Sample: a sample selected from the population so that each item is equally likely to be selected.

Confidence Intervals
Confidence intervals are used to describe the uncertainty associated with a sample estimate of a population parameter.
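The uncertainty in a sample estimate can be made concrete with a small simulation. The sketch below, which is an illustration and not part of the course material, repeatedly draws samples of size n = 36 from a Uniform(0, 1) population (a deliberately non-normal choice, assumed here for convenience) and compares the spread of the sample means with the σ/√n predicted by the central limit theorem. The seed, sample size, and number of trials are arbitrary assumptions.

```python
import math
import random
import statistics


def simulate_sample_means(n, trials, seed=42):
    """Draw `trials` samples of size n from Uniform(0, 1); return their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [statistics.fmean(rng.random() for _ in range(n))
            for _ in range(trials)]


n = 36
sample_means = simulate_sample_means(n, trials=2000)

population_mean = 0.5                 # mean of Uniform(0, 1)
population_sigma = math.sqrt(1 / 12)  # standard deviation of Uniform(0, 1)

predicted_spread = population_sigma / math.sqrt(n)  # sigma / sqrt(n)
observed_spread = statistics.pstdev(sample_means)

print("mean of sample means:", statistics.fmean(sample_means))
print("predicted std dev   :", predicted_spread)
print("observed std dev    :", observed_spread)
```

The observed spread should land close to the predicted one, and a histogram of `sample_means` would look roughly bell shaped even though the underlying population is flat.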
Calculating Confidence Intervals

When n > 30
We can assume: $\bar{X} \sim N(\mu, \sigma/\sqrt{n})$

The level β of a confidence interval gives the probability that the interval produced includes the true value of the parameter. The interval is

$$\left[\bar{x} - \frac{z s}{\sqrt{n}},\ \ \bar{x} + \frac{z s}{\sqrt{n}}\right]$$

where z is the z score corresponding to the area β around the mean: z = 1.65 for β = .90, z = 1.96 for β = .95, z = 2.81 for β = .995.

For spreadsheets use: z = NORM.S.INV((1+β)/2)

When n ≤ 30
Then we need to use the t distribution, which is bell shaped and symmetric around 0. Its mean = 0, but its Std Dev = $\sqrt{k/(k-2)}$, where k is the degrees of freedom and, generally, k = n - 1. The interval is

$$\left[\bar{x} - \frac{c s}{\sqrt{n}},\ \ \bar{x} + \frac{c s}{\sqrt{n}}\right]$$

where c is the t statistic corresponding to the area β around the mean. The value of c is a function of β and k.
For spreadsheets, use: c = T.INV.2T(1 - β, k)

There are some important insights for confidence intervals around the mean. There are tradeoffs between interval (L), sample size (n) and confidence (b, the level written as β in the spreadsheet formula above):
• When n is fixed, using a higher confidence level b leads to a wider interval, L.
• When the confidence level (b) is fixed, increasing the sample size n leads to a smaller interval, L.
• When both n and the confidence level are fixed, we can obtain a tighter interval, L, by reducing the variability (i.e., smaller σ and s).

When interpreting confidence intervals, a few things to keep in mind:
• Repeatedly taking samples and finding confidence intervals leads to different intervals each time,
• but a fraction b of the resulting intervals (for example, 95% when b = .95) would contain the true mean.
• To construct a confidence interval at level b that is within ±L of μ, the required sample size is: $n = z^2 s^2 / L^2$

Hypothesis Testing
Hypothesis testing is a method for making a choice between two mutually exclusive and collectively exhaustive alternatives. In this practice, we make two hypotheses and only one can be true: the Null Hypothesis ($H_0$) and the Alternative Hypothesis ($H_1$). We test, at a specified significance level, to see if we can reject the null hypothesis or accept the null hypothesis (or, more correctly, "do not reject" it).

Two types of mistakes in hypothesis testing:
• Type I: Reject the null hypothesis when in fact it is true (Alpha)
• Type II: Accept the null hypothesis when in fact it is false (Beta)
We focus on Type I errors when setting the significance level (.05, .01).

Three possible hypotheses or outcomes to a test:
• Unknown distribution is the same as the known distribution (always $H_0$)
• Unknown distribution is higher than the known distribution
• Unknown distribution is lower than the known distribution

Chi square test
The chi square test can be used to measure goodness of fit, for example to determine whether the data is distributed normally. To use a chi square test, you typically create a set of c categories (buckets), count the expected and observed (actual) values in each category, calculate the chi square statistic, and find the p value. If the p value is less than the level of significance, you then reject the null hypothesis.

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}, \qquad df = c - 1$$
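The chi square calculation above can be sketched in a few lines of Python. The observed and expected counts below are hypothetical, and the function name is an assumption made for illustration. The Python standard library has no chi square distribution, so instead of computing a p value this sketch compares the statistic with the 5% critical value for 3 degrees of freedom (7.815, taken from a standard chi square table); that comparison leads to the same reject or do-not-reject decision as comparing the p value with .05.

```python
def chi_square_statistic(observed, expected):
    """Sum of (Observed - Expected)^2 / Expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need one value per category")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts in c = 4 categories (illustration only).
observed_counts = [18, 22, 30, 30]
expected_counts = [25.0, 25.0, 25.0, 25.0]

chi2 = chi_square_statistic(observed_counts, expected_counts)
df = len(observed_counts) - 1  # df = c - 1

# Assumption: critical value for alpha = 0.05 and df = 3 from a standard table.
CRITICAL_VALUE_05_DF3 = 7.815
reject_null = chi2 > CRITICAL_VALUE_05_DF3

print("chi square statistic:", chi2)
print("degrees of freedom  :", df)
print("reject H0 at 5%?    :", reject_null)
```

With these counts the statistic stays below the critical value, so the null hypothesis that the data follows the expected distribution is not rejected.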
Spreadsheet Functions:

Function | Microsoft Excel | Google Sheets | LibreOffice Calc
Returns p value for Chi Square Test | =CHISQ.TEST(observed_values, expected_values) | =CHITEST(observed_values, expected_values) | =CHISQ.TEST(observed_values; expected_values)

Ordinary Least Squares Linear Regression
Regression is a statistical method that allows users to summarize and study relationships between a dependent (Y) variable and one or more independent (X) variables. The dependent variable Y is a function of the independent variables X. It is important to keep in mind that variables have different scales (nominal/ordinal/ratio). For linear regression, the dependent variable is always on a ratio scale. The independent variables can be combinations of the different number types.

Linear Regression Model
The data $(x_i, y_i)$ are the observed pairs from which we try to estimate the β coefficients to find the best fit. The error term, ε, is the unaccounted or unexplained portion.

Linear Model:

$$y_i = \beta_0 + \beta_1 x_i$$
$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \quad \text{for } i = 1, 2, \ldots, n$$

Residuals
Because a linear regression model is not always appropriate for the data, you should assess the appropriateness of the model by examining residuals. The difference between the observed value of the dependent variable and the predicted value is called the residual.

$$\hat{y}_i = b_0 + b_1 x_i \quad \text{for } i = 1, 2, \ldots, n$$
$$e_i = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i \quad \text{for } i = 1, 2, \ldots, n$$

Ordinary Least Squares (OLS) Regression
Ordinary least squares is a method for estimating the unknown parameters in a linear regression model. It finds the optimal values of the coefficients ($b_0$ and $b_1$) that minimize the sum of the squares of the errors:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$

$$b_0 = \bar{y} - b_1 \bar{x}, \qquad b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
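The closed-form OLS estimates for a single independent variable translate directly into code. The following is a minimal sketch in plain Python; the function names and the five (x, y) pairs are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def ols_fit(xs, ys):
    """Return (b0, b1) minimizing the sum of squared residuals for y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Observed minus predicted value for every observation."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Hypothetical observations (illustration only).
x_values = [1.0, 2.0, 3.0, 4.0, 5.0]
y_values = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = ols_fit(x_values, y_values)
errors = residuals(x_values, y_values, b0, b1)

print("intercept b0:", b0)
print("slope b1    :", b1)
print("sum of squared residuals:", sum(e ** 2 for e in errors))
```

A useful sanity check on any OLS fit with an intercept is that the residuals sum to (numerically) zero; if they do not, the coefficients were not computed from the formulas above.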
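Multiple Variables
These relationships also translate to multiple variables.

$$Y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_k x_{ki} + \varepsilon_i \quad \text{for } i = 1, 2, \ldots, n$$
$$E(Y \mid x_1, x_2, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$$
$$\mathrm{StdDev}(Y \mid x_1, x_2, \ldots, x_k) = \sigma$$
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{1i} - \ldots - b_k x_{ki})^2$$

Validating a Model
All statistical software packages will provide statistics for evaluation (names and format will vary by package). The model output typically includes: model statistics (regression statistics or summary of fit), analysis of variance (ANOVA), and parameter statistics (coefficient statistics).

Overall Fit
Overall fit = how much of the variation in the dependent variable (y) can we explain?

• Total variation of the dependent variable (CPL in the lesson's example): find the dispersion around the mean. This is the Total Sum of Squares:

$$TSS = \sum (y_i - \bar{y})^2$$

• Make an estimate of Y for each x. The remaining variation is the Error or Residual Sum of Squares:

$$e_i = y_i - \hat{y}_i, \qquad RSS = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2$$

• The model explains a percentage of the total variation of the dependent variable. This is the Coefficient of Determination or Goodness of Fit ($R^2$):

$$R^2 = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

• Adjusted $R^2$ corrects for additional variables:

$$\text{adj } R^2 = 1 - \left(\frac{RSS}{TSS}\right)\left(\frac{n - 1}{n - k - 1}\right)$$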
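The overall-fit measures above can be computed directly from observed and predicted values. This is a minimal sketch under stated assumptions: the data are hypothetical, and the coefficients b0 = 0.05 and b1 = 1.99 are simply assumed here as the fitted line for those data, so that the block stands on its own. The function name `fit_statistics` is likewise an illustrative choice.

```python
def fit_statistics(observed, predicted, k):
    """Return (TSS, RSS, R^2, adjusted R^2) for a model with k independent variables."""
    n = len(observed)
    y_bar = sum(observed) / n
    tss = sum((y - y_bar) ** 2 for y in observed)                    # total variation
    rss = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))  # unexplained
    r_squared = 1 - rss / tss
    adj_r_squared = 1 - (rss / tss) * ((n - 1) / (n - k - 1))
    return tss, rss, r_squared, adj_r_squared


# Hypothetical data and an assumed fitted line y_hat = 0.05 + 1.99 * x.
x_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]
y_hat = [0.05 + 1.99 * x for x in x_obs]

tss, rss, r2, adj_r2 = fit_statistics(y_obs, y_hat, k=1)

print("TSS        :", tss)
print("RSS        :", rss)
print("R^2        :", r2)
print("adjusted R^2:", adj_r2)
```

Because the adjustment factor (n - 1)/(n - k - 1) is greater than 1 whenever k ≥ 1, adjusted R² is always below R² when the fit is imperfect, and the gap widens as more independent variables are added without a matching drop in RSS.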
Individual Coefficients
Each independent variable (and $b_0$) will have:
• An estimate of the coefficient ($b_i$)
• A standard error ($s_{b_i}$). For a single independent variable, $s_{b_1} = \dfrac{s_e}{s_x \sqrt{n - 1}}$, where
  o $s_e$ = standard error of the model, $s_e = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n - 2}}$
  o $s_x$ = standard deviation of the independent variable
  o n = number of observations
• The t statistic, $t = b_i / s_{b_i}$, with n - k - 1 degrees of freedom, where
  o k = number of independent variables
  o $b_i$ = estimate or coefficient of the independent variable
• Corresponding p value
  o Testing the slope: we want to see if there is a linear relationship, i.e. we want to see if the slope ($b_1$) is something other than zero. So: $H_0: b_1 = 0$ and $H_1: b_1 \neq 0$
  o Confidence intervals estimate an interval for the slope parameter.

Multi-Collinearity, Autocorrelation and Heteroscedasticity
Multi-collinearity is when two or more variables in a multiple regression model are highly correlated. The model might have a high $R^2$ but the explanatory variables might fail the t test. It can also produce strange results for the correlated variables.

Autocorrelation is a characteristic of data in which the correlation between the values of the same variable is based on related objects. It is typically a time series issue.

Heteroscedasticity is when the variability of a variable is unequal across the range of values of a second variable that predicts it. Regression assumes that observations have the same variance. To spot violations, examine scatter plots and look for fan-shaped distributions.

References
Hillier and Lieberman (2012) Introduction to Operations Research, McGraw Hill.
Bertsimas and Freund (2003) Data, Models, and Decisions: The Fundamentals of Management Science, Dynamic Ideas.