Applied regression. Dr. Nitiphong Songsrirote


Applied Regression
Dr. Nitiphong Songsrirote 553

A very robust statistical methodology that traditionally has used the relationships existing between variables to allow prediction of the values of one variable from one or more others.

Examples:
Sales can be predicted using advertising expenditure.
Performance on aptitude tests can be used to predict job performance.
GPA after the first year in a PhD program can be predicted from the GMAT score.

In the late 1800s, Sir Francis Galton observed that the heights of children of both short and tall parents appeared to come closer to the mean of the group: extraordinary parents gave birth to more ordinary children. Galton considered this to be a "regression to mediocrity".
Today we understand that this effect is due to the presence of other height predictors: children with parents of extraordinary height may be ordinary in other height determinants, such as nutrition.

[Figure: Sales ($) plotted against Units Sold, with the fitted equation and R-squared]

Y is the dependent or criterion variable.
X is the independent or predictor variable.
The value of Y is exactly predicted by X. No error of prediction exists: a perfect relationship between X and Y.

[Figure: scatter diagram showing the relationship between two variables, Entrance Test Score (X) and Actual GPA (Y)]

X does not perfectly predict Y: your GPA at the end of the first year cannot be exactly predicted by your score on an entrance exam.
The relationship between Score and GPA appears to be linear.

Traditional requirements:
Criterion and predictor variables were quantitative and measured at, at least, the interval level.
A linear relationship must exist between criterion and predictor variables.
Does the existence of nonlinearity in the independent variable make the model a nonlinear regression model?

Multiple regression/correlation (MRC):
Ability to "partial out" the effects of specific predictor variables on the criterion in situations in which predictors are not orthogonal to one another.
Can establish the unique contributions of each predictor to variance in the criterion.
Allows identification of "spurious" relationships.
Study systems of causal relationships: Y = f(C, D, E, etc.).
Experimental and nonexperimental designs.
Causal Modeling, Covariance Structure Modeling (C&C pp. -0)

Form of the data: more than quantitative, interval or ratio.
Data can range from nominal to ratio.
Nominally scaled predictors are the biggest departure.
Nominal variables are traditionally assessed within the context of ANOVA, ANCOVA (by "grouping" the values).
Can mix scale types in MRC.

Shape of the relationship need not be linear.
Predictors may be linear or nonlinear.
Transformations of nonlinear data are possible to produce the linearity required for the regression model.

[Figure: scatter diagram showing the relationship between two variables, Motivation (X) and d-s Score (Y)]

Motivation (X) does not perfectly predict d-s Score (Y).
The relationship between Motivation and d-s Score appears to be curvilinear.

Investigate conditional relationships:
Interactions between predictor variables or groups of variables.
Extends ANOVA, ANCOVA.
Not limited to interactions between nominally scaled variables.
Can assess interactions between predictors measured at virtually any level.
Extremely common in behavioral sciences.

Population Regression Function

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

$Y_i$ = value of the observed response on the $i$th trial;
$\beta_0$ and $\beta_1$ are parameters;
$X_i$ = value of the predictor on the $i$th trial (a constant);
$\varepsilon_i$ is a random error term;
$E\{\varepsilon_i\} = 0$ (expected value of the error terms is zero);
$\sigma^2\{\varepsilon_i\} = \sigma^2$ (variance of the error terms is constant);
$\sigma\{\varepsilon_i, \varepsilon_j\} = 0$ for all $i, j$; $i \neq j$ (error terms do not covary, i.e. are not correlated);
$i = 1, \ldots, n$.

$E\{Y_i\}$

Each $Y_i$ consists of two parts: (1) a constant term predicted by the regression equation; and (2) a random error term unique to $Y_i$. The error term makes $Y_i$ a random variable.

$$E\{Y_i\} = E\{\beta_0 + \beta_1 X_i + \varepsilon_i\} = \beta_0 + \beta_1 X_i + E\{\varepsilon_i\} = \beta_0 + \beta_1 X_i$$

The regression function predicts the expected value of Y for a given X.
$\beta_1$ = the change in the mean of the probability distribution for Y for each unit increase in X.
$\beta_0$ = Y-intercept; the mean of the probability distribution for Y when X = 0. Assumes the scope of the model includes X = 0.
Values of Y come from a probability distribution with mean $E\{Y\} = \beta_0 + \beta_1 X$.

[Figure: Prediction of GPA at end of first year based on GMAT; First Year GPA plotted against GMAT Score, with the line $E\{Y\} = \beta_0 + \beta_1 X$]

The regression function specifies the relationship between predictor and response variables in a population.
Values of the regression parameters ($\beta_0$ and $\beta_1$) are estimated from sample data drawn from the population. Data are obtained via:
Observation
Experimentation
Survey

Technique employed to produce estimates $b_0$ and $b_1$ for $\beta_0$ and $\beta_1$, respectively.
Find those values of $b_0$ and $b_1$ that minimize the sum of all squared error terms (Q):

$$Q = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$$

The estimators of $\beta_0$ and $\beta_1$ are the values of $b_0$ and $b_1$ that minimize Q for a set of sample observations.

Assume from the GPA example that $b_0 = -2.5$ and $b_1 = .01$.

[Table: Predicted GPA, GMAT, GPA, Error; Sum = Q]

$$Q = \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 = 3.64$$

[Figure: GPA plotted against GMAT with the line for $b_0 = -2.5$; $b_1 = .01$]

Looks pretty good! Seems quite reasonable, but are there other values of $b_0$ and $b_1$ that provide smaller Q's for the sample data?

[Figure: GPA plotted against GMAT with the line for $b_0 = -1.70$; $b_1 = .0084$]

$Q = 3.41$. Looks even better! This is the least squares solution that minimizes Q. No other values of $b_0$ and $b_1$ will provide a smaller value of Q. Let's see a small macro.

[Table: GMAT Score (X), GPA (Y), Predicted GPA ($\hat{Y}$), Error Terms, Squared Error Terms; Q = 3.41]

The solution that minimizes Q can be found by:
Numerical Search Procedures
Analytic Procedures

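Numerical search: Unconstrained Optimization Algorithms
Systematically search for values of $b_0$ and $b_1$ that minimize Q for a given set of data.
Spreadsheet solution possible.

Excel Example Using GMAT data, starting from $b_0$ = -5, $b_1$ = 0.05:

[Table: GMAT Score (X), GPA (Y), Predicted GPA ($\hat{Y}$), Error Terms ($e$), Squared Error Terms ($e^2$); Q = 86.4]

$$\hat{Y}_i = b_0 + b_1 X_i, \qquad e_i = Y_i - \hat{Y}_i, \qquad Q = \sum_{i=1}^{n} e_i^2$$

Analytic procedure: direct solution for the values of $b_0$ and $b_1$ that minimize Q.
Using calculus we can find a set of simultaneous equations, the "normal equations".
The normal equations for $b_0$ and $b_1$ are:

$$\sum Y_i = n b_0 + b_1 \sum X_i$$
$$\sum X_i Y_i = b_0 \sum X_i + b_1 \sum X_i^2$$

Solving them gives:

$$b_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}, \qquad b_0 = \frac{1}{n}\left(\sum Y_i - b_1 \sum X_i\right) = \bar{Y} - b_1 \bar{X}$$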

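[Table: GMAT Score (X), GPA (Y), $(X_i - \bar{X})$, $(Y_i - \bar{Y})$, $(X_i - \bar{X})(Y_i - \bar{Y})$, $(X_i - \bar{X})^2$, with Total and Mean rows]

$$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = 766, \qquad \sum (X_i - \bar{X})^2 = 91{,}200$$

$$b_1 = \frac{766}{91{,}200} = .0084, \qquad b_0 = \bar{Y} - b_1 \bar{X} = 2.5 - .0084(500) = -1.70$$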
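The least squares calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own macro: the six (GMAT, GPA) pairs are a hypothetical subset chosen only for the example, and the function names `least_squares` and `q` are assumptions.

```python
def least_squares(xs, ys):
    """Solve the normal equations for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx            # slope
    b0 = y_bar - b1 * x_bar     # intercept
    return b0, b1


def q(b0, b1, xs, ys):
    """Sum of squared errors for a candidate line b0 + b1 * X."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustration data (not the full GMAT data set).
gmat = [390, 430, 470, 500, 540, 590]
gpa = [1.9, 1.6, 2.8, 2.3, 2.9, 3.8]

b0, b1 = least_squares(gmat, gpa)
q_min = q(b0, b1, gmat, gpa)

# Any other candidate line gives a Q at least as large.
q_trial = q(-2.5, 0.01, gmat, gpa)
print(round(b0, 4), round(b1, 5), round(q_min, 4), round(q_trial, 4))
```

Comparing `q_min` with `q_trial` mirrors the slides' comparison of a reasonable-looking trial line with the least squares line.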


GMAT Score (X)   GPA (Y)   Predicted GPA ($\hat{Y}$)
390              1.90      1.58
410              2.20      1.74
430              1.60      1.91
430              1.40      1.91
450              ?.50      2.08
460              1.80      2.16
470              3.00      2.25
470              2.80      2.25

Note the difference between observed value and fitted value.

$$\hat{Y} = b_0 + b_1 X$$

$\hat{Y}$ is the estimate of $E\{Y\}$, the mean response, when the level of the predictor is X.
$b_0$ and $b_1$ are estimates of $\beta_0$ and $\beta_1$, respectively.
$\hat{Y}_i$ is the fitted value for the $i$th case (i.e. when $X = X_i$).

For the GMAT example, $b_0 = -1.70$; $b_1 = .0084$:

$$\hat{Y} = -1.70 + .0084 X$$

[Table: GMAT Score (X), GPA (Y), Predicted GPA ($\hat{Y}$)]

Note the difference between observed values and fitted values.

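[Table: GMAT Score (X), GPA (Y), Predicted GPA ($\hat{Y}$), Error Terms ($e$), Squared Error Terms ($e^2$)]

$$\hat{Y}_i = b_0 + b_1 X_i, \qquad e_i = Y_i - \hat{Y}_i$$

$e_i$ (residual) is the known deviation between the observed value and the fitted value.
$\varepsilon_i$ (model error term) is the deviation between the observed value and the unknown true regression line.
$e_i$ is an estimate of $\varepsilon_i$.

Properties of the fitted regression line follow from properties of the normal equations...

[Table: GMAT Score (X), GPA (Y), $\hat{Y}$, $e$, $e^2$, $Xe$, $\hat{Y}e$, with Sums and Average rows; the averages of X and Y are 500 and 2.5]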
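The properties that follow from the normal equations can be checked numerically: the residuals sum to zero, and so do the products $X_i e_i$ and $\hat{Y}_i e_i$. The sketch below uses hypothetical data and an assumed helper name `fit_and_residuals`.

```python
def fit_and_residuals(xs, ys):
    """Fit Y = b0 + b1*X by least squares; return fitted values and residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    residuals = [y - f for y, f in zip(ys, fitted)]
    return fitted, residuals


# Hypothetical illustration data.
xs = [390, 430, 470, 500, 540, 590]
ys = [1.9, 1.6, 2.8, 2.3, 2.9, 3.8]

fitted, e = fit_and_residuals(xs, ys)

sum_e = sum(e)                                     # residuals sum to zero
sum_xe = sum(x * r for x, r in zip(xs, e))         # X-weighted residuals sum to zero
sum_fe = sum(f * r for f, r in zip(fitted, e))     # fitted-weighted residuals sum to zero
sum_fitted_equals_sum_y = abs(sum(fitted) - sum(ys)) < 1e-9
print(sum_e, sum_xe, sum_fe, sum_fitted_equals_sum_y)
```

The printed sums are zero up to floating-point rounding.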

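The unbiased estimator of $\sigma^2$ is MSE:

$$MSE = \frac{SSE}{df} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} = \frac{\sum e_i^2}{n - 2}$$

For the GMAT example:

$$MSE = \frac{\sum e_i^2}{n - 2} = \frac{3.41}{18} = .189$$

[SPSS output: ANOVA table with rows Regression, Residual, Total and columns Sum of Squares, df, Mean Square, F, Sig.; Predictors: (Constant), GMAT; Dependent Variable: GPA]

An extension of the regression model is required to:
make inferences about estimators;
conduct significance tests;
construct confidence intervals around estimates.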


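Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

$Y_i$ = value of the observed response on the $i$th trial;
$\beta_0$ and $\beta_1$ are parameters;
$X_i$ = value of the predictor on the $i$th trial (a constant);
$\varepsilon_i$ is a random error term;
$E\{\varepsilon_i\} = 0$ (expected value of the error terms is zero);
$\sigma^2\{\varepsilon_i\} = \sigma^2$ (variance of the error terms is constant);
$\sigma\{\varepsilon_i, \varepsilon_j\} = 0$ for all $i, j$; $i \neq j$ (error terms do not covary, i.e. are not correlated);
$i = 1, \ldots, n$.

Normal Error Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

$\beta_0$ and $\beta_1$ are parameters;
$X_i$ = value of the predictor on the $i$th trial (a constant); $i = 1, \ldots, n$;
$Y_i$ = value of the observed response on the $i$th trial; the $Y_i$ are independent normal random variables;
$E\{Y_i\} = \beta_0 + \beta_1 X_i$; the variance of $Y_i$ is $\sigma^2$;
$\varepsilon_i$ is a random error term, $N(0, \sigma^2)$;
$E\{\varepsilon_i\} = 0$ (expected value of the error terms is zero);
$\sigma^2\{\varepsilon_i\} = \sigma^2$ (variance of the error terms is constant);
$\sigma\{\varepsilon_i, \varepsilon_j\} = 0$ for all $i, j$; $i \neq j$ (error terms are independent; do not covary; are not correlated);
error terms are normally distributed.

Same as the Regression Model, except that the error terms are now assumed to be normally distributed.

Maximum Likelihood Estimation
Requires the functional form of the probability distribution of the random error terms.
Provides estimates of the required parameters that are most consistent with the sample data.
In the case of simple linear regression, the MLE estimators for $\beta_0$ and $\beta_1$ are BLUE.
The MLE estimator for $\sigma^2$ is biased but works out OK when the sample size is large.

Ch. 2 Inferences in Simple Linear Regression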

Normal error regression model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, $i = 1, \ldots, n$, where $\beta_0$ and $\beta_1$ are parameters, $X_i$ is the value of the predictor on the $i$th trial (a constant), the $Y_i$ are independent normal random variables with $E\{Y_i\} = \beta_0 + \beta_1 X_i$ and variance $\sigma^2$, and the error terms $\varepsilon_i$ are independent $N(0, \sigma^2)$.

Usual inference about $\beta_1$:
Null Hypothesis $H_0: \beta_1 = 0$
Alternative Hypothesis $H_a: \beta_1 \neq 0$

Under $H_0: \beta_1 = 0$:
The slope of the regression line is 0;
There is no linear relationship between X and Y;
The regression line is horizontal;
The means of the probability distributions for all Y are equal: $E\{Y\} = \beta_0 + (0)X = \beta_0$;
The probability distributions of all Y are identical.


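Rows of the deviations table for the GMAT example (mean GMAT = 500, mean GPA = 2.5); cells shown as ... are cut off:

GMAT (X)  GPA (Y)  $X - \bar{X}$  $Y - \bar{Y}$  $(X - \bar{X})(Y - \bar{Y})$  $(X - \bar{X})^2$  $(Y - \bar{Y})^2$
550       3.10     50.00      0.60      30.0000    2500.00    0.36
480       2.30     -20.00     -0.20     4.0000     400.00     0.04
470       3.00     -30.00     0.50      -15.0000   900.00     0.25
390       1.90     -110.00    -0.60     66.0000    ...        ...
520       2.60     20.00      0.10      2.0000     400.00     0.01
470       2.80     -30.00     0.30      -9.0000    900.00     0.09
430       1.60     -70.00     -0.90     63.0000    4900.00    0.81
490       2.00     -10.00     -0.50     5.0000     ...        ...
...       ...      ...        ...       ...        ...        0.49
460       1.80     -40.00     -0.70     28.0000    1600.00    0.49
430       1.40     -70.00     -1.10     77.0000    4900.00    1.21
500       2.00     0.00       -0.50     0.0000     0.00       0.25
590       3.80     90.00      ...       ...        ...        ...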

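Sampling distribution of $b_1$:

$$\frac{b_1 - \beta_1}{\sigma\{b_1\}} \sim N(0, 1)$$

$$\frac{b_1 - \beta_1}{s\{b_1\}}$$

is a Studentized test statistic ($\sigma$ is estimated). It is distributed as $t(n - 2)$.

$$s^2\{b_1\} = \frac{MSE}{\sum (X_i - \bar{X})^2}$$

Point estimators:
$s^2\{b_1\}$ is an unbiased estimator of $\sigma^2\{b_1\}$;
$s\{b_1\} = \sqrt{s^2\{b_1\}}$.

[Table: GMAT Score (X), GPA (Y), $(X_i - \bar{X})$, $(Y_i - \bar{Y})$, $(X_i - \bar{X})(Y_i - \bar{Y})$, $(X_i - \bar{X})^2$, with Total and Mean rows; the means of X and Y are 500 and 2.5]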

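Rows of the residual table for the GMAT example (products rounded to two decimals; cells shown as ... are cut off):

GMAT (X)  GPA (Y)  $\hat{Y}$    $e$       $e^2$    $Xe$       $\hat{Y}e$
390       1.90     1.5760   0.3239    0.1049   126.34     0.51
410       2.20     1.7440   0.4559    0.2079   186.94     0.80
430       1.60     1.9120   -0.3120   0.0974   -134.18    -0.60
430       1.40     1.9120   -0.5120   0.2622   -220.18    -0.98
450       ...      ...      ...       ...      ...        ...
480       2.30     2.3320   -0.0320   0.0010   -15.36     -0.07
490       2.00     2.4160   -0.4160   0.1730   -203.83    -1.00
500       2.30     2.5000   -0.2000   0.0400   -99.99     -0.50
500       2.00     2.5000   -0.5000   0.2500   -249.98    -1.25
520       2.60     2.6679   -0.0679   0.0046   -35.34     -0.18
540       2.90     2.8359   0.0641    ...      ...        ...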

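[Table: GMAT Score (X), GPA (Y), $\hat{Y}$, $e$, $e^2$, $Xe$, $\hat{Y}e$, with Sums and Average rows]

$$s^2\{b_1\} = \frac{MSE}{\sum (X_i - \bar{X})^2} = \frac{.189}{91{,}200}, \qquad s\{b_1\} = .0014$$

Test of $H_0: \beta_1 = 0$ against $H_a: \beta_1 \neq 0$
Control the risk of Type I error at $\alpha = .05$.
Test statistic:

$$t^* = \frac{b_1}{s\{b_1\}}$$

If $|t^*| \le t(1 - \alpha/2;\, n - 2)$, conclude $H_0$.
If $|t^*| > t(1 - \alpha/2;\, n - 2)$, conclude not $H_0$.

$$t(1 - \alpha/2;\, n - 2) = t(.975;\, 18) = 2.101$$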

$b_1 = .0084$; $s\{b_1\} = .0014$
$t^* = .0084 / .0014 = 5.83$ (computed from the unrounded values of $b_1$ and $s\{b_1\}$)
$t(.975;\, 18) = 2.101$ (critical t)
$t^* > t$, therefore conclude not $H_0$.
The null hypothesis must be rejected...

[SPSS output: Coefficients table with rows (Constant) and GMAT and columns Unstandardized Coefficients (B, Std. Error), Standardized Coefficients (Beta), t, Sig.; Dependent Variable: GPA]

SPSS computes the probability of the two-tailed t directly, which is much better!

One-tailed test
Assume for the GMAT example that we think the relationship between GMAT and GPA should always be positive...
Null and Alternative Hypotheses:
$H_0: \beta_1 \le 0$
$H_a: \beta_1 > 0$
If $t^* \le t(1 - \alpha;\, n - 2)$, conclude $H_0$.
If $t^* > t(1 - \alpha;\, n - 2)$, conclude not $H_0$.
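The two-tailed t test can be reproduced from summary quantities. The sketch below assumes n = 20, $\sum (X_i - \bar{X})(Y_i - \bar{Y})$ = 766, $\sum (X_i - \bar{X})^2$ = 91,200 and SSTO = 9.84 for the GMAT example; these inputs are assumptions chosen because they reproduce the reported $b_1$ = .0084 and $t^*$ = 5.83. The critical value 2.101 is read from a t table rather than computed.

```python
import math

# Assumed summary quantities for the GMAT example.
n = 20
s_xy = 766.0       # sum of cross-products of deviations
s_xx = 91200.0     # sum of squared X deviations
ssto = 9.84        # total sum of squares of Y

b1 = s_xy / s_xx                 # slope estimate
ssr = b1 * s_xy                  # regression sum of squares
sse = ssto - ssr                 # error sum of squares
mse = sse / (n - 2)              # unbiased estimate of the error variance
s_b1 = math.sqrt(mse / s_xx)     # estimated standard deviation of b1
t_star = b1 / s_b1               # test statistic for H0: beta1 = 0

t_crit = 2.101                   # t(.975; 18), taken from a t table
reject_h0 = abs(t_star) > t_crit
print(round(b1, 4), round(s_b1, 4), round(t_star, 2), reject_h0)
```

Dividing the rounded values (.0084 / .0014) would give 6.0; carrying full precision gives the 5.83 reported in the slides.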


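$t^* = 5.83$ (same as before)
The critical t is now smaller: all 5% is in one tail, rather than spread across two tails.
$t(.95;\, 18) = 1.734$
$t^* > t$, conclude not $H_0$.
$\beta_1$ probably is not less than or equal to 0.
May assume $\beta_1$ is positive.

ANOVA approach
ANOVA partitions the sum of squares (SS) in the criterion variable into two parts:
SS that can be attributed to the predictor; and
Error SS, the sum of squares unique to the criterion.

Total SS in the criterion is SSTO.
SS attributed to the predictor is SSR.
Error or unique SS is SSE.

$$SSTO = \sum (Y_i - \bar{Y})^2, \qquad SSR = \sum (\hat{Y}_i - \bar{Y})^2, \qquad SSE = \sum (Y_i - \hat{Y}_i)^2$$

$$SSTO = SSR + SSE$$

$$Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i), \qquad \hat{Y}_i = b_0 + b_1 X_i$$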

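Source       SS                                   df      MS                  F*
Regression   $SSR = \sum (\hat{Y}_i - \bar{Y})^2$        1       MSR = SSR/1         MSR/MSE
Error        $SSE = \sum (Y_i - \hat{Y}_i)^2$          n - 2   MSE = SSE/(n - 2)
Total        $SSTO = \sum (Y_i - \bar{Y})^2$         n - 1

$E\{MSR\} = \sigma^2 + \beta_1^2 \sum (X_i - \bar{X})^2$
$E\{MSE\} = \sigma^2$, i.e. MSE is an unbiased estimator of the error variance.
If $\beta_1 = 0$, MSR and MSE are about the same size and F* will be small...

[Table: $Y_i$, $\hat{Y}_i$, $(\hat{Y}_i - \bar{Y})^2$, $(Y_i - \hat{Y}_i)^2$, $(Y_i - \bar{Y})^2$ for the GMAT example, with Totals and Averages rows; the average of Y is 2.50]

[Table: Source, SS, df, MS, F*, p for the GMAT example, with rows Regression, Error, Total]

[SPSS output: ANOVA table with rows Regression, Residual, Total and columns Sum of Squares, df, Mean Square, F, Sig.; Predictors: (Constant), GMAT; Dependent Variable: GPA]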


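The appropriate test is F.
It is an upper-tail test.
Under $H_0$, F* is distributed as $F(1, n - 2)$.
$H_0: \beta_1 = 0$; $H_a: \beta_1 \neq 0$
Decision rule:
If $F^* \le F(1 - \alpha;\, 1, n - 2)$, conclude $H_0$.
If $F^* > F(1 - \alpha;\, 1, n - 2)$, conclude not $H_0$.

For the GMAT example:
$F^* = 34.005$, $F(.95;\, 1, 18) = 4.41$
Conclude not $H_0$.
In simple regression (i.e. a single predictor variable is employed for a given Y), $F^* = (t^*)^2$, where t is two-tailed.

General linear test approach
Fit a "full model" to the data and obtain

$$SSE(F) = \sum (Y_i - \hat{Y}_i)^2$$

This is simply the SSE obtained from fitting a standard regression line to the data: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$.
Fit a "reduced model" to the data and obtain SSE(R).
Consider $H_0$, usually $H_0: \beta_1 = 0$.
The model when $H_0$ holds is the reduced model.
When $\beta_1 = 0$, the model reduces to $Y_i = \beta_0 + \varepsilon_i$.
Because the best estimator of $\beta_0$ is then $\bar{Y}$, $SSE(R) = \sum (Y_i - \bar{Y})^2 = SSTO$.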

Since SSE(R) = SSTO, $df_R = n - 1$.
Since SSE(F) = SSE, $df_F = n - 2$.

$$F^* = \frac{SSE(R) - SSE(F)}{df_R - df_F} \div \frac{SSE(F)}{df_F}$$

If $F^* \le F(1 - \alpha;\, df_R - df_F,\, df_F)$, conclude $H_0$.
If $F^* > F(1 - \alpha;\, df_R - df_F,\, df_F)$, conclude not $H_0$.

$r^2$ is the coefficient of determination.
$r^2 = SSR/SSTO = 1 - SSE/SSTO$
$0 \le r^2 \le 1$
$r^2$ is the proportion of variance in the criterion associated with the use of the predictor.
When all observations fall directly on the regression line, the predictor perfectly explains all variation in the criterion and $r^2 = 1$.
When the regression line is horizontal ($b_1 = 0$), SSE = SSTO and $r^2 = 0$. (Caveat: What happens when the line is horizontal but all points fall on it?)

[SPSS output: ANOVA table with rows Regression, Residual, Total and columns Sum of Squares, df, Mean Square, F, Sig.; Predictors: (Constant), GMAT; Dependent Variable: GPA]

$r^2 = SSR/SSTO = 6.434/9.84 = .654$

Measures of Strength of Association

[SPSS output: Model Summary with columns R, R Square, Adjusted R Square, Std. Error of the Estimate; R = .809; Predictors: (Constant), GMAT]

R is the coefficient of correlation.
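The general linear test and the coefficient of determination can be sketched from the ANOVA quantities quoted for the GMAT example (SSR = 6.434, SSTO = 9.84). The sample size n = 20 is an assumption inferred from the 18 error degrees of freedom, the function name is hypothetical, and the critical value 4.41 is a table value.

```python
def general_linear_test(sse_reduced, df_reduced, sse_full, df_full):
    """F* = [(SSE(R) - SSE(F)) / (df_R - df_F)] / [SSE(F) / df_F]."""
    numerator = (sse_reduced - sse_full) / (df_reduced - df_full)
    denominator = sse_full / df_full
    return numerator / denominator


n = 20               # assumed sample size (18 error degrees of freedom)
ssto = 9.84          # total sum of squares; equals SSE(R) when H0: beta1 = 0
ssr = 6.434          # regression sum of squares
sse = ssto - ssr     # SSE of the full model

f_star = general_linear_test(ssto, n - 1, sse, n - 2)
r_squared = ssr / ssto

f_crit = 4.41        # F(.95; 1, 18), taken from an F table
print(round(f_star, 2), round(r_squared, 3), f_star > f_crit)
```

With these rounded inputs F* comes out near 34.0, in line with the 34.005 shown in the slides, and close to the square of t* = 5.83.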

Correlation Coefficient
Pearson's Product-Moment Correlation
$r_{xy}$ = correlation between two continuous variables measured at least at interval level
$-1 \le r_{xy} \le +1$
Actually, the attainable bound depends on $sd_X$, $sd_Y$ and $n$ (prove this as an exercise).

Pearson's product-moment correlation (continued)
$r = \pm\sqrt{r^2}$
Correlation between two continuous variables measured at least at interval level.
Unlike $r^2$, it does not have a clear-cut interpretation.
Used extensively in behavioral research.
Inflates the apparent relationship between X and Y.

When data are in their original metric,

$$b_1 = r \, \frac{s_Y}{s_X}$$

When data are standardized,

$$\hat{z}_y = r z_x, \qquad b_1 = r$$

A high r or $r^2$ may not imply strong predictive capability:
r as high as .9 ($r^2$ = .81) can still have wide confidence intervals for the estimate.
Always compute confidence or prediction intervals.

Does a high r or $r^2$ always suggest that the regression line is a good fit?
Only if the relationship is linear.
Can still get a relatively high r, $r^2$ if the relationship is curvilinear.

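Does a low r or $r^2$ always suggest that X and Y are not related, or are weakly related?
Only if the relationship is linear.
Can still get a very low r, $r^2$ if the relationship is curvilinear.
Transforming X, Y, or both prior to constructing the regression model may improve the fit (later topic).

Formulae for r

When X and Y are standardized:

$$r_{xy} = \frac{\sum z_x z_y}{n}$$

When X and Y are in raw score form (non-standardized):

$$r_{xy} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}$$

Is $r_{xy}$ a least squares estimator or an MLE estimator? Is this estimator unbiased? What is it an estimator of?
Isn't this the same as r from a simple linear regression model?

Point Biserial r (one variable dichotomous):

$$r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{sd_Y} \sqrt{pq}$$

Phi Coefficient (both X and Y dichotomous), written here in terms of the cell counts $n_{jk}$ of the 2 x 2 table and its row and column totals:

$$r_{\phi} = \frac{n_{11} n_{00} - n_{10} n_{01}}{\sqrt{n_{1\cdot}\, n_{0\cdot}\, n_{\cdot 1}\, n_{\cdot 0}}}$$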

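Inferences on Correlation Coefficients
Bivariate Normal Population
Interpretation of $\rho$ is important.

Testing $H_0: \rho = 0$ (relate this to Simple Linear Regression!)
If $H_0$ holds, then $t^*$ given below is distributed as $t(n - 2)$:

$$t^* = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$

Interval Estimation of $\rho$
The sampling distribution of r is complicated when $\rho \neq 0$. Cannot use t!
The Fisher z transformation:

$$z' = \frac{1}{2} \log_e \frac{1 + r}{1 - r}$$

If $n \ge 25$ then $z' \sim N(\zeta, \sigma^2\{z'\})$, where

$$E(z') = \zeta = \frac{1}{2} \log_e \frac{1 + \rho}{1 - \rho}, \qquad \sigma^2\{z'\} = \frac{1}{n - 3}$$

Estimation of $\rho$ (continued)
The CI for $\zeta$ is then

$$z' \pm z(1 - \alpha/2)\, \sigma\{z'\}$$

We have to retransform back to $\rho$ in order to get its CI:

$$\tanh\!\left(\operatorname{arctanh}(r_{xy}) - \frac{z_{\alpha/2}}{\sqrt{n - 3}}\right) < \rho < \tanh\!\left(\operatorname{arctanh}(r_{xy}) + \frac{z_{\alpha/2}}{\sqrt{n - 3}}\right)$$

See KNN for testing hypotheses about independent samples from two bivariate normal populations.

What if the populations are not Normal? Resort to a non-parametric approach.
The famous Spearman Rank Correlation coefficient, computed from the ranks $R_{X_i}$ and $R_{Y_i}$:

$$r_s = \frac{\sum (R_{X_i} - \bar{R}_X)(R_{Y_i} - \bar{R}_Y)}{\sqrt{\sum (R_{X_i} - \bar{R}_X)^2 \sum (R_{Y_i} - \bar{R}_Y)^2}}$$

If there are no ties in the ranks then we can use the more commonly found form,

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the two ranks of case $i$.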
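The Fisher z interval and the no-ties Spearman formula can be sketched with the Python standard library. The inputs r = 0.6 and n = 30, the small data vectors, and the function names are all hypothetical illustrations.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, alpha=0.05):
    """Approximate confidence interval for rho via the Fisher z transformation."""
    z_prime = math.atanh(r)  # equals 0.5 * log((1 + r) / (1 - r))
    half_width = NormalDist().inv_cdf(1 - alpha / 2) / math.sqrt(n - 3)
    # Transform the interval for zeta back to the rho scale.
    return math.tanh(z_prime - half_width), math.tanh(z_prime + half_width)


def ranks(values):
    """Ranks 1..n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_no_ties(xs, ys):
    """r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid only without ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


low, high = fisher_ci(0.6, 30)
r_s = spearman_no_ties([1.2, 3.4, 2.2, 5.0, 4.1], [10, 30, 25, 60, 40])
print(round(low, 3), round(high, 3), r_s)
```

The interval is not symmetric around r, which is expected because the back-transformation is nonlinear.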

Hypothesis Test for Population Correlation Coefficient
$H_0$: No association between X and Y
$H_a$: There is association between X and Y
The sampling distribution of $r_s$ is available in tables and is not too complicated.
However, when $n > 10$ we can use

$$t^* = \frac{r_s \sqrt{n - 2}}{\sqrt{1 - r_s^2}}$$

as in the Normal case.
Spearman's rank correlation coefficient is also used to test for heteroscedasticity.

KNN Ch. 3 Diagnostics and Remedial Measures

Diagnostics for the predictor:
Dot Plots
Sequence Plots
Stem and Leaf Plots
Essentially to check for outlying observations, which will be useful in later diagnosis.

Why Look at the Residuals?
Detect nonlinearity of the regression function
Detect heteroscedasticity (= lack of constant variance)
Autocorrelation
Outliers
Nonnormality
Important predictor variables left out?

Regression Model Assumptions:
Errors are Independent (Have Zero Covariance)
Errors have Constant Variance
Errors are Normally Distributed

Diagnostics for Residuals
Detect non-linearity of the regression function
Heteroscedasticity
Auto-correlation
Outliers
Non-normality
Important predictor variables left out?

PLOT OF RESIDUALS
1. against the predictor (if only 1)
2. (Absolute or Sqd.) Residual against the predictor
3. against fitted values (for many X)
4. against time
5. against omitted predictor variables
6. Box plot
7. Normal probability plot

Normal probability plot: the approximate expected value of the $k$th smallest residual is

$$\sqrt{MSE}\; z\!\left(\frac{k - .375}{n + .25}\right)$$

Tests involving Residuals
The Correlation test for Normality
$H_0$: The residuals are normal
$H_A$: The residuals are not normal
Correlation between the $e_i$'s and their expected values under normality. Use Table B.6.
The observed coefficient of correlation should be at least as large as the table value for a given level of significance.

Tests involving Residuals
Other tests for Normality
$H_0$: The residuals are normal
$H_A$: The residuals are not normal
Anderson-Darling (very powerful, may be used for small sets, n < 25)
Ryan-Joiner
Shapiro-Wilk
Kolmogorov-Smirnov

Tests involving Residuals (Constancy of Error Variance)
The Modified Levene Test
Partitions the independent variable into two groups (high values and low values), then tests the null $H_0$: The groups have equal variances.
Similar to a pooled-variance t-test for the difference in two means of independent samples.
It is robust to departures from normality of the error terms.
Large sample size is essential so that dependencies of the error terms on each other can be neglected.
Uses the group median instead of the mean (Why?)

$$t_L^* = \frac{\bar{d}_1 - \bar{d}_2}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

where

$$d_{i1} = |e_{i1} - \tilde{e}_1|, \qquad d_{i2} = |e_{i2} - \tilde{e}_2|, \qquad s^2 = \frac{\sum (d_{i1} - \bar{d}_1)^2 + \sum (d_{i2} - \bar{d}_2)^2}{n - 2}$$

and $\tilde{e}_1$, $\tilde{e}_2$ are the group medians of the residuals.
Now the $d_{i1}$ and $d_{i2}$ are the data points, i.e. the t-test is based on these two sets of data points.
Read the Comments on page 8 and go through the Breusch-Pagan test on page

F Test for Lack of Fit
A comparison of the "Full Model" sum of squares error and the "Lack of Fit" sum of squares.
For best results, requires repeat observations at, at least, one X level.
Full model: $Y_{ij} = \mu_j + \varepsilon_{ij}$ ($\mu_j$ = mean response when $X = X_j$)
Reduced model: $Y_{ij} = \beta_0 + \beta_1 X_j + \varepsilon_{ij}$ (Why "Reduced"?)

$$SSE(\text{Full}) = SSPE = \sum_j \sum_i (Y_{ij} - \bar{Y}_j)^2$$

(Labeled "Pure Error" since it is an unbiased estimator of the true error variance. See 3.3 and 3.3, page 3)
SSLF = SSE(Reduced) - SSPE (where SSE(Reduced) = SSE from the ordinary least squares regression model)
Test statistic (what is p?):

$$F^* = \frac{SSLF / (c - p)}{SSPE / (n - c)}$$

Be sure to compare the ANOVA table on page 6 with holsanova table.

Overview of some Remedial Measures
The Problem: Simple Linear Regression is not appropriate.
The solution:
1. Abandon the model ("Eagle to Hawk; abort mission and return to base").
2. Remedy the situation:
If non-independent error terms, then work with a model that calls for correlated error terms (Ch. ).
If heteroscedasticity, then use the WLS method to estimate parameters (Ch. 0) or use transformations of the data.
If the scatter plot indicates non-linearity, then either use a non-linear regression function (Ch.7) or transform to linear.
NEXT: We will look at one such powerful transformation method.

The Box-Cox Transformation Method
The family of power transforms on Y is given as $Y' = Y^{\lambda}$.
The family easily includes simple transforms such as the square root, squared etc. By definition, when $\lambda = 0$ then $Y' = \log_e Y$.
When the response variable is so transformed, the normal error regression model becomes:

$$Y_i^{\lambda} = \beta_0 + \beta_1 X_i + \varepsilon_i$$

We would like to determine the "best" value of $\lambda$.

Method 1: Maximum likelihood estimation

$$\max_{\lambda, \beta_0, \beta_1, \sigma^2} L = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (Y_i^{\lambda} - \beta_0 - \beta_1 X_i)^2\right]$$

Method 2: Numerical Search
Step 1: Set a value of $\lambda$.
Step 2: Standardize the observations.
If $\lambda \neq 0$: $W_i = K_1 (Y_i^{\lambda} - 1)$
If $\lambda = 0$: $W_i = K_2 (\log_e Y_i)$
where $K_2 = \left(\prod_{i=1}^{n} Y_i\right)^{1/n}$ and $K_1 = \dfrac{1}{\lambda K_2^{\lambda - 1}}$.
Step 3: Now regress the set W on the set X.
Step 4: Note the corresponding SSE.
Step 5: Change $\lambda$ and repeat steps 2 to 4 until the lowest SSE is obtained.

Let's try this method with the GMAT data. What should we get as the best $\lambda$?
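The numerical search steps for Box-Cox can be sketched as a simple grid search. This is a minimal illustration under stated assumptions: the data are synthetic (Y is constructed so that its square root is exactly linear in X), the grid of candidate $\lambda$ values is arbitrary, and the function names are hypothetical.

```python
import math


def sse_simple_regression(xs, ws):
    """SSE from regressing W on X with an intercept."""
    n = len(xs)
    x_bar = sum(xs) / n
    w_bar = sum(ws) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xw = sum((x - x_bar) * (w - w_bar) for x, w in zip(xs, ws))
    s_ww = sum((w - w_bar) ** 2 for w in ws)
    return s_ww - s_xw ** 2 / s_xx


def box_cox_sse(xs, ys, lam):
    """Steps 2 to 4: standardize Y for this lambda, regress on X, return SSE."""
    n = len(ys)
    k2 = math.exp(sum(math.log(y) for y in ys) / n)   # geometric mean of Y
    if lam == 0:
        ws = [k2 * math.log(y) for y in ys]
    else:
        k1 = 1.0 / (lam * k2 ** (lam - 1))
        ws = [k1 * (y ** lam - 1) for y in ys]
    return sse_simple_regression(xs, ws)


def best_lambda(xs, ys, grid):
    """Step 5: keep the lambda with the lowest SSE over the grid."""
    return min(grid, key=lambda lam: box_cox_sse(xs, ys, lam))


# Synthetic data: sqrt(Y) = 1 + 0.5 * X exactly, so lambda = 0.5 should win.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [(1 + 0.5 * x) ** 2 for x in xs]
grid = [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]

lam_hat = best_lambda(xs, ys, grid)
print(lam_hat)
```

The standardization by $K_1$ and $K_2$ is what makes the SSE values comparable across different $\lambda$; without it the scale of W would change with $\lambda$.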

KNN Ch. 4 Simultaneous Inferences and Other Topics

Confidence intervals are used for a single parameter, confidence regions for two or more parameters.
The region for ($\beta_0$, $\beta_1$) defines a set of lines.
Since $b_0$ and $b_1$ are (jointly) normal, the natural confidence region is an ellipse.
KNN do rectangles (KNN 4.1).

We want the probability that both intervals are correct to be (at least) .95.
The basic idea is an error budget ($\alpha = .05$).
Spend half on $\beta_0$ (.025) and half on $\beta_1$ (.025).
We use $\alpha = .025$ for the $\beta_0$ CI (97.5% CI) and $\alpha = .025$ for the $\beta_1$ CI (97.5% CI).
So we use
$b_1 \pm t^* s(b_1)$
$b_0 \pm t^* s(b_0)$
where $t^* = t(.9875,\, n - 2)$; $.9875 = 1 - (.05)/(2 \cdot 2)$.

Note that we start with a 5% error budget and we have two intervals, so we give 2.5% to each.
Each interval has two ends, so we again divide by 2.
So, $.9875 = 1 - (.05)/(2 \cdot 2)$.

Let the two intervals be $I_1$ and $I_2$.
We will use cor (= correct) if the interval contains the true parameter value, inc (= incorrect) if not.

P(both cor) = 1 - P(at least one inc)
P(at least one inc) = P($I_1$ inc) + P($I_2$ inc) - P(both inc) $\le$ P($I_1$ inc) + P($I_2$ inc)
So P(both cor) $\ge$ 1 - (P($I_1$ inc) + P($I_2$ inc)).

So if we use .05/2 for each interval,
1 - (P($I_1$ inc) + P($I_2$ inc)) = 1 - .05 = .95
So P(both cor) is at least .95.
We will use this idea when we do multiple comparisons in ANOVA.
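The error-budget arithmetic can be written as two small helper functions. The names are hypothetical; the sketch only reproduces the bookkeeping (which t percentile to look up and the lower bound on joint coverage), not the t quantile itself.

```python
def bonferroni_percentile(alpha, g):
    """Percentile of the t distribution to use for each of g two-sided intervals."""
    return 1 - alpha / (2 * g)


def joint_coverage_lower_bound(error_rates):
    """Bonferroni bound: P(all correct) >= 1 - sum of the individual error rates."""
    return 1 - sum(error_rates)


# Two intervals (for beta0 and beta1) sharing a 5% error budget.
percentile = bonferroni_percentile(0.05, 2)
bound = joint_coverage_lower_bound([0.025, 0.025])
print(round(percentile, 4), round(bound, 4))
```

The bound holds whether or not the two intervals are independent, which is why it is convenient for estimates such as $b_0$ and $b_1$ that are correlated.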

For simultaneous estimation of the mean response for all $X_h$, use Working-Hotelling (KNN 2.6):
$\hat{Y}_h \pm W s(\hat{Y}_h)$
where $W^2 = 2F(1 - \alpha;\, 2, n - 2)$.

For simultaneous estimation for a few (g) $X_h$, use Bonferroni:
$\hat{Y}_h \pm B s(\hat{Y}_h)$
where $B = t(1 - \alpha/(2g),\, n - 2)$.

For simultaneous prediction for a few (g) $X_h$, use Bonferroni:
$\hat{Y}_h \pm B s(\text{pred})$
where $B = t(1 - \alpha/(2g),\, n - 2)$,
or Scheffe:
$\hat{Y}_h \pm S s(\text{pred})$
where $S^2 = gF(1 - \alpha;\, g, n - 2)$.

Regression through the origin

$$Y_i = \beta_1 X_i + \varepsilon_i$$

How to set it up in your stat software:
Check "Constant is Zero" (Excel)
Uncheck "Fit Intercept" in Options (Minitab)
Uncheck "Include Const. in Eq." in Options (SPSS)
NOINT option in PROC REG (SAS)

Generally not a good idea.
Problems with $r^2$ and other statistics.
See cautions, KNN p63

Measurement error
For Y, this is usually not a problem.
For X, we can get biased estimators of our regression parameters.
See KNN 4.5, pp6466

Inverse prediction
Sometimes called calibration.
Given $Y_h$, predict the corresponding value of X, $\hat{X}_h$.
Solve the fitted equation for $X_h$:

$$\hat{X}_h = \frac{Y_h - b_0}{b_1}, \qquad b_1 \neq 0$$

An approximate CI can be given, see KNN, p67

Choice of X levels
Look at the formulas for the variances of the estimators of interest.
Usually we find $\sum (X_i - \bar{X})^2$ in a denominator.
So we want to spread out the values of X.

Read KNN 4.1 to 4.6, read problems on pp 775
Next class we will do all of this with vectors and matrices so that we can generalize to multiple regression.
If you are rusty in Linear Algebra: REVIEW KNN 5.1 to 5.7

Applied Regression Analysis
KNN Ch. 5

Definition of a Matrix: A matrix is a rectangular array of elements arranged in rows and columns.
Vector: A matrix containing only one column.
Transpose: when rows become columns and columns become rows.
Equality of Matrices: Two matrices A and B are equal if they have the same dimension and all corresponding elements are equal.
Addition and Subtraction: The sum of two matrices is another matrix having elements that are the sums of the corresponding elements in the two matrices.
Multiplication of a matrix by a scalar.
Multiplication of a matrix by a matrix.
Identity matrix I ($r \times r$) and J ($r \times r$) matrix.
Rank of a matrix: the maximum number of linearly independent columns.
Inverse of a matrix: $A^{-1}A = AA^{-1} = I$

Matrix Approach to Simple Linear Regression

Why important?
Concise representation
Very useful for multiple regression
Easy to program and analyze large data sets in powerful software such as SAS, MATLAB etc. Excel and Minitab also have matrix capabilities.

Model and data set representation

$$\underset{n \times 1}{Y} = \underset{n \times 2}{X}\; \underset{2 \times 1}{\beta} + \underset{n \times 1}{\varepsilon}$$

Note that these are matrices.
All the properties of the simple linear regression model can be derived with this representation.

Some important formulae in matrix format

$$Q = (Y - X\beta)'(Y - X\beta)$$
$$b = (X'X)^{-1}X'Y$$
$$\hat{Y} = Xb = HY, \qquad H = X(X'X)^{-1}X'$$
$$e = Y - \hat{Y} = (I - H)Y$$
$$SSE = e'e = Y'(I - H)Y = Y'Y - b'X'Y$$
$$SSTO = Y'Y - \frac{1}{n}Y'JY = Y'\left[I - \frac{1}{n}J\right]Y$$
$$SSR = SSTO - SSE = b'X'Y - \frac{1}{n}Y'JY = Y'\left[H - \frac{1}{n}J\right]Y$$

where H is idempotent. It is the Hat Matrix. J is a square matrix of all 1's.
Quadratic Forms! Each of the A matrices is symmetric.

Derivation of b:

$$Q = (Y - X\beta)'(Y - X\beta) = Y'Y - 2\beta'X'Y + \beta'X'X\beta$$
$$\frac{dQ}{d\beta} = -2X'Y + 2X'X\beta$$

This derivative becomes zero at b, where:

$$-2X'Y + 2X'Xb = 0$$
$$X'Xb = X'Y$$
$$(X'X)^{-1}(X'X)b = (X'X)^{-1}X'Y$$
$$b = (X'X)^{-1}X'Y$$
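The matrix formulae can be checked on a small example using plain Python lists, so that no matrix library is needed. The data and helper names below are hypothetical, and the 2 x 2 inverse is written out by hand only because simple linear regression has exactly two columns in X.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical data; X gets a leading column of 1s for the intercept.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
X = [[1.0, x] for x in xs]
Y = [[y] for y in ys]

Xt = transpose(X)
XtX_inv = inverse_2x2(matmul(Xt, X))
b = matmul(XtX_inv, matmul(Xt, Y))        # b = (X'X)^-1 X'Y
H = matmul(matmul(X, XtX_inv), Xt)        # hat matrix H = X (X'X)^-1 X'
Y_hat = matmul(H, Y)                      # fitted values HY
e = [[y[0] - f[0]] for y, f in zip(Y, Y_hat)]
sse = matmul(transpose(e), e)[0][0]       # SSE = e'e

HH = matmul(H, H)                         # idempotency check: HH should equal H
max_diff = max(abs(HH[i][j] - H[i][j]) for i in range(len(H)) for j in range(len(H)))
print(b, round(sse, 4), max_diff)
```

For these data the matrix solution agrees with the scalar formulas ($b_1$ = 1.99, $b_0$ = 0.05 up to rounding).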


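[Worked numerical example of $b = (X'X)^{-1}X'Y$, giving $b_0$ and $b_1$]

KNN Ch. 6
CC Ch

Multiple Regression: An Extension of Simple Linear Regression

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$$

Interpretation of the parameters is important: for example, how would you interpret $\beta_1$ in the above model?
Can be expressed in short form as

$$Y_i = \beta_0 + \sum_{k=1}^{p-1} \beta_k X_{ik} + \varepsilon_i$$

The geometric interpretation is a Response Surface.

Meaning of the Y-intercept $\beta_0$:
If the scope of the model includes $X_1 = 0$, $X_2 = 0$, etc., then $\beta_0$ is the mean response $E\{Y\}$ at $X_1 = 0$, $X_2 = 0$, etc. Otherwise, the Y-intercept has no particular meaning.

Meaning of the slope $\beta_1$:
Indicates the change in the mean response $E\{Y\}$ (expected change in Y per unit increase in $X_1$), when $X_2$ and all the other predictors are held constant.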

The Matrix Representation

$$Y = X\beta + \varepsilon$$

Also covers:
Polynomial models
Qualitative variables
Non-linear? Is this allowed?

Formulae for Simple Regression Apply
H is idempotent. It is the Hat Matrix.
Quadratic Forms! Each of the A matrices is symmetric.

$$\underset{p \times 1}{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{p-1} \end{bmatrix} = (X'X)^{-1}X'Y$$

Tests, Estimation and Diagnostics
All tests and diagnostics are similar to simple regression:
F-test for regression
$R^2$ and Adjusted $R^2$
Estimation of Mean Response and Prediction of New Observation
Simultaneous CIs for Several Mean Responses: Working-Hotelling or Bonferroni (See page 34)
Prediction of the Mean of m new observations at $X_h$
Prediction of g new observations: Scheffe or Bonferroni (See page 35)
3-D scatter plots
Residual Plots
Correlation test for Normality
Brown-Forsythe (Modified Levene) test for heteroscedasticity
Breusch-Pagan test for heteroscedasticity
F-test for lack of fit
Finally, the Box-Cox procedure as a remedial measure

Caution should be exercised for the prediction not to fall outside of the scope of the model (observed range of the predictor variables). The point shown below is within the ranges of $X_1$ and $X_2$ individually, but is well outside the joint region of observations.
What to do? Wait until we get to Leverage values (KNN ch. 0)

[Figure: region covered by $X_1$ and $X_2$ jointly, with the individual range of each variable marked]

A Different Perspective
A bivariate MR model with standardized variables:

$$\hat{z}_Y = \beta_{Y1.2}\, z_1 + \beta_{Y2.1}\, z_2$$

where the $\beta$'s are standardized partial regression coefficients, given as

$$\beta_{Y1.2} = \frac{r_{Y1} - r_{Y2}\, r_{12}}{1 - r_{12}^2}, \qquad \beta_{Y2.1} = \frac{r_{Y2} - r_{Y1}\, r_{12}}{1 - r_{12}^2}$$

Note that $b_1 = \beta_{Y1.2} \cdot s_Y / s_1$ and $b_2 = \beta_{Y2.1} \cdot s_Y / s_2$.
The term partial is used because the terms have been adjusted to allow for the correlation between the independent variables. (Check by substituting $r_{12} = 0$.)
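The standardized partial regression coefficients for two predictors can be computed directly from the three pairwise correlations. The correlation values and the function name below are hypothetical; the second call carries out the suggested check with $r_{12} = 0$.

```python
def standardized_partial_coefficients(r_y1, r_y2, r_12):
    """Return (beta_Y1.2, beta_Y2.1) for a two-predictor standardized model."""
    denom = 1 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2


# Hypothetical correlations among Y, X1 and X2.
beta_1, beta_2 = standardized_partial_coefficients(0.6, 0.5, 0.3)

# With uncorrelated predictors the coefficients reduce to the simple correlations.
beta_1_orth, beta_2_orth = standardized_partial_coefficients(0.6, 0.5, 0.0)
print(round(beta_1, 4), round(beta_2, 4), beta_1_orth, beta_2_orth)
```

When $r_{12} = 0$ no adjustment is needed, so each coefficient equals the corresponding simple correlation with Y.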

A Different Perspective
The Coefficient of Multiple Determination
Semi-partial Correlation Coefficients and Venn Diagrams
Partial Correlation Coefficients and Venn Diagrams
Separating direct, indirect, spurious and entirely indirect effects

Multiple Regression - II
KNN Ch. 7

Extra Sum of Squares
The marginal reduction in SSE when one or several predictor variables are added to the regression model, given that the other variables are already in the model.
In what other, equivalent manner can you state the above?
The word "Extra" is used since we would like to know what the marginal contribution (or extra contribution) is of a variable or a set of variables when added as explanatory variables to the regression model.

Decomposition of SSR into ESS
A pictorial representation is also possible. See page 6, Fig. 7. of KNN

[Figure: SSTO partitioned into $SSR(X_1)$ and $SSE(X_1)$, then into $SSR(X_1, X_2)$ and $SSE(X_1, X_2)$, with $SSR(X_2 \mid X_1)$ as the difference]

Decomposition of SSR into ESS
For two or three explanatory variables the formulae are quite easy.
With two variables we have

$$SSR(X_2 \mid X_1) = SSE(X_1) - SSE(X_1, X_2) = SSR(X_1, X_2) - SSR(X_1)$$

And with three variables,

$$SSR(X_3 \mid X_1, X_2) = SSE(X_1, X_2) - SSE(X_1, X_2, X_3) = SSR(X_1, X_2, X_3) - SSR(X_1, X_2)$$

Note that with three variables we may also have

$$SSR(X_2, X_3 \mid X_1) = SSE(X_1) - SSE(X_1, X_2, X_3)$$

To test the hypothesis $H_0: \beta_3 = 0$ v/s $H_a: \beta_3 \neq 0$, the test statistic is given as

$$F^* = \frac{SSR(X_3 \mid X_1, X_2) / 1}{SSE(X_1, X_2, X_3) / (n - 4)}$$

To test (say) $H_0: \beta_2 = \beta_3 = 0$ v/s $H_a$: not both are 0, the test statistic is given as

$$F^* = \frac{SSR(X_2, X_3 \mid X_1) / 2}{SSE(X_1, X_2, X_3) / (n - 4)}$$

Considering $X_3$ adjusted for $X_1$ and $X_2$ as the predictor, this would be the SSR.
Considering Y adjusted for $X_1$ and $X_2$ as the response variable, this would be the SSTO.
Considering Y adjusted for $X_1$ and $X_2$ as the response variable, and $X_3$ adjusted for $X_1$ and $X_2$ as the predictor, this would be the SSE.

In general, however, we can write

$$F^* = \frac{R_F^2 - R_R^2}{df_R - df_F} \div \frac{1 - R_F^2}{df_F}$$

This form is very convenient to use since we do not have to keep track of the individual sums of squares.
Also, this form will minimize any errors due to subtraction when calculating the SSRs.
On the next page we see the ANOVA table with the decomposition of SSR and three variables.

The ANOVA Table

Source of variation     Sum of squares            df      Mean Squares
Regression              $SSR(X_1, X_2, X_3)$      3       $MSR(X_1, X_2, X_3)$
  $X_1$                 $SSR(X_1)$                1       $MSR(X_1)$
  $X_2 \mid X_1$        $SSR(X_2 \mid X_1)$       1       $MSR(X_2 \mid X_1)$
  $X_3 \mid X_1, X_2$   $SSR(X_3 \mid X_1, X_2)$  1       $MSR(X_3 \mid X_1, X_2)$
Error                   $SSE(X_1, X_2, X_3)$      n - 4   $MSE(X_1, X_2, X_3)$
Total                   SSTO                      n - 1
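The equivalence between the sums-of-squares form and the $R^2$ form of the partial F statistic can be verified numerically. All numbers below (SSTO, the two SSE values, n, and the number of parameters) are hypothetical, as are the function names.

```python
def f_from_sse(sse_reduced, df_reduced, sse_full, df_full):
    """General linear test statistic from error sums of squares."""
    return ((sse_reduced - sse_full) / (df_reduced - df_full)) / (sse_full / df_full)


def f_from_r2(r2_reduced, df_reduced, r2_full, df_full):
    """The same statistic written in terms of R-squared values."""
    return ((r2_full - r2_reduced) / (df_reduced - df_full)) / ((1 - r2_full) / df_full)


# Hypothetical example: n = 50, full model with p = 4 parameters,
# reduced model obtained by dropping one predictor.
n = 50
ssto = 1000.0
sse_full, df_full = 43.0, n - 4
sse_reduced, df_reduced = 52.0, n - 3

r2_full = 1 - sse_full / ssto
r2_reduced = 1 - sse_reduced / ssto

f_sse = f_from_sse(sse_reduced, df_reduced, sse_full, df_full)
f_r2 = f_from_r2(r2_reduced, df_reduced, r2_full, df_full)
extra_ss = sse_reduced - sse_full    # extra sum of squares for the dropped predictor
print(round(f_sse, 4), round(f_r2, 4), extra_ss)
```

The two forms agree because dividing numerator and denominator by SSTO leaves the ratio unchanged.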

Another ANOVA Table (what is the difference?)

Source of variation   Sum of squares         df      Mean Squares
Regression            $SSR(X_1, X_2, X_3)$   3       $MSR(X_1, X_2, X_3)$
  one variable        its SSR                1       its MSR
  remaining two       their extra SSR        2       their MSR
Error                 $SSE(X_1, X_2, X_3)$   n - 4   $MSE(X_1, X_2, X_3)$
Total                 SSTO                   n - 1

An Example

[Minitab output: regression equation; Predictor table with columns Coef, StDev, T, P; S = 80, R-Sq = 95.7%, R-Sq(adj) = 95.6%; Analysis of Variance table with Source, DF, SS, MS, F, P for Regression, Error, Total; sequential sums of squares (Source, DF, Seq SS)]

Test for a single $\beta_k = 0$ in a general model
Full model with all variables:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{k-1} X_{i,k-1} + \beta_k X_{ik} + \beta_{k+1} X_{i,k+1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$$

Compute $SSR(X_1, \ldots, X_{k-1}, X_k, X_{k+1}, \ldots, X_{p-1})$.
Reduced model without $X_k$:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{k-1} X_{i,k-1} + \beta_{k+1} X_{i,k+1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$$

Compute $SSR(X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_{p-1})$.
The test statistic is

$$F^* = \frac{\left[SSR(X_1, \ldots, X_{k-1}, X_k, X_{k+1}, \ldots, X_{p-1}) - SSR(X_1, \ldots, X_{k-1}, X_{k+1}, \ldots, X_{p-1})\right] / 1}{SSE(X_1, \ldots, X_{k-1}, X_k, X_{k+1}, \ldots, X_{p-1}) / (n - p)}$$

An Example

[Minitab output, full model: Predictor table with columns Coef, StDev, T, P; S = 80, R-Sq = 95.7%, R-Sq(adj) = 95.6%; Analysis of Variance table]

[Minitab output, reduced model: Predictor table with columns Coef, StDev, T, P; S = 97, R-Sq = 94.8%, R-Sq(adj) = 94.7%; Analysis of Variance table]

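Test for some $\beta_k = 0$ in a general model
Full model with all variables:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{q-1} X_{i,q-1} + \beta_q X_{iq} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$$

See (7.6) pg. 67 of KNN.
Compute $SSR(X_1, \ldots, X_{q-1}, X_q, \ldots, X_{p-1})$.
Reduced model without the variables $X_q, \ldots, X_{p-1}$:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{q-1} X_{i,q-1} + \varepsilon_i$$

Compute $SSR(X_1, \ldots, X_{q-1})$. Then

$$SSR(X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1}) = SSR(X_1, \ldots, X_{p-1}) - SSR(X_1, \ldots, X_{q-1})$$

OR, as a sum of sequential terms,

$$SSR(X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1}) = SSR(X_q \mid X_1, \ldots, X_{q-1}) + \cdots + SSR(X_{p-1} \mid X_1, \ldots, X_{p-2})$$

The test statistic is

$$F^* = \frac{SSR(X_q, \ldots, X_{p-1} \mid X_1, \ldots, X_{q-1}) / (p - q)}{SSE(X_1, \ldots, X_{p-1}) / (n - p)}$$

or,

$$F^* = \frac{\left(R^2_{Y \mid 1 \ldots p-1} - R^2_{Y \mid 1 \ldots q-1}\right) / (p - q)}{\left(1 - R^2_{Y \mid 1 \ldots p-1}\right) / (n - p)}$$

An Example

[Minitab output, full model: Predictor table with columns Coef, StDev, T, P; S = 80, R-Sq = 95.7%, R-Sq(adj) = 95.6%; Analysis of Variance table]

[Minitab output, reduced model: Predictor table with columns Coef, StDev, T, P; S = 866, R-Sq = 95.3%, R-Sq(adj) = 95.3%; Analysis of Variance table]

Test for $\beta_k = \beta_q$ in a general model
Full model with all variables:

$$Y_i = \beta_0 + \cdots + \beta_k X_{ik} + \cdots + \beta_q X_{iq} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$$

Compute $SSR(X_1, \ldots, X_k, \ldots, X_q, \ldots, X_{p-1})$.
Reduced model with $X_k + X_q$ entered as a single variable with common coefficient $\beta_c$:

$$Y_i = \beta_0 + \cdots + \beta_c (X_{ik} + X_{iq}) + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$$

Compute $SSR(X_1, \ldots, X_k + X_q, \ldots, X_{p-1})$.

$$F^* = \frac{\left[SSR(X_1, \ldots, X_k, \ldots, X_q, \ldots, X_{p-1}) - SSR(X_1, \ldots, X_k + X_q, \ldots, X_{p-1})\right] / 1}{SSE(X_1, \ldots, X_k, \ldots, X_q, \ldots, X_{p-1}) / (n - p)}$$

Also, when testing hypotheses that fix coefficients at given values, or even the above hypothesis $\beta_k = \beta_q$, one can use the General Linear Test approach outlined in KNN.

An Example

[Minitab output, full model: Predictor table with columns Coef, StDev, T, P; S = 80, R-Sq = 95.7%, R-Sq(adj) = 95.6%; Analysis of Variance table]

[Minitab output, reduced model: Predictor table with columns Coef, StDev, T, P; S = 798, R-Sq = 95.7%, R-Sq(adj) = 95.6%; Analysis of Variance table]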

42 Coefficients of Partial Determination
Recall the definition of the coefficient of (multiple) determination: R-sq is the proportionate reduction in Y variation when the set of X variables is considered in the model.
Now consider a coefficient of partial determination: R-sq for a predictor, given the presence of a set of predictors in the model, measures the marginal contribution of each variable given that others are already in the model.
A graphical representation of the strength of the relationship between Y and X1, adjusted for X2, is provided by partial regression plots (see HW6).
Coefficients of Partial Determination
For a model with two independent variables:
$r^2_{Y1.2} = \dfrac{SSR(X_1 \mid X_2)}{SSE(X_2)}$, $\quad r^2_{Y2.1} = \dfrac{SSR(X_2 \mid X_1)}{SSE(X_1)}$
Interpret this.
Generalization is easy, e.g.,
$r^2_{Y1.23} = \dfrac{SSR(X_1 \mid X_2, X_3)}{SSE(X_2, X_3)}$, $\quad r^2_{Y3.124} = \dfrac{SSR(X_3 \mid X_1, X_2, X_4)}{SSE(X_1, X_2, X_4)}$, etc.
Is there an alternate interpretation of the above partial coefficients? What is, say, $r^2_{Y1.23}$?
An Example / Another Example
[Minitab output; numeric entries lost.] S = , R-Sq = .0%, R-Sq(adj) = 0.3%
[Minitab output; numeric entries lost.] S = 80, R-Sq = 95.7%, R-Sq(adj) = 95.6%
[Minitab output with sequential SS; numeric entries lost.] S = 9.86, R-Sq = 94.9%, R-Sq(adj) = 94.9%
[Minitab output; numeric entries lost.] S = 80, R-Sq = 95.6%, R-Sq(adj) = 95.5%
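One possible answer to the "alternate interpretation" question can be checked numerically: the coefficient of partial determination between Y and X1 given X2 equals the squared correlation between the residuals of Y on X2 and the residuals of X1 on X2. The sketch below uses simulated data; all variable names and coefficients are illustrative assumptions.

```python
import numpy as np

def residuals(X, y):
    """Residuals from an OLS fit of y on X (X includes the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Hypothetical data with two correlated predictors.
rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(size=n)
ones = np.ones(n)

# Definition: r^2_{Y1.2} = SSR(X1 | X2) / SSE(X2)
e_y_given_x2 = residuals(np.column_stack([ones, x2]), y)
e_full = residuals(np.column_stack([ones, x1, x2]), y)
sse_x2 = e_y_given_x2 @ e_y_given_x2
sse_full = e_full @ e_full
r2_y1_2 = (sse_x2 - sse_full) / sse_x2

# Alternate view: squared correlation between the two sets of residuals,
# i.e. what a partial regression plot displays.
e_x1_given_x2 = residuals(np.column_stack([ones, x2]), x1)
r2_alt = np.corrcoef(e_y_given_x2, e_x1_given_x2)[0, 1] ** 2
```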

43 The Standardized Multiple Regression Model
Why necessary?
- Round-off errors in normal equations calculations (especially when inverting a large X'X matrix). What is the size of this inverse for, say, $Y = b_0 + b_1 X_1 + \dots$?
- Lack of comparability of coefficients in regression models (differences in units involved).
- Especially important in presence of multicollinearity. The determinant of the X'X matrix is close to zero in this case.
OK. So we have a problem. How do we take care of it?
- The Correlation Transformation:
- Centering: take the difference between each observation and the average, AND
- Scaling: divide the centered observation by the standard deviation of the variable.
You must have noticed that this is nothing but regular standardization? What is the twist? See next slide.
The Standardized Mult. Regression Model
Standardization: $\dfrac{Y_i - \bar{Y}}{s_Y}$, $\quad \dfrac{X_{ik} - \bar{X}_k}{s_k}$ $(k = 1, \dots, p-1)$
Correlation Transformation: $Y_i' = \dfrac{1}{\sqrt{n-1}} \left( \dfrac{Y_i - \bar{Y}}{s_Y} \right)$, $\quad X_{ik}' = \dfrac{1}{\sqrt{n-1}} \left( \dfrac{X_{ik} - \bar{X}_k}{s_k} \right)$ $(k = 1, \dots, p-1)$
The Standardized Mult. Regression Model
Once we have performed the Correlation Transformation, all that remains is to obtain the new regression parameters. The standardized regression model is:
$Y_i' = \beta_1' X_{i1}' + \dots + \beta_{p-1}' X_{i,p-1}' + \varepsilon_i'$
where the original parameters can be had from the transformation
$\beta_k = \left( \dfrac{s_Y}{s_k} \right) \beta_k'$ $(k = 1, \dots, p-1)$ and $\beta_0 = \bar{Y} - \beta_1 \bar{X}_1 - \dots - \beta_{p-1} \bar{X}_{p-1}$
In matrix notation we have some interesting relationships:
$X'X$ [$(p-1) \times (p-1)$] $= r_{XX}$, the correlation matrix of the (untransformed) X variables
$X'Y$ [$(p-1) \times 1$] $= r_{YX}$, the vector of correlations between (untransformed) Y and X
WHY? Is this surprising?
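A short numerical sketch of the correlation transformation may help with the "WHY?" question. It assumes simulated data with deliberately mismatched units; the function name `corr_transform` and the data-generating values are made up for the illustration.

```python
import numpy as np

def corr_transform(v):
    """Center, scale by the sample s.d., and divide by sqrt(n - 1)."""
    return (v - v.mean(axis=0)) / (np.sqrt(len(v) - 1) * v.std(axis=0, ddof=1))

# Hypothetical data: two predictors measured in very different units.
rng = np.random.default_rng(2)
n = 30
X = rng.normal(size=(n, 2)) * np.array([10.0, 0.1]) + np.array([50.0, 3.0])
y = 4.0 + 0.3 * X[:, 0] + 20.0 * X[:, 1] + rng.normal(size=n)

Xs, ys = corr_transform(X), corr_transform(y)
r_xx = Xs.T @ Xs                       # equals the correlation matrix of X
r_yx = Xs.T @ ys                       # equals the correlations between y and each X
b_std = np.linalg.solve(r_xx, r_yx)    # standardized coefficients (no intercept)

# Back-transform to the original units.
b = b_std * y.std(ddof=1) / X.std(axis=0, ddof=1)
b0 = y.mean() - b @ X.mean(axis=0)
```

The back-transformed `b0` and `b` agree with an ordinary least-squares fit on the untransformed variables; only the scale of the coefficients changes, not the fitted model.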

44 An Example
Part of the original (unstandardized) data set
[Minitab output; numeric entries lost.] S = 80, R-Sq = 95.7%, R-Sq(adj) = 95.6%
An Example (continued)
Standardized and then Correlation Transformed
[Data listing lost.]
An Example (continued)
[Minitab output; numeric entries lost.] S = , R-Sq = 95.7%, R-Sq(adj) = 95.6%
Compared to the regression model obtained from the untransformed variables, what can we say about the two models? Is there a difference in predictive power, or is there a difference in ease of interpretation? Why is b0 = 0? Just by chance?
Multicollinearity
The OLS model requires that no predictor be an exact linear combination of the others, and its coefficients are easiest to interpret when the predictor variables are uncorrelated. When the predictors are correlated among themselves, multicollinearity is said to exist. (Think about Venn diagrams for this.)
Note that multicollinearity is strictly a sample phenomenon. We may try to avoid it by doing controlled experiments, but in most social sciences research this is very difficult to do.
Let us first consider the case of uncorrelated predictor variables, i.e., no multicollinearity.
- Usually occurs in controlled experiments
- In this case the R-sq between each pair of variables is zero
- The ESS (extra sum of squares) for each variable is the same as when the response is regressed on that variable alone.

45 An Example
[Minitab output with sequential SS; numeric entries lost.] S = .9, R-Sq = 5.%, R-Sq(adj) = 33.0%
An Example (continued)
[Minitab output; numeric entries lost.] S = 3.0, R-Sq = 0.9%, R-Sq(adj) = 0.0%
[Minitab output with sequential SS (from previous slide); numeric entries lost.] S = ., R-Sq = 5.3%, R-Sq(adj) = 43.%
Multicollinearity (Effects of)
The regression coefficient of any independent variable cannot be interpreted as usual. One has to take into account which other correlated variables are included in the model.
The predictive ability of the overall model is usually unaffected.
The ESS are usually reduced to a great extent.
The variability of OLS regression parameter estimates is inflated. (Let us see an intuitive reason for this based on a model with p-1 = 2.)
$\sigma^2\{b_1'\} = \sigma^2\{b_2'\} = \dfrac{(\sigma')^2}{1 - r_{12}^2}$
Note that the standardized regression coefficients have equal standard deviations. Will this be the case even when p-1 = 3? Or is this just a special case scenario?
Multicollinearity (Effects of)
High R-sq, but few significant t-ratios. (By now, you should be able to guess the reason for this.)
Wider individual confidence intervals for regression parameters. (This is obvious based on what we discussed on the earlier slide.)
e.g. [Figure lost.] What would you conclude based on the above picture?

46 Multicollinearity (How to detect it?)
High R-sq (>0.8), but few significant t-ratios.
Caveat: There is a particular situation when the above is caused without any multicollinearity. Thankfully this situation never arises in practice.
High pair-wise correlation (>0.8) between independent variables.
Caveat: This is a sufficient, but not necessary condition. For example, consider the case where $r_{12} = 0.5$, $r_{13} = 0.5$ and $r_{23} = -0.5$. We may conclude, no multicollinearity. However, we find that $R^2 = 1$ when we regress $X_1$ on $X_2$ and $X_3$ together. This means that $X_1$ is a perfect linear combination of the two other independent variables. In fact the formula for the R-sq is given as
$R^2_{1.23} = \dfrac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2}$
and one can readily verify that the numbers satisfy this equation.
Due to the above caveat, always examine the partial correlation coefficients.
Multicollinearity (How to detect it?)
Run auxiliary regressions, i.e. regress each of the independent variables on the other independent variables taken together and conclude whether it is correlated to the others or not based on the R-sq. The test statistic is
$F = \dfrac{R^2_{x_k \cdot x_1, \dots, x_{p-1}} / (p-2)}{(1 - R^2_{x_k \cdot x_1, \dots, x_{p-1}}) / (n-p+1)}$
The Condition Index (CI): $CI = \sqrt{\dfrac{\text{Maximum Eigenvalue}}{\text{Minimum Eigenvalue}}}$
If $10 \le CI \le 30$: moderate to strong multicollinearity. CI > 30 means severe multicollinearity.
Multicollinearity (What is the remedy?)
Rely on joint confidence intervals rather than individual ones.
A priori information of relationship between some independent variables? Then include it! For example: $b_1 = b_2$ is known. Then use this in the regression model, which then becomes $Y = b_0 + b_1 Z$ (where $Z = X_1 + X_2$).
Data pooling (usually done by combining cross-sectional and time series data). Time series data is notorious for multicollinearity.
Multicollinearity (What is the remedy?)
Delete a variable which is causing problems.
Caveat: Beware of specification bias. This arises when a model is incorrectly specified. For example, in order to explain consumption expenditure, we may only include income and drop wealth since it is highly correlated to income. However, economic theory may postulate that you use both variables.
First difference transformation of variables from time series data: the regression is run on differences between successive values of variables rather than the original variables, $(X_{1,i} - X_{1,i+1})$ and $(X_{2,i} - X_{2,i+1})$, etc. The logic is that even if $X_1$ and $X_2$ are correlated, there is no reason for their first differences to be correlated too.
Caveat: Beware of autocorrelation, which usually arises due to this procedure. Also, we lose one degree of freedom due to the difference procedure.
Correlation transformation.
Getting a new sample (Why?) and/or increasing sample size (Why?).
Factor Analysis, Principal Components Analysis, Ridge Regression.
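The caveat about modest pairwise correlations hiding an exact linear dependence can be verified directly. The sketch below plugs the slide's values (0.5, 0.5, -0.5) into the R-sq formula and computes a condition index from the eigenvalues of the correlation matrix; the function names and the numerical zero tolerance of 1e-12 are assumptions of this illustration.

```python
import numpy as np

def r2_from_pairwise(r12, r13, r23):
    """R-sq of X1 regressed on X2 and X3, from the pairwise correlations."""
    return (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)

def condition_index(corr):
    """sqrt(max eigenvalue / min eigenvalue) of a correlation matrix."""
    eig = np.linalg.eigvalsh(corr)
    if eig.min() <= 1e-12:          # numerically singular: exact collinearity
        return float("inf")
    return float(np.sqrt(eig.max() / eig.min()))

r2 = r2_from_pairwise(0.5, 0.5, -0.5)   # evaluates to 1 up to rounding

R = np.array([[1.0, 0.5, 0.5],
              [0.5, 1.0, -0.5],
              [0.5, -0.5, 1.0]])
ci = condition_index(R)                 # infinite: R is singular
```

No pairwise correlation exceeds 0.5 in absolute value, yet the correlation matrix is singular, which is why the pairwise screen alone is not a necessary condition.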

47 An Example
Correlation between Pop and Income: r = .997
An Example (continued)
[Minitab output with sequential SS for both orders of entry and predicted values (Fit, StDev Fit, 95.0% CI, 95.0% PI); numeric entries lost.] S = .87, R-Sq = 95.3%, R-Sq(adj) = 95.%
High R-sq. Low t-value for one coefficient $b_k$. Low ESS for $X_k$ (i.e. SSR($X_k$ | the other predictor)). Clearly, $X_k$ contributes little to the model. Really? Look at SSR($X_k$) on its own... it's humungous!! Clear case of multicollinearity. Of course we knew that r = .997. This should have made us suspect that something was amiss.
Multicollinearity (Specification Bias)
Types of Specification Errors:
- Omitting a relevant variable
- Including an unnecessary or irrelevant variable
- Incorrect functional form
- Errors of measurement bias
- Incorrect specification of stochastic error term (this is a model mis-specification error)
More on omitting a relevant variable (under-fitting)
True model: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
Fitted model: $Y = \alpha_0 + \alpha_1 X_1 + v$
Consequences of omission:
1. If $r_{12}$ is non-zero then the estimators of $\beta_0$ and $\beta_1$ are biased and inconsistent.
2. The variance of the estimator of $\alpha_1$ is a biased estimate of the variance of the estimator of $\beta_1$.
3. $\sigma^2$ is incorrectly estimated, and CIs and hypothesis tests are misleading.
4. $E(\text{estimator of } \alpha_1) = \beta_1 + \beta_2 b_{21}$, where $b_{21}$ is the slope from regressing $X_2$ on $X_1$.
KNN Ch. 8

48 A visual guide to Polynomial Regression and Dummy variables
Quadratic regression model: $\hat{Y} = b_0 + b_1 X + b_2 X^2$
3rd-order regression model: $\hat{Y} = b_0 + b_1 X + b_2 X^2 + b_3 X^3$
Dummy, or indicator, variables allow for the inclusion of qualitative variables in the model.
For example: $I_1 = 1$ if female, $0$ if male

49 Model with indicator variable: $\hat{Y} = b_0 + b_1 X_1 + b_2 I$
Rewrite the model as:
For I = 0, $\hat{Y} = b_0 + b_1 X_1$
For I = 1, $\hat{Y} = (b_0 + b_2) + b_1 X_1$
Model with indicator variable and interaction: $\hat{Y} = b_0 + b_1 X_1 + b_2 I + b_3 X_1 I$
Rewrite the model as:
For I = 0, $\hat{Y} = b_0 + b_1 X_1$
For I = 1, $\hat{Y} = (b_0 + b_2) + (b_1 + b_3) X_1$
The problem: Show the relationship between Price and Square Feet in 3 neighborhoods.
The solution:
- Come up with good dummy variables & describe the case
- Come up with a full model
- Rewrite the full model as many times as necessary
[Data table: Price, Square Feet (Sq. Ft.), N = neighborhood; entries lost.]

50 # of categories - 1 = # dummy variables
For example: to model a qualitative variable with 3 categories, you need 3 - 1 = 2 dummy variables.
Put them into the form of a question (yes or no): 1 if yes, 0 if no.
N2 = 1 if neighborhood 2, 0 otherwise
N3 = 1 if neighborhood 3, 0 otherwise
Interaction & coding
[Code table: N, N2, N3; data table: Price, Sq. Ft., N, N2, N3; entries lost.]


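51 Full model:
$\widehat{price} = \beta_0 + \beta_1 X + \beta_2 N_2 + \beta_3 N_3 + \beta_4 X N_2 + \beta_5 X N_3$, where X is square feet.
If N1 (neighborhood 1): $\widehat{price} = \beta_0 + \beta_1 X$
If N2: $\widehat{price} = \beta_0 + \beta_1 X + \beta_2 (1) + \beta_4 X (1) = (\beta_0 + \beta_2) + (\beta_1 + \beta_4) X$
If N3: $\widehat{price} = \beta_0 + \beta_1 X + \beta_3 (1) + \beta_5 X (1) = (\beta_0 + \beta_3) + (\beta_1 + \beta_5) X$
Rewritten this way to emphasize the intercept and the slope.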
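The rewriting step can be illustrated numerically. The sketch below builds the two dummies and the two interaction columns for simulated price data (the slides' price table did not survive, so every number here is made up) and reads off one intercept and one slope per neighborhood from the single full-model fit.

```python
import numpy as np

# Hypothetical data: 20 houses in each of 3 neighborhoods.
rng = np.random.default_rng(3)
sqft = rng.uniform(1000, 3000, size=60)
hood = np.repeat([1, 2, 3], 20)
true_intercept = {1: 50.0, 2: 80.0, 3: 30.0}
true_slope = {1: 0.10, 2: 0.10, 3: 0.15}
price = np.array([true_intercept[h] + true_slope[h] * s for h, s in zip(hood, sqft)])
price = price + rng.normal(scale=5.0, size=60)

# Dummy coding: neighborhood 1 is the baseline.
n2 = (hood == 2).astype(float)
n3 = (hood == 3).astype(float)
X = np.column_stack([np.ones(60), sqft, n2, n3, sqft * n2, sqft * n3])
b, *_ = np.linalg.lstsq(X, price, rcond=None)

# Rewrite the full model once per neighborhood: (intercept, slope).
lines = {
    1: (b[0], b[1]),
    2: (b[0] + b[2], b[1] + b[4]),
    3: (b[0] + b[3], b[1] + b[5]),
}
```

Because this model has a separate intercept and slope shift for every non-baseline group, the three lines coincide with what three separate simple regressions would give; the single model is still preferable when the goal is to test whether the slopes differ.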

52 Prop: Neighborhoods 1 & 2 have the same slope in the price/square footage relationship.
$H_0: \beta_4 = 0$, $H_a: \beta_4 \ne 0$
There is no significant difference in the slopes (or B4 is not significantly different from zero).
Perform t-test for B4 (want to fail to reject):
- Large P-value: fail to reject null (hypothesis is supported, not proved)
- Small P-value: reject null
Prop: Neighborhoods 1 & 3 have different slopes in the price/square footage relationship.
$H_0: \beta_5 = 0$, $H_a: \beta_5 \ne 0$
B5 is significantly different from zero.
Perform t-test (we want to reject):
- Large P-value: fail to reject null (hypothesis is supported, not proved)
- Small P-value: reject null
(If reject: assume B5 is different from zero and N1 & N3 have different slopes.)
$H_0: \beta_4 = \beta_5$, $H_a: \beta_4 \ne \beta_5$
B4 is significantly different from B5.
Perform t-test (we want to reject):
- Large P-value: fail to reject null (hypothesis is supported, not proved)
- Small P-value: reject null

53 The effect of winning an Oscar on the actor's life expectancy
The players:
1. Oscar statuette
2. Actors winning it
3. Stories about Oscar winners dying at a late age
4. An academic journal
Data collection
Statistical analysis
Some descriptive statistics

54 Life expectancy was 3.9 years longer for Academy Award winners.
We will now try to replicate this study and verify the results, using simple regression analysis techniques.
Our simple analysis failed to find significance in the effect of winning an Oscar. We did find some significance in gender. Apparently, men die before women! Did you ever wonder why?
Building the Regression Model I: Selection and Validation
KNN Ch. 9

55 The Model Building Process
- Collect and prepare data
- Reduction of explanatory variables for exploratory/observational studies
- Refine model and select best model
- Validate model; if it passes the checks then adopt it
All four of the above have several intermediate steps. These are outlined in Fig. 9., page 344 of KNN.
The Model Building Process: Data collection
- Controlled Experiments (levels, treatments)
- With supplemental variables (incorporate uncontrollable variables in the regression model rather than in the experiment)
- Confirmatory Observational Studies (hypothesis testing, primary variables and risk factors)
- Exploratory Observational Studies (measurement errors/problems, duplication of variables, spurious variables and sample size are but some of the issues here)
The Model Building Process: Data Preparation
What are the standard techniques here? It's an easy guess: a rough-cut approach is to look at various plots and identify obvious problems such as outliers, spurious variables, etc.
Preliminary Model Investigation
- Scatter plots and residual plots (For what?)
- Functional forms and transformations (of entire data, or some explanatory variables, or the predicted variable?)
- Interactions and... intuition
The Model Building Process: Reduction of Explanatory Variables
Generally an issue for Controlled Experiments with Supplemental Variables and for Exploratory Observational Studies.
It is not difficult to guess that for Exploratory Observational Studies this is more serious.
Identification of good subsets of the explanatory variables, their functional forms and any interactions is perhaps the most difficult problem in multiple regression analysis.
Need to be careful of specification bias and latent explanatory variables.

56 The Model Building Process: Model Refinement and Selection
- Diagnostics for candidate models
- Lack-of-fit tests if repeat observations are available
- The "best" model's number of variables should be used as a benchmark for investigating other models with a similar number of variables
Model Validation
- Robustness and usability of regression coefficients
- Usability of regression function. Does it all make sense?
All Possible Regressions: Variable Reduction
Usually many explanatory variables (p-1) are present at the outset.
Select the best subset of these variables.
Best: the smallest subset of variables which provides an adequate prediction of Y.
Multicollinearity is usually a problem when all variables are in the model.
Variable selection may be based on the determination coefficient $R^2_p$ or on the $SSE_p$ statistic (equivalent procedures).
All Possible Regressions: Variable Reduction
$R^2_p = 1 - \dfrac{SSE_p}{SSTO}$
$R^2_p$ is highest (and $SSE_p$ lowest) when all the variables are in the model.
One intends to find the point at which adding more variables causes a very small increase in $R^2_p$ or a very small decrease in $SSE_p$.
Given a value of p, we compute the maximum of $R^2_p$ (or minimum of $SSE_p$) and then we compare the several maxima (minima).
See the Surgical Unit Example on page 350 of KNN.
A Simple Example
[Minitab regression output; numeric entries lost.] S = .80, R-Sq = 95.7%, R-Sq(adj) = 95.6%
[Minitab regression output; numeric entries lost.] S = .80, R-Sq = 95.6%, R-Sq(adj) = 95.5%
[Minitab regression output; numeric entries lost.] S = .866, R-Sq = 95.3%, R-Sq(adj) = 95.3%

57 All Possible Regressions: Variable Reduction
$R^2_p$ does not take into account the number of parameters (p) and never decreases as p increases. This is a mathematical property, but it may not make sense practically.
However, useless explanatory variables can actually worsen the predictive power of the model. How?
The adjusted coefficient of multiple determination always accounts for the increased p:
$R^2_a = 1 - \dfrac{SSE_p/(n-p)}{SSTO/(n-1)} = 1 - \left( \dfrac{n-1}{n-p} \right) \dfrac{SSE_p}{SSTO}$
The $R^2_a$ and $MSE_p$ criteria are equivalent.
When can $MSE_p$ actually increase with p?
A Simple Example
[Minitab regression output; numeric entries lost.] S = .878, R-Sq = 99.3%, R-Sq(adj) = 97.%
[Minitab regression output; numeric entries lost.] S = .603, R-Sq = 98.8%, R-Sq(adj) = 97.7%
[Minitab regression output; numeric entries lost.] S = 5.86, R-Sq = 9.%, R-Sq(adj) = 88.3%
Interesting.
All Possible Regressions: Variable Reduction
The $C_p$ criterion is concerned with the total MSE of the n fitted values.
Total error for any fitted value is a sum of bias and random error components: $\hat{Y}_i - \mu_i$ is the total error, where $\mu_i$ is the true mean response of Y when $X = X_i$.
The bias is $E\{\hat{Y}_i\} - \mu_i$ and the random error is $\hat{Y}_i - E\{\hat{Y}_i\}$.
Then the total mean squared error is shown to be:
$\sum_{i=1}^{n} \left[ (E\{\hat{Y}_i\} - \mu_i)^2 + \sigma^2\{\hat{Y}_i\} \right]$
When the above is divided by the variance of the actual Y values, i.e., by $\sigma^2$, we get the criterion $\Gamma_p$.
The estimator of $\Gamma_p$ is what we shall use: $C_p$.
All Possible Regressions: Variable Reduction
$C_p = \dfrac{SSE_p}{MSE(X_1, \dots, X_{P-1})} - (n - 2p)$
Choose a model with small $C_p$.
$C_p$ should be as close as possible to p. When all variables are included then obviously $C_p = p$ (= P).
If the model has very little bias then $E(\hat{Y}_i) \approx \mu_i$ and $E(C_p) \approx p$.
When we plot a line through the origin at 45 degrees and plot the $(p, C_p)$ points, then for models with little bias the points will fall almost on the straight line; for models with substantial bias the points will fall much above the line; and if the points fall below the line then such models have no bias but just some random sampling error.
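The three criteria on these slides can be computed side by side for every subset of a small predictor pool. This is a rough sketch on simulated data; the function names, the data, and the choice to rank by smallest $C_p$ are assumptions made for the illustration.

```python
import itertools
import numpy as np

def sse_of(cols, X, y):
    """SSE of the OLS fit of y on an intercept plus the listed columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def all_subsets(X, y):
    """R-sq, adjusted R-sq and Cp for every non-empty subset of predictors."""
    n, k = X.shape
    ssto = float(((y - y.mean()) ** 2).sum())
    mse_full = sse_of(range(k), X, y) / (n - k - 1)   # MSE with all P-1 predictors
    rows = []
    for size in range(1, k + 1):
        for cols in itertools.combinations(range(k), size):
            p = size + 1                               # parameters incl. intercept
            sse = sse_of(cols, X, y)
            rows.append({
                "cols": cols,
                "p": p,
                "r2": 1 - sse / ssto,
                "r2_adj": 1 - (n - 1) / (n - p) * sse / ssto,
                "cp": sse / mse_full - (n - 2 * p),
            })
    return rows

# Hypothetical data: only the first two of three predictors matter.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=30)
rows = all_subsets(X, y)
best = min(rows, key=lambda r: r["cp"])
```

For the subset containing every predictor, the formula reduces to $C_p = p$ exactly, which matches the remark on the slide.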

58 All Possible Regressions: Variable Reduction
The $PRESS_p$ criterion: $PRESS_p = \sum_{i=1}^{n} (Y_i - \hat{Y}_{i(i)})^2$
$\hat{Y}_{i(i)}$ is the predicted value of $Y_i$ when the i-th observation is not in the dataset.
Choose models with small values of $PRESS_p$.
It may seem that one will have to run n separate regressions in order to calculate $PRESS_p$. Not so, as we will see later.
Best Subsets Algorithm
- Best subsets (a limited number) are identified according to pre-specified criteria.
- Requires much less computational effort than evaluating all possible subsets.
- Provides good subsets along with the best, which is quite useful.
- When the pool of variables is large, this algorithm can run out of steam. What then? We will see in the ensuing discussion.
A Simple Example
Best Subsets Regression (Note: s is the square root of $MSE_p$)
[Two Minitab best-subsets tables with columns Vars, R-Sq, Adj. R-Sq, C-p, s; entries lost.]
Forward Stepwise Regression
An iterative procedure. Based on the partial F* or t* statistic one decides whether to add a variable or not. One variable at a time is considered.
Before we see the actual algorithm, here are some levers:
- Minimum acceptable F to enter ($F_E$)
- Minimum acceptable F to remove ($F_R$)
- Minimum acceptable tolerance ($T_{min}$)
- Maximum number of iterations (N)
And here is the general form of the test statistic:
$F^*_k = \dfrac{MSR(X_k \mid \text{other X's already in the model})}{MSE(X_k, \text{other X's already in the model})} = \left( \dfrac{b_k}{s\{b_k\}} \right)^2$
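The remark that PRESS does not need n separate regressions can be checked with the deleted-residual identity $d_i = e_i/(1 - h_{ii})$. The sketch below compares that shortcut with a brute-force leave-one-out loop on simulated data; function names and data are illustrative assumptions.

```python
import numpy as np

def press(X, y):
    """PRESS from a single fit, using deleted residuals e_i / (1 - h_ii)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix
    e = y - H @ y
    d = e / (1 - np.diag(H))
    return float(d @ d)

def press_brute(X, y):
    """PRESS by refitting the model n times, leaving one case out each time."""
    total = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        total += (y[i] - X[i] @ beta) ** 2
    return float(total)

# Hypothetical data: intercept plus two predictors.
rng = np.random.default_rng(5)
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = 1.0 + 2.0 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(size=n)

press_fast = press(X, y)
press_slow = press_brute(X, y)
```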

59 Forward Stepwise Regression
The procedure:
1. Run a simple linear regression of Y on each X variable separately.
2. If none of the individual F values is larger than the cut-off $F_E$ value, then stop. Else, enter the variable with the largest F.
3. Now run the regression of Y on each remaining variable, given that the variable entered in step 2 is already in the model.
4. Repeat step 2. If a candidate is found, then check for tolerance. If the tolerance ($1 - R^2_k$) is not larger than the cut-off tolerance value $T_{min}$, then choose a different candidate. If none is available, then terminate. Else, add the candidate variable.
5. Calculate the partial F for the variable entered in step 2, given that the variable entered in step 4 is already in the model. Check if this F is less than $F_R$. If so, then remove the variable entered in step 2. Else keep it. Check if the number of iterations is equal to N. If yes, terminate. If not, return to the entry step and check which is the next candidate variable to enter. If the number of iterations is exceeded, then terminate.
Other Stepwise Regression Procedures
- Backward Stepwise Regression: exact opposite of the forward procedure. Sometimes preferred to forward stepwise. Think about how this procedure would work. Why, or under which conditions, would you use it instead of forward stepwise?
- Forward Selection: similar to forward stepwise, except that the variable dropping part is not present.
- Backward Elimination: similar to backward stepwise, except that the variable adding part is not present.
An Example
Let us go through the example (Fig. 9.7 on page 366 of KNN).
Akaike Information Criteria (AIC)
- Impose a penalty for adding regressors
- $AIC = e^{2p/n} \, SSE_p / n$, where 2p/n is the penalty factor
- Harsher penalty than $R^2_a$ (How?)
- Model with lowest AIC is preferred
- AIC is used for in-sample and out-of-sample forecasting performance measurement
- Useful for nested and non-nested models and for determining lag length in autoregressive models
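The entry half of the procedure can be sketched as follows. This is a simplified forward selection: it implements the F-to-enter rule only and omits the tolerance check and the F-to-remove step, so it is not the full stepwise algorithm above. The cut-off `f_enter=4.0`, the function names and the simulated data are assumptions of this example.

```python
import numpy as np

def sse_of(cols, X, y):
    """SSE of the OLS fit of y on an intercept plus the listed columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def forward_select(X, y, f_enter=4.0, max_iter=10):
    """Add, one at a time, the variable with the largest partial F* >= f_enter."""
    n, k = X.shape
    chosen = []
    for _ in range(max_iter):
        sse_now = sse_of(chosen, X, y)
        best_f, best_j = 0.0, None
        for j in range(k):
            if j in chosen:
                continue
            sse_new = sse_of(chosen + [j], X, y)
            df_error = n - (len(chosen) + 2)   # intercept + chosen + candidate
            # Partial F* = MSR(X_j | chosen) / MSE(X_j, chosen)
            f_star = (sse_now - sse_new) / (sse_new / df_error)
            if f_star > best_f:
                best_f, best_j = f_star, j
        if best_j is None or best_f < f_enter:
            break                               # nothing left that clears F_E
        chosen.append(best_j)
    return chosen

# Hypothetical data: columns 0 and 2 drive the response, 1 and 3 are noise.
rng = np.random.default_rng(6)
X = rng.normal(size=(80, 4))
y = 1.0 + 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=80)
selected = forward_select(X, y)
```

A noise column can still clear the entry threshold by chance, which is one reason the full procedure also re-checks previously entered variables against $F_R$.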

60 Schwarz Information Criteria (SIC)
- $SIC = n^{p/n} \, SSE_p / n$
- Similar to AIC
- Imposes a stricter penalty than AIC
- Has similar advantages as AIC
Model Validation
Checking the prediction ability of the model. Methods for model validation:
1. Collection of new data:
- We select a new sample, of dimension $n^*$, with the same variables;
- Compute the mean squared prediction error: $MSPR = \dfrac{\sum_{i=1}^{n^*} (Y_i - \hat{Y}_i)^2}{n^*}$
2. Comparison of results with theoretical expectations;
3. Data splitting in two data sets: model building and validation.
Outlying Observations
KNN Ch. 10
At times data sets have observations that are outlying or extreme. These outliers usually have a strong effect on the regression analysis. We have to identify such observations and then decide if they need to be eliminated or if their influence needs to be reduced.
When dealing with more than one variable, simple plots (boxplots, scatterplots, etc.) may not be useful to identify outliers, and we have to use the residuals or functions of residuals. We will now look at some of these functions.
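The two information criteria and the MSPR are simple enough to write down directly from the formulas on these slides. The sketch below is plain Python; the SSE, n and p values in the usage lines are made-up numbers, not results from the course data.

```python
import math

def aic(sse, n, p):
    """AIC in the form used on the slides: e^(2p/n) * SSE_p / n."""
    return math.exp(2 * p / n) * sse / n

def sic(sse, n, p):
    """SIC in the form used on the slides: n^(p/n) * SSE_p / n."""
    return n ** (p / n) * sse / n

def mspr(y_new, y_pred):
    """Mean squared prediction error over a validation sample."""
    return sum((a - b) ** 2 for a, b in zip(y_new, y_pred)) / len(y_new)

# Hypothetical values: SSE = 100 from a model with p = 4 parameters, n = 50 cases.
aic_value = aic(100.0, 50, 4)
sic_value = sic(100.0, 50, 4)
validation_error = mspr([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Comparing the two multipliers shows why SIC is the stricter criterion: $n^{p/n} > e^{2p/n}$ whenever $\ln n > 2$, i.e. for any sample with more than about 7 observations.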

61 Residuals and Semistudentized Residuals
Previously, we examined:
Residuals: $e_i = Y_i - \hat{Y}_i$
Semistudentized residuals: $e_i^* = \dfrac{e_i}{\sqrt{MSE}}$
We will now introduce a few refinements that are more effective in identifying Y outliers. First we need to recall the hat matrix.
Leverages
We previously defined the hat matrix as $H = X (X'X)^{-1} X'$.
Using the hat matrix, $\hat{Y} = HY$ and $e = (I - H) Y$.
The diagonal elements of the hat matrix, $h_{ii}$, $0 \le h_{ii} \le 1$, are called leverages.
These are used to detect influential X observations.
Leverage values are useful for detecting hidden extrapolations when p > 3.
Measures for Y-outlier detection
An estimator of the standard deviation of the i-th residual is $s\{e_i\} = \sqrt{MSE (1 - h_{ii})}$.
Therefore, dividing each residual by its standard deviation we obtain the
Studentized residuals: $r_i = \dfrac{e_i}{\sqrt{MSE (1 - h_{ii})}}$
Measures for Y-outlier detection
Another effective measure for Y-outlier identification is obtained when we delete observation i, fit the regression function to the remaining n-1 observations, and obtain the expected value for that observation given its X levels. The difference between the predicted and the actually observed value produces a deleted residual. This can also be expressed using a leverage value.
Deleted residuals: $d_i = Y_i - \hat{Y}_{i(i)} = \dfrac{e_i}{1 - h_{ii}}$
Studentized deleted residuals: $t_i = \dfrac{d_i}{s\{d_i\}} = \dfrac{e_i}{\sqrt{MSE_{(i)} (1 - h_{ii})}} = e_i \left[ \dfrac{n - p - 1}{SSE (1 - h_{ii}) - e_i^2} \right]^{1/2} \sim t_{n-p-1}$

62 Detection of Outlying Y Observations
Criterion for Y outliers: in order to establish that the i-th observation is a Y outlier we have to compare the value of $|t_i|$ with t, where t is the $100(1 - \alpha/2n)$-th percentile of the t distribution with (n-p-1) degrees of freedom.
Outlying X Observations
The average leverage value is $\bar{h} = p/n$.
Criterion for X outliers: if $h_{ii} > 2p/n$, then observation i is an X outlier.
A Simple Example
[Data listing with columns Pop, Income, Land, Beds; entries lost.]
[Minitab regression output; numeric entries lost.] S = .80, R-Sq = 95.7%, R-Sq(adj) = 95.6%
A Simple Example (continued)
[Table with columns Pred., Resid., Stud. Res., Del. Stud. Res., $h_{ii}$; entries lost.]

63 Influence of Outlying X/Y Observations
Influence on a single fitted value: the influence that case i has on the fitted value $\hat{Y}_i$. Omission is the test. If exclusion causes major changes in the fitted regression function, then a case is indeed influential.
Criteria for influential observations: if $|DFFITS| > 1$ (small to medium data sets), or if $|DFFITS| > 2\sqrt{p/n}$ (large data sets), where
$DFFITS_i = \dfrac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)} h_{ii}}} = t_i \left( \dfrac{h_{ii}}{1 - h_{ii}} \right)^{1/2}$
Influence of Outlying X/Y Observations
An aggregate measure is also required: one which measures the effect of omission of case i on all n fitted values, not just the i-th fitted value. The statistic is Cook's Distance:
$D_i = \dfrac{\sum_{j=1}^{n} (\hat{Y}_j - \hat{Y}_{j(i)})^2}{p \, MSE} = \dfrac{e_i^2}{p \, MSE} \left[ \dfrac{h_{ii}}{(1 - h_{ii})^2} \right]$
Criterion for influential observations: compare $D_i$ with the F distribution with (p, n-p) degrees of freedom. If the percentile (that $D_i$ cuts off from the left side of the distribution curve) is 10 or 20, the observation has little influence; if this percentile is 50 or more, the influence is large.
Influence of outliers on betas
Another measure is required: one which measures the effect of omission of case i on the OLS estimates of the regression coefficients (betas).
$DFBETAS_{k(i)} = \dfrac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)} c_{kk}}}$
Here, $c_{kk}$ is the k-th diagonal element of $(X'X)^{-1}$.
Criteria for influential observations: if $|DFBETAS| > 1$ for small data sets, or if $|DFBETAS| > 2/\sqrt{n}$ for large data sets.
A Simple Example
[Minitab regression output; numeric entries lost.] S = .80, R-Sq = 95.7%, R-Sq(adj) = 95.6%
[Table with columns Resid., Stud. Res., Del. Stud. Res., $h_{ii}$, DFFITS, COOKD; entries lost.]
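All of the case diagnostics above follow from the residuals and the hat-matrix diagonal of a single fit. The sketch below computes them with NumPy on simulated data with one planted outlier; the function name `influence`, the data, and the rule-of-thumb flags at the end are assumptions of this illustration.

```python
import numpy as np

def influence(X, y):
    """Leverages, studentized and studentized deleted residuals, DFFITS, Cook's D."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix
    h = np.diag(H)
    e = y - H @ y
    sse = float(e @ e)
    mse = sse / (n - p)
    r = e / np.sqrt(mse * (1 - h))                           # studentized residuals
    t = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))    # studentized deleted
    dffits = t * np.sqrt(h / (1 - h))
    cook = e**2 / (p * mse) * h / (1 - h) ** 2
    return {"h": h, "r": r, "t": t, "dffits": dffits, "cook": cook}

# Hypothetical data with one planted outlier in the response.
rng = np.random.default_rng(7)
n, p = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = 1.0 + 2.0 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(size=n)
y[0] += 8.0

diag = influence(X, y)
high_leverage = np.where(diag["h"] > 2 * p / n)[0]    # X-outlier rule of thumb
large_dffits = np.where(np.abs(diag["dffits"]) > 1)[0]
```

The leverages sum to p, so the average leverage is p/n, which is where the 2p/n cut-off comes from.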


64 Quantitative Forecasting
Time Series Models; Causal Models
KNN Ch. 12
Methods: Moving Average, Exponential Smoothing, Trend Models, Regression
Time series data is a sequence of observations collected from a process with equally spaced periods of time.
Contrary to restrictions placed on cross-sectional data, the major purpose of forecasting with time series is to extrapolate beyond the range of the explanatory variables.
[Flowchart: Time Series -> Trend? No: Smoothing Methods (Moving Average, Exponential Smoothing). Yes: Trend Models (Linear, Quadratic, Exponential, Auto-Regressive).]

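65 Regression Model with AR(1) error
$Y_t = \beta_0 + \beta_1 X_t + \varepsilon_t$, $\quad \varepsilon_t = \rho \varepsilon_{t-1} + u_t$
The errors $u_t$ are independent and normally distributed $N(0, \sigma^2)$.
The autoregressive parameter $\rho$ has $|\rho| < 1$.
Multiple Regression Model with AR(1) error
The previous simple regression model can be expanded to accommodate multiple predictors:
$Y_t = \beta_0 + \beta_1 X_{t1} + \dots + \beta_{p-1} X_{t,p-1} + \varepsilon_t$, $\quad \varepsilon_t = \rho \varepsilon_{t-1} + u_t$
Autoregressive expansion
The autocorrelation parameter $\rho$ is the correlation coefficient between adjacent error terms.
Expanding the definition of $\varepsilon_t$:
$\varepsilon_t = \rho \varepsilon_{t-1} + u_t = \rho (\rho \varepsilon_{t-2} + u_{t-1}) + u_t = \rho^2 \varepsilon_{t-2} + \rho u_{t-1} + u_t = \rho^3 \varepsilon_{t-3} + \rho^2 u_{t-2} + \rho u_{t-1} + u_t$
(autoregressive component: the lagged $\varepsilon$ term; random error component: the u terms)
Autoregressive expansion
The correlation coefficient diminishes over time, since $|\rho| < 1$.
This is why an ACF plot exhibits a diminishing correlation pattern for AR(1) models.
[Figure: ACF and PACF plots.]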
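The expansion can be checked on a simulated error series. The sketch below generates AR(1) errors with an assumed value of rho = 0.7 (chosen only for illustration), rebuilds one error term from the three-step expansion, and lists the theoretical autocorrelations rho^k that produce the decaying ACF pattern.

```python
import numpy as np

rng = np.random.default_rng(8)
rho, T = 0.7, 200                  # assumed autoregressive parameter and length
u = rng.normal(size=T)             # independent N(0, 1) disturbances

# Build the AR(1) errors recursively: eps_t = rho * eps_{t-1} + u_t
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + u[t]

# Three-step expansion of the same term.
t = 50
expanded = rho**3 * eps[t - 3] + rho**2 * u[t - 2] + rho * u[t - 1] + u[t]

# Theoretical autocorrelation at lags 0..5 for an AR(1) process: rho ** k
theoretical_acf = rho ** np.arange(6)
```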

66 Remedial measures for AR errors in regression models
- Cochrane-Orcutt procedure
- Hildreth-Lu procedure
- First differences procedure
All estimates are close to each other; the last procedure is the simplest.
First Differences procedure
$Y_t' = Y_t - Y_{t-1}$, $\quad X_t' = X_t - X_{t-1}$
$Y_t' = \beta_1' X_t' + u_t$ (regression through the origin)
Back transformations: $\hat{Y} = b_0 + b_1 X$, with $b_0 = \bar{Y} - b_1' \bar{X}$ and $b_1 = b_1'$
The Blaisdell Company Example (Blaisdell.xls)
[Data listing with columns Year, Quarter, t, CompanySales, IndustrySales; entries lost.]
The Blaisdell Company Example
$Y_t' = Y_t - Y_{t-1}$, $\quad X_t' = X_t - X_{t-1}$ (regression through the origin)
Back transformations: $\hat{Y} = b_0 + b_1 X$ [fitted values lost]
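The first differences procedure amounts to a regression through the origin on differenced data followed by the back transformation. The sketch below uses a simulated series with strongly autocorrelated errors, not the Blaisdell data (which did not survive extraction); the function name and all numeric settings are assumptions.

```python
import numpy as np

def first_differences_fit(x, y):
    """First differences procedure: slope from differenced data, then back-transform."""
    dx, dy = np.diff(x), np.diff(y)
    b1 = float(dx @ dy / (dx @ dx))          # regression through the origin
    b0 = float(y.mean() - b1 * x.mean())     # back transformation for the intercept
    return b0, b1

# Hypothetical trending series with AR(1) errors and rho close to 1.
rng = np.random.default_rng(9)
T, rho = 40, 0.9
x = np.linspace(100.0, 180.0, T) + rng.normal(scale=1.0, size=T)
eps = np.zeros(T)
for t in range(1, T):
    eps[t] = rho * eps[t - 1] + rng.normal(scale=0.5)
y = -1.5 + 0.18 * x + eps

b0, b1 = first_differences_fit(x, y)
```

On noise-free data such as y = 3 + 2x the procedure returns the intercept 3 and slope 2 exactly, which is a quick check that the back transformation is wired correctly.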

67 Forecasting
Forecasts obtained with autoregressive error regression models are conditional on the past observations.
Using recursive relations, two or three-step ahead forecasts can be obtained, but prediction intervals will expand very fast.
KNN Ch. 14
Regression Models with Binary Response Variable
In many applications the response variable has only two possible outcomes (0/1):
- In a study of liability insurance possession, using Age of head of household, Amount of liquid assets, and Type of occupation of head of household as predictors, the response variable had two possible outcomes: Household has liability insurance (=1), or Household does not have liability insurance (=0)
- The financial status of a firm (sound status, headed toward insolvency) can be coded as 0/1
- Blood pressure status (high blood pressure, not high blood pressure) can be coded as 0/1
Meaning of the Response Function for Binary Outcomes
Consider the simple linear regression model
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, $\quad Y_i = 0, 1$
$E\{Y_i\} = \beta_0 + \beta_1 X_i$
In this case, the expected response $E\{Y_i\}$ has a special meaning. Consider $Y_i$ to be a Bernoulli random variable:
$Y_i = 1$ with probability $P(Y_i = 1) = \pi_i$; $\quad Y_i = 0$ with probability $P(Y_i = 0) = 1 - \pi_i$

68 Meaning of the Response Function for Binary Outcomes
Using the definition of expected value of a random variable,
$E\{Y_i\} = 1(\pi_i) + 0(1 - \pi_i) = \pi_i = P(Y_i = 1)$
$E\{Y_i\} = \beta_0 + \beta_1 X_i = \pi_i$
Therefore, the mean response $E\{Y_i\}$ is the probability that $Y_i = 1$ when the level of the predictor variable is $X_i$.
[Figure: E{Y} on a 0 to 1 vertical axis against X, with the line $E\{Y\} = \beta_0 + \beta_1 X$.]
Problems when Response Variable is Binary
1. Error terms are not normal: at each X level, the error cannot be normally distributed since it takes only 2 possible values, depending on whether Y is 0 or 1.
2. Error variance is not constant: the error variance is a function of X, therefore not constant.
3. Constraints on the response function: we need to find response functions that stay between 0 and 1, and that is not easy.
Link Functions
Cumulative distribution functions have a sigmoid shape that can be helpful as a response function of a regression model with binary outcome. The inverse of such a function is called a Link Function.
We want to choose a link function that best fits our data. Goodness-of-fit statistics can be used to compare fits using different link functions:
Name             Link function                      Distribution   Mean                        Variance
logit            $g(\pi) = \log(\pi/(1-\pi))$       logistic       0                           $\pi^2/3$
normit/probit    $g(\pi) = \Phi^{-1}(\pi)$          normal         0                           1
gompit           $g(\pi) = \log(-\log(1-\pi))$      Gumbel         $-\gamma$ (Euler constant)  $\pi^2/6$
(In the Variance column, $\pi$ denotes the constant 3.14159..., not a probability.)
logit transformation
Assumption: The logit transformation of the probabilities of the target value results in a linear relationship with the input variables.

69 Linear Regression
- Target is an interval variable.
- Input variables have any measurement level.
- Predicted values are the mean of the target variable at the given values of the input variables.
Logistic Regression
- Target is a discrete (binary or ordinal) variable.
- Input variables have any measurement level.
- Predicted values are the probability of a particular level(s) of the target variable at the given values of the input variables.
Interpretation of Parameter Estimates
The interpretation of the parameter estimates depends on:
- The link function
- The reference event (1 or 0)
- The reference factor levels (for numerical factors, the reference level is the smallest value)
The logit link function provides the most natural interpretation of the estimated coefficients:
- The odds of a reference event is the ratio of P(event) to P(not event).
- The estimated coefficient of a predictor (factor or covariate) is the estimated change in the log of P(event)/P(not event) for each unit change in the predictor, assuming the other predictors remain constant.
Generalized Linear Model
$E(Y \mid X = x) = g(x; w)$, $\quad g^{-1}(E(Y \mid X = x)) = w_0 + w_1 x_1 + \dots + w_p x_p$
Logistic regression
$\text{logit}(p) = \log\left( \dfrac{p}{1-p} \right) = w_0 + w_1 x_1 + \dots + w_p x_p$ (the log odds)
[Figures: fitted surfaces over the training data; logit(p) against p on a 0 to 1.0 scale.]

70 $\log\left( \dfrac{p}{1-p} \right) = w_0 + w_1 x_1 + \dots + w_p x_p$
After a one-unit increase in $x_1$: $\log\left( \dfrac{p'}{1-p'} \right) = w_0 + w_1 (x_1 + 1) + \dots + w_p x_p$
The odds are therefore multiplied by $\exp(w_1)$: the odds ratio.
[Figure: training data.]
To identify | Use | Which measures
- poorly fit factor/covariate patterns | Pearson residual | the difference between the actual and the predicted observation
- factor/covariate patterns with strong influence on parameter estimates | delta beta | changes in the coefficients when the j-th factor/covariate pattern is removed, based on Pearson residuals
- factor/covariate patterns with a large leverage | Leverage (Hi) | leverages of the j-th factor/covariate pattern, a measure of how unusual the predictor values are
HMEQ Overview
Determine who should be approved for a home equity loan.
The target variable is a binary variable that indicates whether an applicant eventually defaulted on the loan.
The input variables are variables such as the amount of the loan, amount due on the existing mortgage, the value of the property, and the number of recent credit inquiries.
The consumer credit department of a bank wants to automate the decision-making process for approval of home equity lines of credit. To do this, they will follow the recommendations of the Equal Credit Opportunity Act to create an empirically derived and statistically sound credit scoring model. The model will be based on data collected from recent applicants granted credit through the current process of loan underwriting. The model will be built from predictive modeling tools, but the created model must be sufficiently interpretable so as to provide a reason for any adverse actions (rejections).
The HMEQ data set contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates if an applicant eventually defaulted or was seriously delinquent. This adverse outcome occurred in 1,189 cases (20%). For each applicant, 12 input variables were recorded.
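The odds-ratio reading of a logit coefficient can be confirmed with a few lines of arithmetic. The weights below are hypothetical values chosen only for the illustration; nothing here is fitted to the HMEQ data.

```python
import math

def inv_logit(z):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds of the event: P(event) / P(not event)."""
    return p / (1.0 - p)

# Hypothetical coefficients for a model with a single input x.
w0, w1 = -2.0, 0.8

p_at_1 = inv_logit(w0 + w1 * 1.0)   # probability when x = 1
p_at_2 = inv_logit(w0 + w1 * 2.0)   # probability when x = 2

# Ratio of the odds after a one-unit increase in x.
odds_ratio = odds(p_at_2) / odds(p_at_1)
```

The ratio equals exp(w1) whatever starting value of x is used, whereas the change in the probability itself depends on where on the sigmoid the starting point lies.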

Name      Model Role  Measurement Level  Description
BAD       Target      Binary             1=defaulted on loan, 0=paid back loan
REASON    Input       Binary             HomeImp=home improvement, DebtCon=debt consolidation
JOB       Input       Nominal            Six occupational categories
LOAN      Input       Interval           Amount of loan request
MORTDUE   Input       Interval           Amount due on existing mortgage
VALUE     Input       Interval           Value of current property
DEBTINC   Input       Interval           Debt-to-income ratio
YOJ       Input       Interval           Years at present job
DEROG     Input       Interval           Number of major derogatory reports
CLNO      Input       Interval           Number of trade lines
DELINQ    Input       Interval           Number of delinquent trade lines
CLAGE     Input       Interval           Age of oldest trade line in months
NINQ      Input       Interval           Number of recent credit inquiries

The credit scoring model computes a probability of a given loan applicant defaulting on loan repayment. A threshold is selected such that all applicants whose probability of default is in excess of the threshold are recommended for rejection.

For model comparison purposes, we added two variables:
BEHAVIOR (good/bad), which precisely mirrors the 0/1 values in BAD, to see how we can perfectly predict BAD using insider information
FLIPCOIN (Head/Tail), which is completely random, to see if we can predict BAD using random flips of a coin

Enterprise-grade (and expensive!) Data Mining package
Implemented Methodology: Sample-Explore-Modify-Model-Assess (SEMMA)
Available Modeling Tools:
Logistic Regression
Many others, such as Decision Trees, Neural Networks, Clustering, Market-Basket, etc.
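The threshold rule described for the credit scoring model can be sketched in a few lines. The applicant labels, their estimated probabilities, and the threshold of 0.5 are all hypothetical; the slides do not state which threshold was chosen.

```python
def recommend(prob_default, threshold):
    """Return 'reject' when the estimated default probability exceeds the threshold."""
    return "reject" if prob_default > threshold else "approve"

# Hypothetical estimated default probabilities for three applicants.
scores = {"A": 0.08, "B": 0.35, "C": 0.72}
threshold = 0.5  # assumed value, for illustration only

decisions = {name: recommend(p, threshold) for name, p in scores.items()}
print(decisions)
```

Lowering the threshold rejects more applicants, so the choice trades missed defaults against rejected good borrowers.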

Three logistic Regression nodes were added to the Analysis Diagram. In order to compare them, a Compare node was added.

PerfectRegression is, of course, perfect.
In BaselineRegression, 20% of the borrowers default, regardless of fitted value.
StepwiseRegression is somewhere between the other two models.
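The gap between a baseline model and a perfect model can be illustrated with a toy comparison. This sketch assumes average squared error as the comparison measure, which the slides do not specify, and uses ten made-up borrowers with a 20% default rate rather than the HMEQ data.

```python
def average_squared_error(actual, predicted):
    """Mean of squared differences between 0/1 outcomes and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)

# Toy outcomes (assumed): 2 defaults among 10 borrowers, i.e. a 20% default rate.
bad = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Baseline: the same predicted probability (the base rate) for every borrower.
baseline = [sum(bad) / len(bad)] * len(bad)

# Perfect: predictions that mirror the outcome, as an input like BEHAVIOR would allow.
perfect = [float(y) for y in bad]

ase_baseline = average_squared_error(bad, baseline)
ase_perfect = average_squared_error(bad, perfect)
print(ase_baseline, ase_perfect)
```

Any useful model should land between these two values, which is where the stepwise model is reported to fall.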
