Data Analysis, Statistics, Machine Learning
Leland Wilkinson
Adjunct Professor, UIC Computer Science
Chief Scientist, H2O.ai
leland.wilkinson@gmail.com
Copyright 2016 Leland Wilkinson
Most statistical prediction models take one of two forms:

$y = \sum_j \beta_j x_j + \varepsilon$   (additive function)
$y = f(x_j, \varepsilon)$   (nonlinear function)

The distinction is important. The first form is called an additive model; the second form is called a nonlinear model. Additive models can be curvilinear (if terms are nonlinear). Nonlinear models cannot be transformed to linear.

Examples of linear or linearizable models:
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$
$y = \alpha e^{\beta x + \varepsilon}$

Examples of nonlinear models:
$y = \beta_1 x_1 / (\beta_2 x_2) + \varepsilon$
$y = \log(\beta_1 x_1)\,\varepsilon_2$
Regression predicts a set of values on a variable y from values on one or more x variables. This is done by fitting a mathematical function that, for any value(s) on the x variable(s), yields the most probable value of y.

The simple linear model is $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, with estimates $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.

The badness or loss of this prediction in a sample of values is represented by the discrepancies between the y values and their corresponding predicted values:

$\mathrm{loss} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
Ordinary Least Squares (OLS)

Legendre (1805): "Of all the principles that can be proposed for this purpose, I think there is none more general, more exact, or easier to apply, than that which we have used in this work; it consists of making the sum of the squares of the errors a minimum. By this method, a kind of equilibrium is established among the errors which, since it prevents the extremes from dominating, is appropriate for revealing the state of the system which most nearly approaches the truth." (translation by Stigler, 1986)
Regression: Francis Galton (1887)

He noticed that the blue lines were the same length as the red ones. That is, the line best predicting Y from X regressed away from the major axis.
Regression: Francis Galton (1887)

His data weren't as clean and linear as he imagined, but that didn't matter.

[Scatterplot: Height of Child in Inches (x axis, 60 to 75) vs. Height of Mid-Parent in Inches (y axis, 60 to 75)]

Wachsmuth, Wilkinson & Dallal, 2003
Regression: Francis Galton (1887)

That's because he aggregated over different sources.

[Four scatterplots: Height of Father and Height of Mother (y axes, 50 to 80) against Height of Son and Height of Daughter (x axes, 50 to 80)]

Wachsmuth, Wilkinson & Dallal, 2003
Estimating via Ordinary Least Squares
Estimating via Ordinary Least Squares

For intercept ($b_0$) and slope ($b_1$), we could use calculus, the way we did when we used maximum likelihood to estimate the mean and sd. In this case, we want to minimize the sum of squared residuals, SSE.

Given $y_i = b_0 + b_1 x_i + e_i$, we sum the squared $e_i$ to get SSE:

$SSE = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$

Compute the partial derivatives with respect to $b_0$ and $b_1$. Set these derivatives to zero (where the minimum SSE exists). Solve the resulting simultaneous equations. This is what Legendre originally did. But there is an easier way.
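To fill in the step the slide skips (standard algebra, not from the slide), setting the two partial derivatives to zero gives the normal equations

$\dfrac{\partial\, SSE}{\partial b_0} = -2\sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0, \qquad \dfrac{\partial\, SSE}{\partial b_1} = -2\sum_{i=1}^{n} x_i\,(y_i - b_0 - b_1 x_i) = 0$

and solving them yields

$\hat{b}_1 = \dfrac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{b}_0 = \bar{y} - \hat{b}_1 \bar{x}$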
Estimating via OLS (we'll use matrices)

$Y = XB + E$
$X'Y = X'XB + X'E$
$(X'X)^{-1}X'Y = (X'X)^{-1}(X'X)B$   (since $X'E = 0$: the least-squares residuals are orthogonal to the predictors)
$(X'X)^{-1}X'Y = B$
$E = Y - XB$

Worked example:
$Y = (3, 2, 2, 4, 4, 6, 5)'$
$X$ = a column of ones and $x = (2, 1, 2, 3, 4, 5, 6)'$
$B = (1.25, 0.75)'$
$E = (0.25, 0.00, -0.75, 0.50, -0.25, 1.00, -0.75)'$

[Scatterplot of y against x with the fitted line]
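As a check on the worked example, a minimal numpy sketch of $B = (X'X)^{-1}X'Y$ (the code itself is mine, not the slide's):

```python
import numpy as np

# Data from the slide
y = np.array([3, 2, 2, 4, 4, 6, 5], dtype=float)
x = np.array([2, 1, 2, 3, 4, 5, 6], dtype=float)
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column

# B = (X'X)^{-1} X'Y  (use solve rather than an explicit inverse for stability)
B = np.linalg.solve(X.T @ X, X.T @ y)
E = y - X @ B                               # residuals

print(B)   # [1.25 0.75]
print(E)   # [ 0.25  0.   -0.75  0.5  -0.25  1.   -0.75]
```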
Estimating the regression model parameters

How do we know we minimized the error sum of squares?

$X'Y = \|X\|\,\|Y\|\cos\theta$, so $\cos\theta = \dfrac{X'Y}{\|X\|\,\|Y\|}$: the Pearson correlation coefficient (if the data are centered).

The length of $E$ is $\sqrt{E'E}$. Since $\|Y\|$ and $\theta$ are fixed (they depend only on $X$ and $Y$), the shortest distance from the point $Y$ to the line spanned by $X$ is the perpendicular $E$. QED

[Diagram: vector Y at angle $\theta$ to X, with XB its projection onto X and E the perpendicular from Y]
Estimation (assuming we sampled from a population)

$B = (\hat{\beta}_0, \hat{\beta}_1)'$ is a least-squares estimator:
unbiased (assuming residuals are homogeneous)
smallest variance among unbiased estimators
the Best Linear Unbiased Estimator (BLUE)

If the residuals are normally distributed, B is also a maximum likelihood estimator, and we can do classical statistical tests on estimates of parameters.
Estimating via OLS: two predictors (same formulas)

$Y = XB + E$, $B = (X'X)^{-1}X'Y$, $E = Y - XB$

$Y = (3, 2, 2, 4, 4, 6, 5)'$
$X$ = a column of ones, $x_1 = (2, 1, 2, 3, 4, 5, 6)'$, and $x_2 = (3, 4, 2, 1, 2, 4, 3)'$
$B = (0.897, 0.746, 0.135)'$
$E = (0.206, -0.183, -0.659, 0.730, -0.151, 0.833, -0.778)'$
The two-predictor vector space

$Y = XB + E$

[Figure (Wikipedia): Y projected onto the plane spanned by the two predictors, with E perpendicular to that plane]
The two-predictor vector space

It is possible for y not to be correlated much with either $X_1$ or $X_2$, yet be highly correlated with the linear combination of $X_1$ and $X_2$. So don't throw out predictors by looking at their correlations with the dependent variable.

[Diagram: Y at angle $\theta$ to its projection XB in the plane of $X_1$ and $X_2$, with residual E]
The two-predictor vector space

Z is not significantly related to X. Z is not significantly related to Y. But the multiple correlation of Z with X and Y is almost 1!

[Two scatterplots: Z (range about ±0.03) against X and against Y (ranges about ±3), both patternless]
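A small numpy construction (my own, not the slide's data) that reproduces this suppressor effect: Z is the tiny difference between two nearly identical variables, so it is nearly uncorrelated with each one separately but is an exact linear combination of the two.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = x + rng.normal(scale=0.01, size=n)   # y is almost identical to x
z = x - y                                 # z is the tiny difference

def r(a, b):                              # Pearson correlation
    return np.corrcoef(a, b)[0, 1]

print(r(z, x), r(z, y))                   # both essentially 0

X = np.column_stack([np.ones(n), x, y])   # regress z on x and y together
zhat = X @ np.linalg.lstsq(X, z, rcond=None)[0]
print(r(zhat, z))                         # the multiple correlation: 1.0
```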
Standardized Regression Coefficients (Beta Weights)

Standardize the variables, then compute the regression. This removes scales from consideration of the size of coefficients. Social scientists love this stuff.

"Why then are correlation coefficients so attractive? Only bad reasons seem to come to mind. Worst of all, probably, is the absence of any need to think about units for either variable. Given two perfectly meaningless variables, one is reminded of their meaninglessness when a regression coefficient is given, since one wonders how to interpret its value." Tukey (1969)
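A minimal sketch of beta weights (the function name and setup are mine):

```python
import numpy as np

def beta_weights(X, y):
    """OLS coefficients after z-scoring y and each column of X."""
    zX = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    # No intercept needed: standardized variables have zero means
    return np.linalg.lstsq(zX, zy, rcond=None)[0]
```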
Sums of Squares

$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$   regression sum of squares (explained)
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$   error sum of squares (unexplained)
$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$   total sum of squares
Goodness of fit

Pearson correlation:
$\rho_{X,Y} = \dfrac{COV(X,Y)}{\sqrt{VAR(X)\,VAR(Y)}}$, estimated by $\hat{\rho}_{X,Y} = r_{X,Y} = \dfrac{X'Y}{\|X\|\,\|Y\|}$ (if X and Y are centered)

Multiple correlation (square root of the coefficient of determination):
$R^2 = \dfrac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$   regression sum of squares / total sum of squares

With $R_{xx}$ the $p \times p$ matrix of correlations among the predictors (ones on the diagonal, $r_{x_i x_j}$ off it) and $R_{yx} = (r_{yx_1}, r_{yx_2}, \ldots, r_{yx_p})$ the correlations of the predictors with Y:

$R^2 = R_{yx} R_{xx}^{-1} R_{xy}$   (squared correlations with Y, adjusted for the correlations among the X's)
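A sketch of the matrix form $R^2 = R_{yx} R_{xx}^{-1} R_{xy}$ (the function name is mine):

```python
import numpy as np

def multiple_r2(X, y):
    """R^2 computed from the correlation matrix of the predictors and y."""
    C = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    Rxx = C[:-1, :-1]    # correlations among the predictors
    Ryx = C[-1, :-1]     # correlations of y with each predictor
    return Ryx @ np.linalg.solve(Rxx, Ryx)
```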
Inference

F statistic (test of significance for overall prediction):
$F_{p,\,n-p-1} = \dfrac{R^2/p}{(1 - R^2)/(n - p - 1)}$

Confidence intervals on regression coefficients:
$c_j = \mathrm{diag}\!\left((X'X)^{-1}\right)_j$
$s = \sqrt{SSE/(n - p - 1)}$
$CI = \hat{\beta}_j \pm t^{\alpha/2}_{n-p-1}\, s\sqrt{c_j}$
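A sketch of both computations, assuming scipy is available (the function name is mine; X is taken to include the column of ones):

```python
import numpy as np
from scipy import stats

def ols_inference(X, y, alpha=0.05):
    """Overall F statistic and (1 - alpha) CIs for each coefficient."""
    n, k = X.shape                     # k = p + 1 with the intercept column
    p = k - 1
    B = np.linalg.solve(X.T @ X, X.T @ y)
    E = y - X @ B
    sse = E @ E
    r2 = 1 - sse / ((y - y.mean()) ** 2).sum()
    F = (r2 / p) / ((1 - r2) / (n - p - 1))
    s = np.sqrt(sse / (n - p - 1))
    c = np.diag(np.linalg.inv(X.T @ X))
    t = stats.t.ppf(1 - alpha / 2, n - p - 1)
    ci = np.column_stack([B - t * s * np.sqrt(c), B + t * s * np.sqrt(c)])
    return F, ci
```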
An interesting application of Ordinary Least Squares

We have
$Z_{n \times p}$: n fixed points in p dimensions (cell towers, an MDS or other configuration, ...)
$d_n$: a new point y's estimated distance to each point in Z
Solve for the vector $b_p$ specifying the location of the new point.

[Diagram: fixed points $z_1, \ldots, z_7$ surrounding the new point y]
An interesting application of Ordinary Least Squares

$s = \sum_{i=2}^{n} \sum_{j=1}^{i-1} \sum_{k=1}^{p} (Z_{ik} - Z_{jk})^2$   (a scaling constant)

$y_m = d_i^2 - d_j^2 + s, \quad j = 1, \ldots, i-1,\; i = 2, \ldots, n,\; m = (i-1)(i-2)/2 + j$

$X_{mk} = 2(Z_{ik} - Z_{jk}), \quad j = 1, \ldots, i-1,\; i = 2, \ldots, n,\; m = (i-1)(i-2)/2 + j$

$b = (X'X)^{-1}X'y$
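A sketch of the idea (I set up the pairwise difference equations directly rather than through the slide's m-indexing, so treat the details as my own construction):

```python
import numpy as np

def locate(Z, d):
    """Least-squares position of a new point from distances d to fixed points Z."""
    n = len(Z)
    rows, rhs = [], []
    for i in range(1, n):
        for j in range(i):
            # ||b - z_i||^2 - ||b - z_j||^2 = d_i^2 - d_j^2, linearized in b
            rows.append(2 * (Z[j] - Z[i]))
            rhs.append(d[i]**2 - d[j]**2 - Z[i] @ Z[i] + Z[j] @ Z[j])
    X, y = np.array(rows), np.array(rhs)
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Check: recover a known point from noisy distances to random "towers"
rng = np.random.default_rng(0)
Z = rng.uniform(0, 10, size=(7, 2))                   # seven fixed points
b_true = np.array([4.0, 6.0])
d = np.linalg.norm(Z - b_true, axis=1) + rng.normal(scale=0.01, size=7)
print(locate(Z, d))                                    # close to [4. 6.]
```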
OLS Assumptions

The true model is linear in the parameters
The X variables are not random and are measured without error
The X variables are not collinear
Residuals are uncorrelated/independent
The residuals have constant variance (homoscedasticity)
If t and F test statistics are computed:
  Residuals must be Normally distributed
  Random sample from a population
Evaluating OLS assumptions

The true model is linear in the parameters. Use LOESS to put a smooth through the residual plot.
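A minimal sketch using the LOWESS smoother in statsmodels (the wrapper and variable names are mine):

```python
from statsmodels.nonparametric.smoothers_lowess import lowess

def residual_smooth(fitted, resid, frac=0.5):
    """LOESS smooth of residuals vs. fitted values; a flat curve supports linearity."""
    return lowess(resid, fitted, frac=frac)   # (fitted, smoothed) pairs, sorted
```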
Transformations: dealing with nonlinearity

Wilkinson, Blank, & Gruber (1996)
Transformations: dealing with nonlinearity

The Tukey-Mosteller bulging rule: re-express via powers, $x' = x^p$ and/or $y' = y^p$, chosen by the direction in which the curve bulges:
Ascend the y ladder (p > 1)
Descend the y ladder (p < 1)
Ascend the x ladder (p > 1)
Descend the x ladder (p < 1)
Evaluating OLS assumptions

The X variables are not random and are measured without error. There is no simple test or graphic for this; know the source of your measurements. If there is measurement error, you can use errors-in-the-variables methods. Or you can just forgeddaboutit, which is what most of the world does. Who cares if your coefficient estimates are biased? They are usually biased downward, so no harm done. Just don't show them to a social scientist. Social scientists love latent variable models. They are Platonists.
Evaluating OLS assumptions

The X variables are not collinear. Values of the Variance Inflation Factor (VIF) above 10 are worrisome:

$VIF_i = \dfrac{1}{1 - R_i^2}$

where $R_i^2$ comes from regressing predictor i on the other predictors. [Example panels: $R^2 = .995$ and $R^2 = .97$]
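A sketch of VIF computed exactly as defined, by regressing each predictor on the others (the function name is mine; X holds the predictors only, no ones column):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    out = []
    for i in range(X.shape[1]):
        others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
        b = np.linalg.lstsq(others, X[:, i], rcond=None)[0]
        resid = X[:, i] - others @ b
        r2 = 1 - resid.var() / X[:, i].var()   # R^2 of predictor i on the rest
        out.append(1 / (1 - r2))
    return np.array(out)
```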
Evaluating OLS assumptions

Residuals are uncorrelated/independent. An ACF plot can help spot violations, but there are other types of dependencies. Economists spend all their time worrying about this, and for good reason: serial dependence is more toxic than outliers. Don't even THINK of using ordinary linear regression (a trend line) on time series data.
Evaluating OLS assumptions

The residuals have constant variance (homoscedasticity). Try transformations (usually logging works for a power model), or use one of the heteroscedasticity corrections (MacKinnon-White, etc.).
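One route, sketched with statsmodels (HC3 is one of the MacKinnon-White style heteroscedasticity-consistent corrections; the wrapper is mine):

```python
import statsmodels.api as sm

def robust_fit(X, y):
    """OLS point estimates with heteroscedasticity-consistent standard errors."""
    return sm.OLS(y, sm.add_constant(X)).fit(cov_type="HC3")
```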
Transformations: dealing with heteroscedasticity

The log-log transformation
Evaluating OLS assumptions

If t and F test statistics are computed:
Residuals must be Normally distributed
Random sample from a population

Forget about tests for normality; they are worthless.
Evaluating OLS assumptions

If t and F test statistics are computed, residuals must be Normally distributed.
Evaluating OLS assumptions

Leverage: the diagonal of the hat matrix (it puts the hat on the parameters):

$H = X(X'X)^{-1}X'$

Cook's D measures the decrement in prediction from removing an observation; it is a type of leverage measure that is transformable to an approximate F.
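A sketch of both diagnostics (the function name is mine; X includes the column of ones):

```python
import numpy as np

def leverage_cooks_d(X, y):
    """Hat-matrix diagonal and Cook's D for each observation."""
    H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix
    h = np.diag(H)                            # leverages
    e = y - H @ y                             # residuals
    k = X.shape[1]                            # number of parameters
    s2 = e @ e / (len(y) - k)                 # residual variance estimate
    D = e**2 * h / (k * s2 * (1 - h)**2)      # Cook's distance
    return h, D
```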
Evaluating OLS assumptions

Leverage
Model Selection: Partial Residual Plots

Each plot consists of sets of residuals plotted against each other. One set, on the vertical axis, consists of the residuals from regressing y on all the X (predictor) variables except for the predictor on the horizontal axis. The other set, on the horizontal axis, consists of the residuals from regressing $X_i$ on all the other X variables. The result is a plot which shows you how each X variable is related to the Y variable when all the other X variables are taken into account.

Partial residual plots have several useful features:
The slope of the line in the plot is the partial regression coefficient corresponding to the predictor in the plot. A line with a steep slope and nice-looking residuals is a sign that a predictor belongs in the model.
The residuals from this line are the same as the residuals from regressing Y on all the predictors.
The plot helps you to judge whether the conditional relationship is linear or nonlinear.
Extreme values on the horizontal axis help you to identify high-leverage points.
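A minimal sketch of the residual pairs behind such a plot (the function name is mine; plotting is left out; X holds the predictors without the ones column):

```python
import numpy as np

def partial_residuals(X, y, j):
    """Residual pairs for a partial regression plot of predictor j."""
    others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    ry = y - others @ np.linalg.lstsq(others, y, rcond=None)[0]              # y | others
    rx = X[:, j] - others @ np.linalg.lstsq(others, X[:, j], rcond=None)[0]  # xj | others
    return rx, ry   # plot ry against rx; the slope is the partial coefficient
```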
Model Selection: Partial Residual Plots

time = 10.679 + 6.697 distance + 0.008 climb
Model Selection

Forward Selection Regression
Backward Elimination Regression
Stepwise Regression
All Possible Subsets Regression
Model Selection

Myers, J.L., Well, A.D., & Lorch Jr., R.F. (2013). Research Design and Statistical Analysis (3rd ed.). Routledge.
Model Selection: Akaike Information Criterion (AIC)

Longley data: predicting TOTAL (RSQ = .995)
Model Selection (m submodels)

$AIC = 2(p + 2) - 2\log(L)$
$\Delta AIC_k = AIC_k - AIC_{min}$   relative AIC
$L_k = \exp(-\Delta AIC_k / 2)$   relative likelihood
$W_k = L_k \big/ \sum_{i=1}^{m} L_i$   AIC weights

[Table columns: RSQ, Increment, AIC, Weight]
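A sketch of the weights computation (the example AIC values are made up):

```python
import numpy as np

def aic_weights(aic):
    """Akaike weights for a set of candidate submodels."""
    delta = np.asarray(aic) - np.min(aic)   # relative AIC
    rel_lik = np.exp(-delta / 2)            # relative likelihood of each model
    return rel_lik / rel_lik.sum()

print(aic_weights([100.0, 102.0, 110.0]))   # the best model gets the largest weight
```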
Model Selection (m submodels)
Other approaches to multicollinearity: Regularization

Ridge regression (Hoerl, 1962). Minimize

$\sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$

so $\hat{\beta} = (X'X + \lambda I)^{-1}X'y$.

We inflate the diagonal to reduce the influence of the off-diagonal covariances. Ridge regression shrinks the estimates toward zero, introducing bias, but this reduces the variance of the estimates.
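A minimal sketch (mine); in practice the predictors are standardized first and the intercept is left unpenalized:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimates: inflate the diagonal of X'X by lambda."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```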
Other approaches to multicollinearity: Regularization

LASSO (Least Absolute Shrinkage and Selection Operator). Minimize

$\sum_{i=1}^{n} (y_i - x_i\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$

The solution requires iteration. The penalty allows some coefficients to go to zero with others nonzero. Efron and Tibshirani use Least Angle Regression with Shrinkage (LARS), similar to stepwise regression.
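A sketch with scikit-learn's Lasso, on simulated data of my own:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 2 * X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=100)  # only two real effects

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # coefficients of the irrelevant predictors typically land at exactly 0
```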
Logistic Regression: OLS estimates

[Scatterplot: PHD (binary, y axis) against GRE (x axis, 400 to 900), with a straight OLS line]
Logistic Regression: LR estimates

[Scatterplot: PHD (binary, y axis) against GRE (x axis, 400 to 900), with the fitted logistic curve]
Logistic Regression: model and estimation

$p(x; \beta) = \dfrac{e^{x\beta}}{1 + e^{x\beta}} = \dfrac{1}{1 + e^{-x\beta}}$, with $x = (x_0, \ldots, x_p)$ and $\beta = (\beta_0, \ldots, \beta_p)$

$\mathrm{logit}(p) = \log\dfrac{p(x)}{1 - p(x)} = x\beta$   (log-odds)

$L(\beta; x_1, \ldots, x_n) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$   (Binomial likelihood)

$l(\beta; x_1, \ldots, x_n) = \sum_{i=1}^{n} \left( y_i\, \mathrm{logit}_i - \log(1 + e^{\mathrm{logit}_i}) \right)$   (log-likelihood)

Must use iterative optimization to find the maximum. A different model is used for more than two categories.
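A sketch of the iterative optimization, maximizing this log-likelihood with scipy on simulated data (all names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X, y):
    xb = X @ beta
    return -np.sum(y * xb - np.logaddexp(0, xb))   # log(1 + e^xb), computed stably

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.5])))))

fit = minimize(neg_loglik, x0=np.zeros(2), args=(X, y), method="BFGS")
print(fit.x)   # near the generating values (-0.5, 1.5)
```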
Poisson Regression (log-linear model): model and estimation

$E[Y \mid x] = e^{x\beta}$. The Poisson distribution has only one parameter, and y is integer valued:

$p(y; x, \beta) = \dfrac{e^{y\,x\beta}\, e^{-e^{x\beta}}}{y!}$

$L(\beta; y, x_1, \ldots, x_n) = \prod_{i=1}^{n} \dfrac{e^{y_i x_i\beta}\, e^{-e^{x_i\beta}}}{y_i!}$   (likelihood)

$l(\beta; y, x_1, \ldots, x_n) = \sum_{i=1}^{n} \left\{ y_i x_i\beta - e^{x_i\beta} - \log(y_i!) \right\}$   (log-likelihood; the $\log(y_i!)$ term doesn't involve $\beta$, so it isn't needed in order to maximize)

Must use iterative optimization to find the maximum.
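The same recipe for the Poisson model, with the $\log(y_i!)$ term dropped (simulated data, mine):

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X, y):
    xb = X @ beta
    return -np.sum(y * xb - np.exp(xb))   # log(y!) omitted: constant in beta

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
y = rng.poisson(np.exp(X @ np.array([0.3, 0.7])))

fit = minimize(neg_loglik, x0=np.zeros(2), args=(X, y), method="BFGS")
print(fit.x)   # near the generating values (0.3, 0.7)
```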
Poisson Regression

[Side-by-side fits to the same count data: OLS vs. Poisson regression]
Generalized Linear Models (GLM): Nelder & Wedderburn (1972)

Not to be confused with:
the General Linear Model (GLM), OLS with or without dummy variables
Generalized Least Squares (GLS), for dealing with heteroscedasticity

Estimation is done through Iteratively Reweighted Least Squares (IRWLS).

$x\beta = g(E[Y])$   (link function)
$E[Y] = \mu = g^{-1}(x\beta)$   (modeling the mean of Y through the inverse of the link function)

Distribution | Link Function Name | Link Function
Normal       | Identity           | $x\beta = \mu$
Binomial     | Logit              | $x\beta = \log\dfrac{\mu}{1 - \mu}$
Poisson      | Log                | $x\beta = \log(\mu)$
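A sketch of IRWLS for the Binomial/logit case (the function name is mine; a production fit would add a convergence test):

```python
import numpy as np

def glm_logit_irwls(X, y, iters=25):
    """Fit a Binomial GLM with logit link by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        eta = X @ beta                            # linear predictor
        p = 1 / (1 + np.exp(-eta))                # inverse link
        w = np.clip(p * (1 - p), 1e-9, None)      # working weights
        z = eta + (y - p) / w                     # working response
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
    return beta

rng = np.random.default_rng(5)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ np.array([0.5, -1.0])))))
print(glm_logit_irwls(X, y))   # near (0.5, -1.0)
```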
Nonlinear Regression

Functions of the form $y = f(x, \varepsilon)$. The left is linearizable by transformation; the right is intrinsically nonlinear:

$y = e^{\beta_0 + \beta_1 x + \varepsilon}$        $y = e^{\beta_0 + \beta_1 x} + \varepsilon$
Nonlinear Regression

$E[Y] = e^{a + b/x + c\log(x)}$

Minimize SSE (Newton, Quasi-Newton, Metropolis, ...). Magic, right?
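A sketch with scipy's curve_fit on this very model, using simulated data (mine); note the starting values, which previews the warnings on the next slide:

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b, c):
    """E[Y] = exp(a + b/x + c*log(x))"""
    return np.exp(a + b / x + c * np.log(x))

rng = np.random.default_rng(4)
x = np.linspace(1, 10, 100)
y = f(x, 0.5, -1.0, 0.8) + rng.normal(scale=0.1, size=100)

params, _ = curve_fit(f, x, y, p0=[0.0, 0.0, 0.0])   # minimizes SSE iteratively
print(params)   # near the generating values (0.5, -1.0, 0.8)
```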
Nonlinear Regression: things to worry about

Don't waste your time with R². There are all sorts of definitions; all are nonsense, and the bad ones are ridiculously large. Look at the sum of squared errors instead.

Don't hunt around for nonlinear equations that work. That's a fishing expedition. The equation you choose should be driven by theory and the domain it applies to.

Half the time your iterative fitting method will croak. That's a sign that there's a local minimum, or your equation is wrong, or your starting values are wildly off-target. Don't waste your time looking for another computer program. Nonlinear programs are notoriously finicky; your data are probably crap or your model is ridiculous.

Look at the fit before you examine any statistics. Most of the time you have one predictor and one dependent variable, so look at the fitted equation. Ignore the fit statistics and tests of significance until it looks good.

EVERYTHING fits. Keep in mind that almost any bad model will look pretty good. If it doesn't make theoretical sense, don't trust it.
References

McCullagh, P. & Nelder, J. (1989). Generalized Linear Models, Second Edition. Boca Raton: Chapman and Hall/CRC.
Tukey, J.W. (1969). Analyzing data: Sanctification or detective work? American Psychologist, 24, 83-91.