Introduction to statistical modeling Illustrated with XLSTAT Jean Paul Maalouf webinar@xlstat.com linkedin.com/in/jean-paul-maalouf November 30, 2016 www.xlstat.com 1
PLAN
- XLSTAT: who are we?
- Statistics: categories
- Reminder: statistical testing
- Principles of statistical modeling
- Simple linear regression / ANOVA
  - Principles
  - XLSTAT demo & interpretation of outputs: coefficients, p-values, R²
  - Assumptions about residuals and graphical verification
- Multiple linear regression
  - Principles & warnings: overfitting & multicollinearity
  - XLSTAT demo & interpretation of outputs
- What statistical modeling method to choose?
- Appendix: residuals, alternative verification methods
- Appendix: alternative modeling tools
All the data in this webinar were made up unless otherwise specified
XLSTAT: Who are we? XLSTAT is a user-friendly statistical add-on software for Microsoft Excel 3
XLSTAT A growing software and team XLSTAT realizes its first sale on the Internet New version, VBA interface, C++ computations, 7 languages New products, new website, growing and dynamic team 1993 2000 2009 2016 Thierry Fahmy develops a user-friendly solution for data analysis: XLSTAT is born 1996 The company Addinsoft is created 2006 New offers adapted to business needs 2015 XLSTAT 365 Cloud version of XLSTAT for Excel 365 XLSTAT Free Free limited Edition 4
XLSTAT in a few numbers 200+ statistical features General or field-oriented solutions 50k users Across the world. Companies, education, research 16 employees Always receptive to the needs of users 130k visits/month on the website Easy tutorials available in 5 languages 7 languages 400 downloads/day 5
Statistics: 4 categories 6
Statistics: 4 categories
- Description: I want to summarize small data sets (1-3 variables) using simple statistics or charts (mean, standard deviation, boxplots...).
- Exploration: I want to easily extract information from a large data set without necessarily having a precise question to answer (PCA, AHC...).
- Tests: I want to accept / reject a very precise hypothesis assuming error risks (t tests, ANOVA, correlation tests, chi-square...).
- Modeling: I want to understand the way a phenomenon evolves according to a set of parameters (regression, ANOVA, ANCOVA...).
Reminder on statistical testing I want to accept / reject a very precise hypothesis assuming error risks. 8
Reminder on statistical testing
Question: are averages A & B the same?
- H0, null hypothesis: generally implies an idea of equality. H0: Average A = Average B
- Ha, alternative hypothesis: generally implies an idea of difference. Ha: Average A ≠ Average B
The test computes a number called the p-value. 0 < p-value < 1
Decision: if p-value < alpha (the significance level chosen beforehand), we reject H0 and accept Ha, accepting a risk of at most alpha of being wrong.
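The decision rule above can be sketched in a few lines of Python. This sketch uses a permutation test, a swapped-in technique that needs only the standard library, rather than the t test a statistical package would typically run for this question. The two samples, the function name `permutation_p_value`, and alpha = 0.05 are made-up assumptions for illustration.

```python
import random
from statistics import mean

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided permutation p-value for H0: average A = average B."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Under H0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above 0.
    return (hits + 1) / (n_perm + 1)

# Made-up samples (hypothetical data, not from the webinar).
group_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
group_b = [13.0, 13.4, 12.9, 13.2, 13.5, 13.1]

alpha = 0.05  # assumed significance level
p_value = permutation_p_value(group_a, group_b)
decision = "reject H0" if p_value < alpha else "do not reject H0"
print(p_value, decision)
```

The p-value is compared with alpha only at the very end: alpha is chosen before looking at the data, and the test itself only produces the p-value.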
Principles of Statistical modeling I want to understand the way variables evolve according to other variables.
Principles of Statistical Modeling
Definition: a statistical model is a simplified representation of a phenomenon using numbers. It helps us better understand reality and make predictions.
A very simple example
Somebody asks you: what is the height of French people?
You have this table that contains height information (cm) of a representative sample of 200 French people.

Individual   Height (cm)
Janine       169
Françoise    158
Roger        159
Albert       168
Isabelle     171
Jean-Luc     187
Nicolas      171
Benoît       162
...          ...

- First way of answering: recite the whole table, row after row.
- Second way of answering: compute the mean and the standard deviation over the 200 values, and use these two numbers as an answer.
Representing the height of French people by a mean and a standard deviation is a way to model this height.
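As a small illustration of the "second way of answering", here is a sketch in Python that summarizes the eight heights visible in the table by a mean and a standard deviation. It only uses the rows shown on the slide, so the two numbers are not those of the full 200-person sample; the variable names are hypothetical.

```python
from statistics import mean, stdev

# The eight heights (cm) visible in the table; the full sample has 200 rows.
heights = [169, 158, 159, 168, 171, 187, 171, 162]

model_mean = mean(heights)  # first parameter of the model
model_sd = stdev(heights)   # sample standard deviation (n - 1 denominator)

print(f"mean = {model_mean:.1f} cm, sd = {model_sd:.1f} cm")
```

Two numbers replace the whole table: that loss of detail is exactly what makes it a model.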
Principles of Statistical Modeling
How models work technically: a model explains one or several dependent variables using one or several independent (explanatory) variables through mathematical equations that involve parameters.
The mean and standard deviation model does not involve explanatory variables.
Simple linear regression Principles, XLSTAT demo, interpretation of outputs, hypotheses on residuals 14
Data set: online shoe selling platform (rows: individuals; columns: variables)
Question: how does invoice amount vary according to time spent on site?
Example: modeling invoice amount according to time spent on website 16
Example: modeling invoice amount according to time spent on website
We could try simple linear regression (y = a*x + b):
Invoice amount = a * time spent on site + b + residuals
- Invoice amount: dependent variable
- Time spent on site: explanatory variable
- a * time spent on site + b: our way to simplify reality, a "straight line" model
- a and b: model parameters
- Residuals (errors): what we were unable to capture with our model
PS: we chose linear modeling, but this was absolutely not mandatory.
ANOVA may also be perceived as a statistical model (qualitative explanatory variables)
Model: Salary = average(reference level) + distance(average of the considered level) + residuals
- average(reference level): one parameter
- distance from the reference average to the average of the considered level: two parameters, one per non-reference level
- residuals: errors
[Chart: Salary by Origin (Earth, Pluto, Mars), with the reference level marked]
ANOVA, linear regression & ANCOVA are linear models
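To make the ANOVA parameters concrete, here is a minimal Python sketch that computes the reference-level average, the distance of each other level from it, and the residuals. The salaries are made up, and taking Earth as the reference level is an assumption for illustration (the slide does not say which level is the reference).

```python
from statistics import mean

# Made-up salaries per origin (hypothetical data).
salaries = {
    "Earth": [30, 32, 34],
    "Pluto": [40, 42, 44],
    "Mars": [35, 37, 39],
}

reference = "Earth"  # assumed reference level
ref_mean = mean(salaries[reference])  # one parameter

# One "distance" parameter per non-reference level.
offsets = {
    level: mean(values) - ref_mean
    for level, values in salaries.items()
    if level != reference
}

# Residual = observed salary minus the average of its own level.
residuals = {
    level: [y - mean(values) for y in values]
    for level, values in salaries.items()
}

print(ref_mean, offsets)
```

With three levels the model has three parameters in total: the reference average plus two distances.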
Modeling: parameter estimation. The case of simple linear regression
The best parameter values are those that minimize the residual sum of squares:
$S(a, b) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2$
[Chart: observed invoice amount (dots), predicted invoice amount (line), errors (residuals) between them]
This is what we call least squares estimation.
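For a straight line, the values of a and b that minimize the residual sum of squares have a well-known closed form. The Python sketch below applies it to made-up data; the function names `fit_line` and `rss` and the numbers are hypothetical, and this is not a description of how XLSTAT computes its estimates internally.

```python
from statistics import mean

def fit_line(x, y):
    """Least squares estimates (a, b) for the model y = a*x + b."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    a = sxy / sxx
    b = y_bar - a * x_bar
    return a, b

def rss(x, y, a, b):
    """Residual sum of squares S(a, b)."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))

# Made-up data: time spent on site (minutes) and invoice amount.
time_spent = [5, 10, 15, 20, 25]
invoice = [32, 41, 58, 65, 79]

a, b = fit_line(time_spent, invoice)
print(a, b, rss(time_spent, invoice, a, b))
```

Moving a or b away from the fitted values can only increase `rss`, which is a quick way to check the estimates by hand.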
Example: modeling invoice amount according to time spent on website - XLSTAT 20
Example: modeling invoice amount according to time spent on website: simple linear regression, XLSTAT outputs
- Parameter estimations (least squares): b and a
- Confidence intervals around the estimations
- P-values related to: H0: parameter = 0; Ha: parameter ≠ 0
The equation could be used to predict invoice amount according to new values of time spent on website.
Example: modeling invoice amount according to time spent on website: simple linear regression, XLSTAT outputs
- R² reflects goodness-of-fit (prefer adjusted R²). 0 < R² < 1
- Confidence interval of the model (based on parameter estimations)
- Confidence interval of the predictions (about 95% of new observations are expected to lie inside)
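For readers who want to see where R² and adjusted R² come from, here is a short Python sketch using the usual textbook formulas. The observed and predicted values are made up, and the function names are hypothetical; p is the number of explanatory variables (1 in simple linear regression).

```python
def r_squared(y, y_pred):
    """R² = 1 - (residual sum of squares) / (total sum of squares)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R² penalizes R² for the number p of explanatory variables."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Made-up observed invoice amounts and the values predicted by a fitted line.
observed = [32, 41, 58, 65, 79]
predicted = [31.4, 43.2, 55.0, 66.8, 78.6]

r2 = r_squared(observed, predicted)
adj_r2 = adjusted_r_squared(r2, n=len(observed), p=1)
print(round(r2, 4), round(adj_r2, 4))
```

Adjusted R² is never larger than R², and the gap widens as explanatory variables are added without a matching gain in fit.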
Linear model: assumptions about residuals. A linear model is only reliable under certain conditions related to its residuals.
Linear model: assumptions about residuals
- Independence: no autocorrelation. One measurement per individual.
- Normality: residuals should follow a normal distribution.
- Not too many outliers: in general, no more than 5% of outliers among residuals.
- Homoscedasticity: residuals should have a homogeneous variance.
Graphical examination of the assumptions about residuals
Residuals vs explanatory variables chart: dots are homogeneously distributed around the y = 0 line → the model is reliable
Assumptions about residuals: common patterns of violation
- Violating the independence assumption (autocorrelated residuals). [Chart: normalized residuals vs time] Frequently occurs in time series implying periodicity.
- Violating the homoscedasticity assumption (variance heterogeneity). [Chart: normalized residuals vs age] Frequently appears when variance is a function of the mean.
Assumptions about residuals: solutions when violated
- Think about outliers (eliminate them?)
- Transform y or x data (log, square root, Box-Cox...)
- Use a more convenient model (non-linear, Poisson...)
- Autocorrelation: use the Cochrane-Orcutt model (XLSTAT-Forecast)
Multiple linear regression y = a*x1 + b*x2 + ...
Multiple linear regression - principles Investigate the linear influence of several explanatory variables on the dependent variable; increase predictive quality 29
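To show what estimating several coefficients at once involves, here is a minimal Python sketch that solves the least squares normal equations with plain Gaussian elimination. It is a teaching sketch under stated assumptions, not the method used by any particular software: the data are made up (generated without noise from y = 3 + 2*x1 - x2), the function names are hypothetical, and real packages usually prefer numerically safer decompositions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_multiple(X, y):
    """Least squares coefficients [intercept, a, b, ...] via normal equations."""
    rows = [[1.0] + list(r) for r in X]  # leading 1 for the intercept
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

# Made-up data generated from y = 3 + 2*x1 - 1*x2, without noise.
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [3 + 2 * x1 - x2 for x1, x2 in X]

coef = fit_multiple(X, y)
print([round(c, 6) for c in coef])
```

Because the data contain no noise, the fit recovers the generating coefficients; with real data the residuals would be non-zero and the assumptions about them would need checking.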
Multiple linear regression - warnings In addition to the assumptions about residuals: beware of overfitting & multicollinearity 30
Multiple linear regression: warnings
Adding explanatory variables will increase the R².
Warning: do not add too many of them, to avoid obtaining models that are fitted too closely to your particular data and that will consequently be less generalizable (overfitting).
The AIC model quality index strikes a compromise between:
- a good fit to the data;
- a low number of parameters.
AIC is a relative quality index that should only be used to compare models with each other. The model with the lowest AIC is the best model in the model set.
Warning: beware of redundant variables. Some correlated explanatory variables may hide each other in terms of effects on the dependent variable. This is called multicollinearity (VIF index > 5). Examples: day temperature & night temperature; weight & height.
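Both warnings can be illustrated with a few lines of Python. The sketch below uses one common form of AIC for least squares fits (n * ln(SSE / n) + 2k, valid up to an additive constant, which may differ from the exact value a given package reports) and the standard definition VIF = 1 / (1 - R²), which with only two explanatory variables reduces to 1 / (1 - r²). All numbers and function names are made-up assumptions.

```python
import math

def aic_least_squares(sse, n, k):
    """AIC up to an additive constant, for a least squares fit with k parameters."""
    return n * math.log(sse / n) + 2 * k

def pearson_r(u, v):
    """Pearson correlation coefficient between two lists."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two explanatory variables."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Overfitting: 4 extra parameters barely reduce the error (made-up SSE values).
aic_simple = aic_least_squares(sse=20.0, n=30, k=2)
aic_complex = aic_least_squares(sse=19.5, n=30, k=6)

# Multicollinearity: made-up weight (kg) and height (cm) of six clients.
weight = [60, 72, 55, 90, 68, 80]
height = [165, 178, 160, 190, 172, 183]
vif = vif_two_predictors(weight, height)

print(aic_simple < aic_complex, vif > 5)
```

Here the simpler model has the lower AIC, and the VIF is far above 5, so one of the two variables would be a candidate for exclusion.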
Linear modeling of invoice amount according to a set of variables Multiple linear regression Question: which variables (D-G columns) have the strongest linear influence on invoice amount? Can we predict invoice amount of two new clients? 32
Linear modeling of invoice amount according to a set of variables Multiple linear regression - XLSTAT 33
Linear modeling of invoice amount according to a set of variables
Multiple linear regression: examining multicollinearity
High VIF (> 5) → redundant variables
Solution: exclude one of these 2 variables and re-launch the model
Linear modeling of invoice amount according to a set of variables
Multiple linear regression excluding height
Interpretation: weight has a significant positive effect on invoice amount
Linear modeling of invoice amount according to a set of variables Prediction 36
What statistical modeling method should you choose?
According to the type and number of dependent and explanatory variables, several solutions are available.
Link: choose an appropriate modeling tool according to your situation
Conclusion: let's get back to this question about height...
Different models to answer the same question
Somebody asks you: what is the height of French people?
Height of French people: dependent variable
1. Their height has this average and that standard deviation: normal distribution model
2. It depends on geographic origin: one-way ANOVA
3. It depends linearly on age: simple linear regression
4. It depends linearly on age and origin: ANCOVA
5. It depends linearly on age and father's height: multiple linear regression
6. It depends on origin and gender: 2-way ANOVA
7. Etc.
Age and father's height are quantitative explanatory variables; origin and gender are qualitative explanatory variables.
In summary 39
Introduction to statistical modeling - summary
- Statistical modeling makes it possible to:
  - investigate how dependent variables evolve according to explanatory variables using a mathematical equation that involves parameters;
  - predict using this equation.
- Linear models are reliable only under certain assumptions related to residuals: normality, homoscedasticity, absence of autocorrelation & not too many outliers.
- Beware of problems related to the introduction of too many explanatory variables: overfitting & multicollinearity.
- According to variable types, different models are available.
Thanks for attending! All the tools we saw are available in all XLSTAT solutions (except XLSTAT-Free) Survey time 41
Online recording availability of our webinars Until Dec. 16, 2016 42
Appendix: alternative modeling tools
- Tables with a high number of explanatory variables (more than the number of observations) and potentially strong multicollinearity: PLS regression
- Supervised machine learning: KNN, Naïve Bayes, SVM (especially for prediction); decision trees
Appendix: residuals, alternative verification methods
- Independence: run a Durbin-Watson test on standardized residuals (XLSTAT-Forecast).
- Normality: run a normality test on standardized residuals.
- Not too many outliers: check that no more than 5% of standardized residuals exceed 1.96 in absolute value.
- Homoscedasticity: run a heteroscedasticity test (Breusch-Pagan or White) on standardized residuals.
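Two of these checks are easy to sketch by hand in Python: the share of standardized residuals beyond 1.96 and the Durbin-Watson statistic (values near 2 suggest no first-order autocorrelation, values near 0 positive autocorrelation, values near 4 negative autocorrelation). This sketch computes the statistic only, not the test's p-value, and it standardizes residuals simply by their mean and sample standard deviation, which is a simplification of what statistical packages do. The residuals and function names are made up.

```python
from statistics import mean, stdev

def standardize(residuals):
    """Simplified standardization: center, then divide by the sample sd."""
    m, s = mean(residuals), stdev(residuals)
    return [(e - m) / s for e in residuals]

def share_beyond(std_residuals, threshold=1.96):
    """Proportion of standardized residuals larger than threshold in absolute value."""
    return sum(abs(e) > threshold for e in std_residuals) / len(std_residuals)

def durbin_watson(residuals):
    """Durbin-Watson statistic: squared successive differences over squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Made-up residuals, in observation order, that drift steadily upward.
residuals = [-3, -2, -1, 1, 2, 3]

outlier_share = share_beyond(standardize(residuals))
dw = durbin_watson(residuals)
print(outlier_share, dw)
```

In this made-up series the statistic is well below 2, which is the pattern expected from positively autocorrelated residuals.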