Introduction to statistical modeling

Introduction to statistical modeling, illustrated with XLSTAT
Jean Paul Maalouf
webinar@xlstat.com
linkedin.com/in/jean-paul-maalouf
November 30, 2016
www.xlstat.com

PLAN
XLSTAT: who are we?
Statistics: categories
Reminder: statistical testing
Principles of statistical modeling
Simple linear regression / ANOVA
    Principles
    XLSTAT demo & interpretation of outputs: coefficients, p-values, R²
    Assumptions about residuals and graphical verification
Multiple linear regression
    Principles & warnings: overfitting & multicollinearity
    XLSTAT demo & interpretation of outputs
What statistical modeling method to choose?
Appendix: residuals, alternative verification methods
Appendix: alternative modeling tools
All the data in this webinar were made up unless otherwise specified.

XLSTAT: Who are we?
XLSTAT is a user-friendly statistical add-on for Microsoft Excel.

XLSTAT A growing software and team XLSTAT realizes its first sale on the Internet New version, VBA interface, C++ computations, 7 languages New products, new website, growing and dynamic team 1993 2000 2009 2016 Thierry Fahmy develops a user-friendly solution for data analysis: XLSTAT is born 1996 The company Addinsoft is created 2006 New offers adapted to business needs 2015 XLSTAT 365 Cloud version of XLSTAT for Excel 365 XLSTAT Free Free limited Edition 4

XLSTAT in a few numbers
200+ statistical features: general or field-oriented solutions
50k users across the world: companies, education, research
16 employees: always receptive to the needs of users
130k visits/month on the website: easy tutorials available in 5 languages
7 languages
400 downloads/day

Statistics: 4 categories
Description: I want to summarize small data sets (1-3 variables) using simple statistics or charts (mean, standard deviation, boxplots...).
Exploration: I want to easily extract information from a large data set without necessarily having a precise question to answer (PCA, AHC...).
Tests: I want to accept / reject a very precise hypothesis assuming error risks (t tests, ANOVA, correlation tests, chi-square...).
Modeling: I want to understand the way a phenomenon evolves according to a set of parameters (regression, ANOVA, ANCOVA...).

Reminder on statistical testing
I want to accept / reject a very precise hypothesis assuming error risks.

Reminder on statistical testing
Question: are averages A & B the same?
H0 (null hypothesis): generally implies an idea of equality. H0: Average A = Average B
Ha (alternative hypothesis): generally implies an idea of difference. Ha: Average A ≠ Average B
The test computes a number called the p-value, which lies between 0 and 1.
Decision: if p-value < alpha, we reject H0 and accept Ha; the smaller the p-value, the lower the risk of being wrong in doing so.
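As a rough illustration of how a p-value for "are averages A & B the same?" can be obtained, here is a minimal sketch using a permutation test. This is a swapped-in technique chosen because it needs only the Python standard library; it is not necessarily the test XLSTAT runs. The two samples, the function name and the alpha value are made up for the example.

```python
import random

def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for H0: average A = average B."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # +1 in numerator and denominator keeps the p-value strictly above 0
    return (hits + 1) / (n_permutations + 1)

# Hypothetical samples (made up for illustration)
group_a = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
group_b = [13.0, 13.4, 12.9, 13.2, 13.5, 13.1]

alpha = 0.05
p_value = permutation_p_value(group_a, group_b)
decision = "reject H0" if p_value < alpha else "do not reject H0"
print(f"p-value = {p_value:.4f}: {decision}")
```

With samples that barely overlap, as here, almost no random relabeling produces a difference as large as the observed one, so the p-value is small and H0 is rejected.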

Principles of Statistical Modeling
I want to understand the way variables evolve according to other variables.

Principles of Statistical Modeling
Definition: a statistical model is a simplified representation of a phenomenon using numbers. It helps us understand reality better and make predictions.

A very simple example
You have this table that contains height information (cm) for a representative sample of 200 French people.

Individual    Height
Janine        169
Françoise     158
Roger         159
Albert        168
Isabelle      171
Jean-Luc      187
Nicolas       171
Benoît        162
...           ...

Somebody asks you: what is the height of French people?
First way of answering: recite the whole table, row after row.
Second way of answering: compute the mean and the standard deviation over the 200 values, and use these two numbers as an answer.
Representing the height of French people by a mean and a standard deviation is a way to model this height.
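The "second way of answering" can be sketched in a few lines of Python. Only the eight rows visible on the slide are used here, so the result is an illustration and not the summary of the full 200-person sample; the variable names are chosen for the example.

```python
from statistics import mean, stdev

# The eight heights (cm) shown on the slide; the remaining rows of the
# 200-person table are not available in the transcript.
heights_cm = [169, 158, 159, 168, 171, 187, 171, 162]

model_mean = mean(heights_cm)  # first parameter of the model
model_sd = stdev(heights_cm)   # second parameter (sample standard deviation)

print(f"mean = {model_mean:.1f} cm, standard deviation = {model_sd:.1f} cm")
```

Two numbers now stand in for the whole table, which is exactly the kind of simplification the slide calls a model.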

Principles of Statistical Modeling
Definition: a statistical model is a simplified representation of a phenomenon using numbers. It helps us understand reality better and make predictions.
How models work technically: a model explains one or several dependent variables using one or several independent variables, through mathematical equations that involve parameters.
The mean and standard deviation model does not involve explanatory variables.

Simple linear regression
Principles, XLSTAT demo, interpretation of outputs, hypotheses on residuals

Data set: online shoe selling platform (individuals in rows, variables in columns)
Question: how does invoice amount vary according to time spent on site?

Example: modeling invoice amount according to time spent on website

Example: modeling invoice amount according to time spent on website
We could try simple linear regression (y = a*x + b). Our way to simplify reality: a "straight line" model.
Invoice amount = a * time spent on site + b + residuals
    Invoice amount: dependent variable
    Time spent on site: explanatory variable
    a, b: model parameters
    Residuals: errors, i.e. what we were unable to capture with our model
PS: we chose linear modeling, but this was absolutely not mandatory.

ANOVA may also be perceived as a statistical model (qualitative explanatory variables)
Salary = average(reference level) + distance(average of the considered level) + residuals
    average(reference level): one parameter
    distance(average of the considered level): two parameters
    residuals: errors
[Chart: salary by origin (Earth, Pluto, Mars), with the reference level marked]
ANOVA, linear regression & ANCOVA are linear models.
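To make the "one parameter plus two parameters" reading concrete, here is a minimal sketch of a one-way ANOVA written as a model. The salary figures are made up, and taking Earth as the reference level is an assumption for the example.

```python
from statistics import mean

# Hypothetical salaries by origin (made up for illustration)
salaries = {
    "Earth": [30, 32, 31, 29],
    "Pluto": [35, 36, 34, 37],
    "Mars": [28, 27, 29, 26],
}
reference = "Earth"  # assumed reference level

# One parameter: the average of the reference level
reference_mean = mean(salaries[reference])

# Two parameters: distance of each other level's average to the reference
distances = {
    level: mean(values) - reference_mean
    for level, values in salaries.items()
    if level != reference
}

def predict(level):
    """Model prediction: reference average plus the level's distance."""
    return reference_mean + distances.get(level, 0.0)

# Residuals: what the model does not capture for each individual
residuals = {
    level: [y - predict(level) for y in values]
    for level, values in salaries.items()
}

print("reference average:", reference_mean)
print("distances:", distances)
```

Three levels need three numbers in total: one average for the reference and one distance for each remaining level.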

Modeling parameter estimation: the case of simple linear regression
The best parameter values are those that minimize the residual sum of squares:
$$S(a, b) = \sum_{i=1}^{n} \left( y_i - (a x_i + b) \right)^2$$
[Chart: observed invoice amount (dots), predicted invoice amount (line), errors (residuals) as the vertical gaps between them]
This is what we call least squares estimation.
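For a straight line, the values of a and b that minimize the residual sum of squares have a well-known closed form. The sketch below applies it to made-up data; the function and variable names are chosen for the example and are not XLSTAT's.

```python
def least_squares_line(x, y):
    """Return (a, b) minimizing sum((y_i - (a*x_i + b))**2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    a = s_xy / s_xx        # slope
    b = y_bar - a * x_bar  # intercept
    return a, b

def residual_sum_of_squares(x, y, a, b):
    """S(a, b): the quantity that least squares minimizes."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))

# Hypothetical data (made up for illustration)
time_on_site = [1, 2, 3, 4, 5]
invoice_amount = [12, 19, 29, 37, 45]

a, b = least_squares_line(time_on_site, invoice_amount)
best = residual_sum_of_squares(time_on_site, invoice_amount, a, b)
print(f"a = {a:.2f}, b = {b:.2f}, S(a, b) = {best:.2f}")
```

Moving a or b away from these values can only increase S(a, b), which is what "best parameter values" means here.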

Example: modeling invoice amount according to time spent on website - XLSTAT

Example: modeling invoice amount according to time spent on website - simple linear regression, XLSTAT outputs
Parameter estimations (least squares) for b and a
Confidence intervals around the estimations
P-values related to H0: parameter = 0 vs. Ha: parameter ≠ 0
The equation could be used to predict invoice amount according to new values of time spent on website.

Example: modeling invoice amount according to time spent on website - simple linear regression, XLSTAT outputs
R² reflects goodness of fit (prefer adjusted R²); 0 ≤ R² ≤ 1.
Confidence interval of the model (based on parameter estimations)
Confidence interval of the predictions (about 95% of new observations are expected to fall inside)
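As a sketch of what R² and adjusted R² measure, the snippet below computes both from their textbook definitions on made-up data. The formulas are the standard ones; whether XLSTAT's output matches them to the last digit is an assumption.

```python
# Hypothetical data (made up for illustration)
time_on_site = [1, 2, 3, 4, 5]
invoice_amount = [12, 19, 29, 37, 45]
n = len(time_on_site)
p = 1  # number of explanatory variables

# Least squares fit of invoice = a * time + b
x_bar = sum(time_on_site) / n
y_bar = sum(invoice_amount) / n
a = (sum((x - x_bar) * (y - y_bar) for x, y in zip(time_on_site, invoice_amount))
     / sum((x - x_bar) ** 2 for x in time_on_site))
b = y_bar - a * x_bar

predicted = [a * x + b for x in time_on_site]
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(invoice_amount, predicted))  # unexplained
sst = sum((y - y_bar) ** 2 for y in invoice_amount)                         # total

r2 = 1 - sse / sst
# Adjusted R² penalizes the number of explanatory variables
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R² = {r2:.4f}, adjusted R² = {adjusted_r2:.4f}")
```

Adjusted R² is never larger than R², and the gap widens as explanatory variables are added without improving the fit, which is why the slide recommends it.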

Linear model: assumptions about residuals
A linear model is only reliable under certain conditions related to its residuals.

Linear model: assumptions about residuals
Independence: no autocorrelation. One measurement per individual.
Normality: residuals should follow a normal distribution.
Not too many outliers: in general, no more than 5% of outliers among residuals.
Homoscedasticity: residuals should have a homogeneous variance.

Graphical examination of the assumptions about residuals
Residuals vs explanatory variables chart: if the dots are homogeneously distributed around the y = 0 line, the model is reliable.

Assumptions about residuals: common patterns of violation
Violating the independence assumption (autocorrelated residuals): frequently occurs in time series implying periodicity. [Chart: normalized residuals vs. time]
Violating the homoscedasticity assumption (variance heterogeneity): frequently appears when variance is a function of the mean. [Chart: normalized residuals vs. age]

Assumptions about residuals: solutions when violated
Think about outliers (eliminate them?)
Transform y or x data (log, square root, Box-Cox...)
Use a more convenient model (non-linear, Poisson...)
Autocorrelation: use the Cochrane-Orcutt model (XLSTAT-Forecast)

Multiple linear regression
y = a*x1 + b*x2 + ...

Multiple linear regression - principles
Investigate the linear influence of several explanatory variables on the dependent variable; increase predictive quality.

Multiple linear regression - warnings
In addition to the assumptions about residuals: beware of overfitting & multicollinearity.

Multiple linear regression - warnings
Adding explanatory variables will increase the R².
Warning: do not add too many of them, to avoid obtaining models that are fitted too closely to your particular data and that will consequently be less generalizable.
The AIC model quality index strikes a compromise between a good fit to the data and a low number of parameters. AIC is a relative quality index that should only be used to compare models with each other. The model with the lowest AIC is the best model in the model set.
Warning: beware of redundant variables. Some correlated explanatory variables may hide each other's effects on the dependent variable. This is called multicollinearity (VIF index > 5). Examples: day temperature & night temperature; weight & height.
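To show where a VIF above 5 comes from, here is a minimal sketch for the special case of exactly two explanatory variables, where the VIF reduces to 1 / (1 - r²) with r the Pearson correlation between them. With more variables, r² is replaced by the R² of regressing one variable on all the others. The weight and height figures are made up.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

def vif_two_variables(x1, x2):
    """VIF when the model has exactly two explanatory variables."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

# Hypothetical, strongly correlated explanatory variables (made up)
weight_kg = [60, 72, 80, 55, 90, 68]
height_cm = [165, 175, 182, 160, 190, 172]

vif = vif_two_variables(weight_kg, height_cm)
print(f"VIF = {vif:.1f}", "(redundant variables)" if vif > 5 else "(acceptable)")
```

Uncorrelated variables give a VIF of 1; the closer the correlation gets to plus or minus 1, the faster the VIF grows.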

Linear modeling of invoice amount according to a set of variables
Multiple linear regression
Question: which variables (D-G columns) have the strongest linear influence on invoice amount? Can we predict the invoice amount of two new clients?

Linear modeling of invoice amount according to a set of variables
Multiple linear regression - XLSTAT

Linear modeling of invoice amount according to a set of variables
Multiple linear regression: examining multicollinearity
High VIF (>5): redundant variables
Solution: exclude one of these 2 variables and re-launch the model.

Linear modeling of invoice amount according to a set of variables
Multiple linear regression excluding height
Interpretation: weight has a significant positive effect on invoice amount.

Linear modeling of invoice amount according to a set of variables
Prediction

What statistical modeling method should you choose?
According to the type and number of dependent and explanatory variables, several solutions are available.
Link: choose an appropriate modeling tool according to your situation

Conclusion: let's get back to this question about height...
Different models to answer the same question
Somebody asks you: what is the height of French people? (Height of French people: dependent variable)
1. Their height has this average and that standard deviation: normal distribution model
2. It depends on geographic origin: one-way ANOVA
3. It depends linearly on age: simple linear regression
4. It depends linearly on age and origin: ANCOVA
5. It depends linearly on age and father's height: multiple linear regression
6. It depends on origin and gender: 2-way ANOVA
7. Etc.
[Legend: quantitative explanatory var. / qualitative explanatory var.]

In summary

Introduction to statistical modeling - summary
Statistical modeling allows us to:
    investigate how dependent variables evolve according to explanatory variables, using a mathematical equation that involves parameters;
    predict using this equation.
Linear models are reliable only under certain assumptions related to residuals: normality, homoscedasticity, absence of autocorrelation & not too many outliers.
Beware of problems related to the introduction of too many explanatory variables: overfitting & multicollinearity.
According to variable types, different models are available.

Thanks for attending!
All the tools we saw are available in all XLSTAT solutions (except XLSTAT-Free).
Survey time

Online recordings of our webinars are available until Dec. 16, 2016.

Appendix: alternative modeling tools
Tables with a high number of explanatory variables (> number of observations) and potentially strong multicollinearity: PLS regression
Supervised machine learning: KNN, Naïve Bayes, SVM (especially for prediction); decision trees

Appendix: residuals, alternative verification methods
Independence: run a Durbin-Watson test on std. residuals (XLSTAT-Forecast).
Normality: run a normality test on std. residuals.
Not too many outliers: check that no more than 5% of std. residuals exceed 1.96 in absolute value.
Homoscedasticity: run a heteroscedasticity test (Breusch-Pagan or White) on std. residuals.
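Two of these checks are simple enough to sketch by hand: the Durbin-Watson statistic and the share of standardized residuals beyond 1.96. The residual values below are made up, and the function names are chosen for the example. Turning the Durbin-Watson statistic into a formal test decision requires critical values, which this sketch does not cover.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: ranges from 0 to 4, with values near 2
    suggesting no first-order autocorrelation."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

def share_of_outliers(std_residuals, threshold=1.96):
    """Fraction of standardized residuals beyond the threshold in absolute value."""
    return sum(abs(e) > threshold for e in std_residuals) / len(std_residuals)

# Hypothetical residual series (made up for illustration)
trending = [-3, -2, -1, 1, 2, 3]      # neighbours look alike: positive autocorrelation
alternating = [1, -1, 1, -1, 1, -1]   # neighbours flip sign: negative autocorrelation

dw_trending = durbin_watson(trending)
dw_alternating = durbin_watson(alternating)
outlier_share = share_of_outliers([0.1, -2.5, 1.0, 0.3])

print(f"Durbin-Watson (trending) = {dw_trending:.2f}")
print(f"Durbin-Watson (alternating) = {dw_alternating:.2f}")
print(f"share of std. residuals beyond 1.96 = {outlier_share:.0%}")
```

A statistic well below 2 points towards positive autocorrelation and one well above 2 towards negative autocorrelation; an outlier share above 5% would flag the "not too many outliers" assumption.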