Course Introduction and Overview Descriptive Statistics Conceptualizations of Variance Review of the General Linear Model



EPSY 905: Multivariate Analysis
Lecture 1 (Introduction): 20 January 2016

Today's Class
- Course Introduction and Overview
- Descriptive Statistics
- Conceptualizations of Variance and Covariance
- Review of the General Linear Model

COURSE OVERVIEW

Guiding Principles for PRE 905 #1 of 3: Blocks
- #1. If you understand the building blocks of a model, you can build anything!

4 (or 5*) Model Building Blocks
1. Linear models (for effects of predictors)
2. Link functions (for anything not normal)
3a*. Random effects (for describing dependency)
3b*. Latent variables (for measurement models)
4. Estimation (e.g., Maximum Likelihood, Bayesian)
* These are really the same thing.


Principles #2 of 3: The Journey is Part of the Destination
- Not just blocks; not just a journey. In 905 you will learn:
  - Missing data (impute)
  - Path models
  - Mediation and moderation
  - Testing complex hypotheses involving observed variables
  - Bayesian
  - Likelihood based methods

Guiding Principles for PRE 905 #3 of 3: The Bridge
- A bridge between what you know now and advanced statistical methods

Motivation for Course Content
- The goal of this course is to provide you with a fundamental understanding of the underpinnings of the most commonly used contemporary statistical models
- The course is a combination of topics, picked to make your experience more extendable beyond coursework
- Some topics are math/statistics heavy
  - Mathematical statistics for the social sciences
- Upon completion of the course, you will be able to understand the commonalities that link methods

Course Structure (from the syllabus)
- Course format is all lecture based
  - Office hours held in labs
- Homework assignments
  - About one week to complete (Thursday-Thursday, usually)
  - Questions: data analysis, interpretation, some question-and-answer, some from the readings
  - Late penalty: 1 percent per calendar day
  - Revisions are allowed (minus any late penalties)
- Course Project
  - Three rounds of submissions
  - Iterative process

Lecture Format
- Mix of theory and examples with data and syntax
  - Software: R package

REVIEW: BASIC STATISTICAL TOPICS

Data for Today's Lecture
- To help demonstrate the concepts of today's lecture, we will be using a data set with three variables
  - Female (Gender): Male (=0) or Female (=1)
  - Height in inches
  - Weight in pounds
- The end point of our lecture will be to build a linear model that predicts a person's weight
  - Linear model: a statistical model for an outcome that uses a linear combination (a weighted sum; weighted by a slope) of one or more predictor variables

[Figure: Visualizing the Data]

Histograms of Height and Weight
- The weight variable seems to be bimodal: should that bother you? (Hint: it shouldn't yet)

Descriptive Statistics
- We can summarize each variable marginally through a set of descriptive statistics
  - Marginal: one variable by itself
- Common marginal descriptive statistics:
  - Central tendency: Mean, Median, Mode
  - Variability: Standard deviation (variance), range
- We can also summarize the joint (bivariate) distribution of two variables through a set of descriptive statistics:
  - Joint distribution: more than one variable simultaneously
- Common bivariate descriptive statistics:
  - Correlation and covariance

Descriptive Statistics for Height/Weight Data

Variable   Mean       SD         Variance
Height     67.9       7.44       [missing]
Weight     [missing]  [missing]  3,179.1
Female     [missing]  [missing]  [missing]

Correlation/Covariance matrix (diagonal: variance; above diagonal: covariance; below diagonal: correlation)

           Height     Weight     Female
Height     [missing]  [missing]  [missing]
Weight     .798       3,179.1    [missing]
Female     [missing]  [missing]  [missing]

Re-examining the Concept of Variance
- Variability is a central concept in advanced statistics
  - In multivariate statistics, covariance is also central
- Two formulas for the variance (about the same when N is large):
  - Unbiased or "sample": $S^2_{Y_1} = \frac{1}{N-1} \sum_{p=1}^{N} (Y_{1p} - \bar{Y}_1)^2$
  - Biased/ML or "population": $S^2_{Y_1} = \frac{1}{N} \sum_{p=1}^{N} (Y_{1p} - \bar{Y}_1)^2$
- Here: p = person; 1 = variable number one
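
The two divisors can be compared directly. The following is a minimal Python sketch using made-up weights (not the lecture data); it only illustrates that the two formulas differ by the factor (N - 1)/N.

```python
import statistics

# Hypothetical weights in pounds (illustrative only, not the lecture data)
weights = [120.0, 135.0, 150.0, 210.0, 235.0]

n = len(weights)
mean_w = sum(weights) / n
sum_sq = sum((y - mean_w) ** 2 for y in weights)  # sum of squared deviations

var_unbiased = sum_sq / (n - 1)  # "sample" variance: divides by N - 1
var_ml = sum_sq / n              # "population"/ML variance: divides by N

# The ratio is (N - 1) / N, which approaches 1 as N grows
ratio = var_ml / var_unbiased
print(var_unbiased, var_ml, ratio)
```

With only five observations the ratio is 0.8; with hundreds of observations the two versions are nearly indistinguishable.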

Interpretation of Variance
- The variance describes the spread of a variable in squared units (which come from the $(Y_{1p} - \bar{Y}_1)^2$ term in the equation)
- Variance: the average squared distance of an observation from the mean
  - Variance of Height: [missing] inches squared
  - Variance of Weight: 3,179.1 pounds squared
  - Variance of Female: not applicable in the same way!
- Because squared units are difficult to work with, we typically use the standard deviation, which is reported in the original units
- Standard deviation: roughly the average distance of an observation from the mean
  - SD of Height: 7.44 inches
  - SD of Weight: [missing] pounds

Variance/SD as a More General Statistical Concept
- Variance (and the standard deviation) is a concept that is applied across statistics, not just for data
  - Statistical parameters have variance
    - e.g., the sample mean $\bar{Y}_1$ has a standard error (SE) of $S_{\bar{Y}_1} = \frac{S_{Y_1}}{\sqrt{N}}$
- The standard error is another name for standard deviation
  - So "standard error of the mean" is equivalent to "standard deviation of the mean"
  - Usually "error" refers to parameters; "deviation" refers to data
  - Variance of the mean would be $S^2_{\bar{Y}_1} = \frac{S^2_{Y_1}}{N}$
- More generally, variance = error
  - You can think about the SE of the mean as telling you how far off the mean is for describing the data
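
As a quick check of the standard error of the mean, the sketch below uses two values reported later in the lecture for weight (variance 3,179.1 and a standard error of 12.61 for the mean). The sample size N = 20 is an assumption: it is what those two values imply, not something the slides state.

```python
import math

# Values reported on the slides for weight (empty model, Model 1)
variance_weight = 3179.1  # unbiased variance of weight
se_mean_weight = 12.61    # standard error of the mean of weight


def standard_error_of_mean(variance: float, n: int) -> float:
    """SE of the mean: S / sqrt(N), written here as sqrt(S^2 / N)."""
    return math.sqrt(variance / n)


# Solving SE = S / sqrt(N) for N gives the sample size the two values imply
implied_n = variance_weight / se_mean_weight ** 2

# Assumption: N = 20 (inferred from the values above, not stated in the lecture)
se_check = standard_error_of_mean(variance_weight, 20)
print(implied_n, se_check)
```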

Correlation of Variables
- Moving from marginal summaries of each variable to joint (bivariate) summaries, the Pearson correlation is often used to describe the association between a pair of variables:
  $r_{Y_1,Y_2} = \frac{\frac{1}{N-1} \sum_{p=1}^{N} (Y_{1p} - \bar{Y}_1)(Y_{2p} - \bar{Y}_2)}{S_{Y_1} S_{Y_2}}$
- The correlation is unitless, as it ranges from -1 to 1 for continuous variables, regardless of their variances
  - Pearson correlation of binary/categorical variables with continuous variables is called a point-biserial (same formula)
  - Pearson correlation of binary/categorical variables with other binary/categorical variables has attainable bounds that lie within -1 and 1

More on the Correlation Coefficient
- The Pearson correlation is a biased estimator
  - Biased estimator: the expected value of the statistic differs from the true value
    - Other biased estimators: Variance/SD when $\frac{1}{N}$ is used
- The unbiased correlation estimate would be:
  $r^{U}_{Y_1,Y_2} = r_{Y_1,Y_2} \left(1 + \frac{1 - r^2_{Y_1,Y_2}}{2N}\right)$
  - As N gets large, the bias goes away; by this formula the adjustment is zero at $r_{Y_1,Y_2} = 0$ and at $|r_{Y_1,Y_2}| = 1$, and largest near $|r_{Y_1,Y_2}| \approx .58$
  - Pearson is an underestimate (in absolute value) of the true correlation
- If it is biased, then why does everyone use it anyway?
  - Answer: forthcoming when we talk about (ML) estimation
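
To see how small the bias adjustment usually is, here is a minimal Python sketch. It assumes the common large-sample adjustment r(1 + (1 - r^2)/(2N)); the function name `adjusted_r` and the sample size N = 20 used in the example are illustrative assumptions, while r = .798 is the height-weight correlation from the descriptive statistics.

```python
def adjusted_r(r: float, n: int) -> float:
    """Approximate bias-adjusted correlation: r * (1 + (1 - r^2) / (2N))."""
    return r * (1 + (1 - r ** 2) / (2 * n))


# r = .798 is the height-weight correlation; N = 20 is an assumed sample size
r_height_weight = 0.798
r_adj_small_n = adjusted_r(r_height_weight, 20)
r_adj_large_n = adjusted_r(r_height_weight, 2000)

# The adjustment shrinks toward zero as N grows
print(r_adj_small_n - r_height_weight, r_adj_large_n - r_height_weight)
```

The adjustment moves the estimate away from zero, and it vanishes when r is exactly 0 or 1.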

Covariance of Variables: Association with Units
- The numerator of the correlation coefficient is the covariance of a pair of variables:
  - Unbiased or "sample": $S_{Y_1,Y_2} = \frac{1}{N-1} \sum_{p=1}^{N} (Y_{1p} - \bar{Y}_1)(Y_{2p} - \bar{Y}_2)$
  - Biased/ML or "population": $S_{Y_1,Y_2} = \frac{1}{N} \sum_{p=1}^{N} (Y_{1p} - \bar{Y}_1)(Y_{2p} - \bar{Y}_2)$
- The covariance uses the units of the original variables (but now they are multiples):
  - Covariance of height and weight: [missing] inch-pounds
- The covariance of a variable with itself is the variance
- The covariance is often used in multivariate analyses because it ties directly into multivariate distributions
  - But covariance and correlation are easy to switch between

Going from Covariance to Correlation
- If you have the covariance matrix (variances and covariances):
  $r_{Y_1,Y_2} = \frac{S_{Y_1,Y_2}}{S_{Y_1} S_{Y_2}}$
- If you have the correlation matrix and the standard deviations:
  $S_{Y_1,Y_2} = r_{Y_1,Y_2} S_{Y_1} S_{Y_2}$
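
The round trip between covariance and correlation can be sketched in a few lines of Python. The heights and weights below are made up for illustration (not the lecture data), and `covariance` is a small helper defined here rather than a library call.

```python
import math

# Hypothetical data (illustrative only, not the lecture data)
heights = [62.0, 65.0, 68.0, 71.0, 74.0]       # inches
weights = [120.0, 135.0, 150.0, 210.0, 235.0]  # pounds


def covariance(x, y):
    """Unbiased (N - 1) covariance of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


var_h = covariance(heights, heights)   # covariance of a variable with itself
var_w = covariance(weights, weights)
cov_hw = covariance(heights, weights)  # in inch-pounds

sd_h, sd_w = math.sqrt(var_h), math.sqrt(var_w)
r_hw = cov_hw / (sd_h * sd_w)          # covariance -> correlation (unitless)
cov_back = r_hw * sd_h * sd_w          # correlation -> covariance
print(cov_hw, r_hw, cov_back)
```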

THE GENERAL LINEAR MODEL

The General Linear Model
- The general linear model incorporates many different labels of analyses under one unifying umbrella:

                    Categorical X's   Continuous X's            Both Types of X's
  Univariate Y      ANOVA             Regression                ANCOVA
  Multivariate Y's  MANOVA            Multivariate Regression   MANCOVA

- The typical assumption is that error is normally distributed, meaning that the data are conditionally normally distributed
- Models for non-normal outcomes (e.g., dichotomous, categorical, count) fall under the Generalized Linear Model, of which the GLM is a special case (i.e., for when model residuals can be assumed to be normally distributed)

General Linear Models: Conditional Normality
  $Y_p = \beta_0 + \beta_1 X_p + \beta_2 Z_p + \beta_3 X_p Z_p + e_p$
- Model for the Means (Predicted Values):
  - Each person's expected (predicted) outcome is a function of his/her values on x and z (and their interaction)
  - y, x, and z are each measured only once per person (p subscript)
- Model for the Variance:
  - $e_p \sim N(0, \sigma_e^2)$: ONE residual (unexplained) deviation
  - $e_p$ has a mean of 0 with some estimated constant variance $\sigma_e^2$, is normally distributed, is unrelated to x and z, and is unrelated across people (across all observations, just people here)
- We will return to the normal distribution in a few weeks, but for now know that it is described by two terms: a mean and a variance

Building a Linear Model for Predicting a Person's Weight
- We will now build a linear model for predicting a person's weight, using height and gender as predictors
- Several models we will build are done for didactic reasons, to show how regression and ANOVA work under the GLM
  - You wouldn't necessarily run these models in this sequence
- Our beginning model is that of an empty model: no predictors for weight (an unconditional model)
- Our ending model is one with both predictors and their interaction (a conditional model)

Model 1: The Empty Model
- Linear model: $Weight_p = \beta_0 + e_p$ where $e_p \sim N(0, \sigma_e^2)$
- Estimated Parameters: [ESTIMATE (STANDARD ERROR)]
  - $\beta_0$ = [missing] (12.61)
    - Overall intercept: the grand mean of weight across all people
    - Just the mean of weight
    - SE for $\beta_0$ is the standard error of the mean for weight: $\frac{S_{Weight}}{\sqrt{N}}$
  - $\sigma_e^2$ = 3,179.1 (SE not given)
    - The (unbiased) variance of weight:
      $e_p = Weight_p - \beta_0 = Weight_p - \overline{Weight}$
      $S_e^2 = \frac{1}{N-1} \sum_{p=1}^{N} (Weight_p - \overline{Weight})^2$
    - From Mean Square Error of F-table
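
The empty model can be reproduced by hand: its intercept is the sample mean and its residual variance is the ordinary sample variance. A minimal Python sketch with made-up weights (not the lecture data):

```python
import math
import statistics

# Hypothetical weights in pounds (illustrative only, not the lecture data)
weights = [120.0, 135.0, 150.0, 210.0, 235.0]
n = len(weights)

beta0 = sum(weights) / n                  # least-squares intercept = grand mean
residuals = [w - beta0 for w in weights]  # e_p = Weight_p - beta0

# One parameter (the intercept) was estimated, so divide by N - 1
residual_variance = sum(e ** 2 for e in residuals) / (n - 1)

# SE of the intercept is the standard error of the mean
se_beta0 = math.sqrt(residual_variance / n)
print(beta0, residual_variance, se_beta0)
```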

Model 2: Predicting Weight from Height ("Regression")
- Linear model: $Weight_p = \beta_0 + \beta_1 Height_p + e_p$ where $e_p \sim N(0, \sigma_e^2)$
- Estimated Parameters: [ESTIMATE (STANDARD ERROR)]
  - $\beta_0$ = [missing] (73.483)
    - Predicted value of Weight for a person with Height = 0
    - Nonsensical, but we could have centered Height
  - $\beta_1$ = 6.048 (1.076)
    - Change in predicted value of Weight for every one-unit increase in height (weight goes up 6.048 pounds per inch)
  - $\sigma_e^2$ = 1,218 (SE not given)
    - The residual variance of weight
    - Height explains $\frac{3,179.1 - 1,218}{3,179.1} = 61.7\%$ of variance of weight

Model 2a: Predicting Weight from Mean-Centered Height
- Linear model: $W_p = \beta_0 + \beta_1 (H_p - \bar{H}) + e_p$ where $e_p \sim N(0, \sigma_e^2)$
- Estimated Parameters: [ESTIMATE (STANDARD ERROR)]
  - $\beta_0$ = [missing] (7.804)
    - Predicted value of Weight for a person with Height = Mean Height
    - Is the Mean Weight (regression line goes through means)
  - $\beta_1$ = 6.048 (1.076)
    - Change in predicted value of Weight for every one-unit increase in height (weight goes up 6.048 pounds per inch)
    - Same as previous
  - $\sigma_e^2$ = 1,218 (SE not given)
    - The residual variance of weight
    - Height explains $\frac{3,179.1 - 1,218}{3,179.1} = 61.7\%$ of variance of weight
    - Same as previous
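
The effect of mean-centering can be checked numerically: the slope is unchanged and the intercept becomes the mean of the outcome. The sketch below uses made-up data (not the lecture data) and a hand-written helper, `ols_simple`, that applies the usual one-predictor least-squares formulas.

```python
def ols_simple(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data (illustrative only, not the lecture data)
heights = [62.0, 65.0, 68.0, 71.0, 74.0]
weights = [120.0, 135.0, 150.0, 210.0, 235.0]

b0_raw, b1_raw = ols_simple(heights, weights)    # intercept at Height = 0

mean_h = sum(heights) / len(heights)
centered = [h - mean_h for h in heights]
b0_ctr, b1_ctr = ols_simple(centered, weights)   # intercept at mean height

print(b0_raw, b1_raw, b0_ctr, b1_ctr)
```

In this toy data the raw intercept is negative (a "weight" at zero inches of height), while the centered intercept is simply the mean weight.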

[Figure: Plotting Model 2a]

Hypothesis Tests for Parameters
- To determine if the regression slope is significantly different from zero, we must use a hypothesis test:
  $H_0: \beta_1 = 0$
  $H_1: \beta_1 \neq 0$
- We have two options for this test (both are the same in this case)
  - Use ANOVA table: sums of squares F-test
  - Use Wald test for parameter: $t = \frac{\beta_1}{SE(\beta_1)}$
  - Here $t^2 = F$
- Wald test: $t = \frac{\beta_1}{SE(\beta_1)} = \frac{6.048}{1.076} = 5.62$; $p < .001$
- Conclusion: reject null ($H_0$); slope is significant
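
The Wald statistic is just the estimate divided by its standard error. The sketch below takes the height slope as 6.048 and its standard error as 1.076; the standard error is the one reported for Model 2, while the estimate is read from the Wald ratio and should be treated as an assumption.

```python
# Height slope and its standard error (Model 2); the estimate is an assumption
# read from the Wald ratio, the SE is the value reported for the slope
estimate = 6.048
std_error = 1.076

t_stat = estimate / std_error  # Wald test statistic
f_stat = t_stat ** 2           # with one numerator df, F equals t squared

print(round(t_stat, 2))  # 5.62
print(f_stat)
```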

Model 3: Predicting Weight from Gender ("ANOVA")
- Linear model: $Weight_p = \beta_0 + \beta_2 Female_p + e_p$ where $e_p \sim N(0, \sigma_e^2)$
- Note: because gender is a categorical predictor, we must first code it into a number before entering it into the model (typically done automatically in software)
  - Here we use Female = 1 for females; Female = 0 for males
- Estimated Parameters: [ESTIMATE (STANDARD ERROR)]
  - $\beta_0$ = [missing] (5.415)
    - Predicted value of Weight for a person with Female = 0 (males)
    - Mean weight of males
  - $\beta_2$ = [missing]
    - $t = \frac{105}{7.658} = 13.71$; $p < .001$
    - Change in predicted value of Weight for every one-unit increase in female
    - In this case, the difference between the mean for males and the mean for females
  - $\sigma_e^2$ = 293 (SE not given)
    - The residual variance of weight
    - Gender explains $\frac{3,179.1 - 293}{3,179.1} = 90.8\%$ of variance of weight

Model 3: More on Categorical Predictors
- Gender was coded using what is called reference or dummy coding:
  - Intercept becomes mean of the reference group (the 0 group)
  - Slopes become the difference in the means between reference and non-reference groups
  - For C categories, C-1 predictors are created
- All coding choices can be recovered from the model:
  - Predicted Weight for Females (mean weight for females): $\hat{W}_p = \beta_0 + \beta_2$ = [missing]
  - Predicted Weight for Males: $\hat{W}_p = \beta_0$ = [missing]
- What would $\beta_0$ and $\beta_2$ be if we coded Male = 1?
  - Super cool idea: what if you could do this in software all at once?
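
The question about recoding can be answered with a small experiment. The Python sketch below uses made-up weights for two groups (not the lecture data) and a hand-written least-squares helper, `ols_simple`; it shows that switching the reference group swaps which mean the intercept equals and flips the sign of the slope.

```python
def ols_simple(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical weights in pounds (illustrative only, not the lecture data)
males = [200.0, 215.0, 230.0]
females = [120.0, 130.0, 140.0]
weights = males + females

female = [0] * len(males) + [1] * len(females)  # reference group: males
male = [1 - f for f in female]                  # reference group: females

b0_f, b2_f = ols_simple(female, weights)  # intercept = male mean
b0_m, b2_m = ols_simple(male, weights)    # intercept = female mean

print(b0_f, b2_f)  # male mean, female minus male
print(b0_m, b2_m)  # female mean, male minus female
```

Either coding yields the same two group means; only the labeling of the intercept and the sign of the difference change.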

[Figure: Model 3: Predictions and Plots]

Model 4: Predicting Weight from Height and Gender (w/o Interaction); ("ANCOVA")
- Linear model: $W_p = \beta_0 + \beta_1 (H_p - \bar{H}) + \beta_2 F_p + e_p$ where $e_p \sim N(0, \sigma_e^2)$
- Estimated Parameters: [ESTIMATE (STANDARD ERROR)]
  - $\beta_0$ = [missing] (1.439)
    - Predicted value of Weight for a person with Female = 0 (males) and Height = Mean Height ($H_p - \bar{H} = 0$)
  - $\beta_1$ = [missing]
    - $t = \frac{2.708}{0.155} = 17.52$; $p < .001$
    - Change in predicted value of Weight for every one-unit increase in height (holding gender constant)
  - $\beta_2$ = [missing]
    - $t = \frac{81.712}{2.241} = 36.46$; $p < .001$
    - Change in predicted value of Weight for every one-unit increase in female (holding height constant)
    - In this case, the difference between the mean for males and the mean for females, holding height constant
  - $\sigma_e^2$ = 16 (SE not given)
    - The residual variance of weight

Model 4: By-Gender Regression Lines
- Model 4 assumes identical regression slopes for both genders but has different intercepts
  - This assumption is tested statistically by model 5
- Predicted Weight for Females ($F_p = 1$):
  $\hat{W}_p = \beta_0 + \beta_1 (H_p - \bar{H}) + \beta_2 F_p = (\beta_0 + \beta_2) + \beta_1 (H_p - \bar{H})$
- Predicted Weight for Males ($F_p = 0$):
  $\hat{W}_p = \beta_0 + \beta_1 (H_p - \bar{H}) + \beta_2 F_p = \beta_0 + \beta_1 (H_p - \bar{H})$

[Figure: Model 4: Predicted Value Regression Lines]

Model 5: Predicting Weight from Height and Gender (with Interaction); ("ANCOVA-ish")
- Linear model: $W_p = \beta_0 + \beta_1 (H_p - \bar{H}) + \beta_2 F_p + \beta_3 (H_p - \bar{H}) F_p + e_p$ where $e_p \sim N(0, \sigma_e^2)$
- Estimated Parameters: [ESTIMATE (STANDARD ERROR)]
  - $\beta_0$ = [missing] (0.838)
    - Predicted value of Weight for a person with Female = 0 (males) and Height = Mean Height ($H_p - \bar{H} = 0$)
  - $\beta_1$ = [missing]
    - $t = \frac{3.190}{0.111} = 28.65$; $p < .001$
    - Simple main effect of height: change in predicted value of Weight for every one-unit increase in height (for males only)
    - A conditional main effect: when interacting variable (gender) = 0

Model 5: Estimated Parameters
- Estimated Parameters:
  - $\beta_2$ = [missing] (1.211)
    - $t = \frac{82.272}{1.211} = 67.93$; $p < .001$
    - Simple main effect of gender: change in predicted value of Weight for every one-unit increase in female, for height = mean height
    - Gender difference at 67.9 inches
  - $\beta_3$ = [missing]
    - $t = \frac{1.094}{0.168} = 6.52$; $p < .001$
    - Gender-by-Height Interaction: additional change in predicted value of weight for change in either gender or height
    - Difference in slope for height for females vs. males
    - Because Female = 1, it modifies the slope for height for females (here the height slope is less positive for females than for males)
  - $\sigma_e^2$ = 5 (SE not given)

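Model 5: By-Gender Regression Lines
- Model 5 does not assume identical regression slopes for both genders
  - Because $\beta_3$ was significantly different from zero, the data support different slopes for the genders
- Predicted Weight for Females ($F_p = 1$):
  $\hat{W}_p = \beta_0 + \beta_1 (H_p - \bar{H}) + \beta_2 F_p + \beta_3 (H_p - \bar{H}) F_p = (\beta_0 + \beta_2) + (\beta_1 + \beta_3)(H_p - \bar{H})$
- Predicted Weight for Males ($F_p = 0$):
  $\hat{W}_p = \beta_0 + \beta_1 (H_p - \bar{H}) + \beta_2 F_p + \beta_3 (H_p - \bar{H}) F_p = \beta_0 + \beta_1 (H_p - \bar{H})$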
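
The by-gender lines follow mechanically from the interaction model. The Python sketch below takes the height slope (3.190) and the interaction (1.094) from the Wald ratios for Model 5 and assumes the interaction is negative, as the "less positive for females" remark implies. The intercept and gender effect in the last lines are hypothetical placeholders used only to exercise the prediction function.

```python
def simple_slope(b1: float, b3: float, female: int) -> float:
    """Height slope within one gender: b1 for males (0), b1 + b3 for females (1)."""
    return b1 + b3 * female


def predicted_weight(b0, b1, b2, b3, height_centered, female):
    """Model 5 prediction: intercept, height, gender, and their interaction."""
    return (b0 + b1 * height_centered + b2 * female
            + b3 * height_centered * female)


B1 = 3.190   # height slope for males, read from the Wald ratio
B3 = -1.094  # interaction magnitude from the Wald ratio; negative sign assumed

male_slope = simple_slope(B1, B3, female=0)
female_slope = simple_slope(B1, B3, female=1)

# Hypothetical intercept and gender effect (placeholders, not lecture values)
B0, B2 = 200.0, -50.0
gain_per_inch_female = (predicted_weight(B0, B1, B2, B3, 1.0, 1)
                        - predicted_weight(B0, B1, B2, B3, 0.0, 1))
print(male_slope, female_slope, gain_per_inch_female)
```

Under these assumptions, a one-inch increase in height changes predicted weight by about 3.19 pounds for males and about 2.10 pounds for females.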

[Figure: Model 5: Predicted Value Regression Lines]

Comparing Across Models
- Typically, the empty model and model #5 would be the only models run
  - The trick is to describe the impact of all and each of the predictors, typically using variance accounted for (explained)
- All predictors:
  - Baseline: empty model #1; $\sigma_e^2$ = 3,179.095
  - Comparison: model #5; $\sigma_e^2$ = 4.731
  - All predictors (gender, height, interaction) explained $\frac{3,179.095 - 4.731}{3,179.095} = 99.9\%$ of variance in weight
    - $R^2$ hall of fame worthy

Comparing Across Models (continued)
- The total effect of height (main effect and interaction):
  - Baseline: model #3 (gender only); $\sigma_e^2$ = 293.211
  - Comparison: model #5 (all predictors); $\sigma_e^2$ = 4.731
  - Height explained $\frac{293.211 - 4.731}{293.211} = 98.4\%$ of variance in weight remaining after gender
    - 98.4% of the 100% - 90.8% = 9.2% left after gender
    - True variance accounted for is 98.4% * 9.2% = 9.1%
- The total effect of gender (main effect and interaction):
  - Baseline: model #2a (height only); $\sigma_e^2$ = 1,217.973
  - Comparison: model #5 (all predictors); $\sigma_e^2$ = 4.731
  - Gender explained $\frac{1,217.973 - 4.731}{1,217.973} = 99.6\%$ of variance in weight remaining after height
    - 99.6% of the 100% - 61.7% = 38.3% left after height
    - True variance accounted for is 99.6% * 38.3% = 38.1%
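
These comparisons all use the same calculation: the proportional drop in residual variance from a baseline model to a comparison model. A minimal Python sketch follows; the residual variances are read from the fractions on the slides and should be treated as assumptions, and `proportion_explained` is an illustrative helper name.

```python
def proportion_explained(baseline_var: float, comparison_var: float) -> float:
    """Share of the baseline residual variance removed by the comparison model."""
    return (baseline_var - comparison_var) / baseline_var


# Residual variances as read from the slides (treat as assumptions)
EMPTY = 3179.095        # model 1: no predictors
HEIGHT_ONLY = 1217.973  # model 2a
GENDER_ONLY = 293.211   # model 3
FULL = 4.731            # model 5: height, gender, and interaction

all_predictors = proportion_explained(EMPTY, FULL)
height_after_gender = proportion_explained(GENDER_ONLY, FULL)
gender_after_height = proportion_explained(HEIGHT_ONLY, FULL)

# Shares of the total variance, computed without intermediate rounding
height_share_of_total = (GENDER_ONLY - FULL) / EMPTY
gender_share_of_total = (HEIGHT_ONLY - FULL) / EMPTY

for value in (all_predictors, height_after_gender, gender_after_height,
              height_share_of_total, gender_share_of_total):
    print(f"{100 * value:.1f}%")
```

Because the slides multiply rounded percentages, the unrounded shares of total variance can differ from them in the first decimal place.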


About Weight
- The distribution of weight was bimodal (shown in the beginning of the class)
  - However, the analysis only called for the residuals to be normally distributed, not the actual data
  - This is the same as saying the conditional distribution of the data given the predictors must be normal
- Residual:
  $e_p = Weight_p - \widehat{Weight}_p = Weight_p - \left[\beta_0 + \beta_1 (H_p - \bar{H}) + \beta_2 F_p + \beta_3 (H_p - \bar{H}) F_p\right]$

CONCLUDING REMARKS

Wrapping Up
- The general linear model forms the basis for many multivariate statistical techniques
  - Certain features of the model change, but many of the same interpretations remain
- We will continue to use these terms in more advanced models throughout the rest of the semester
  - Extra practice for linear model terms
- The trick of linear models is to construct one model that answers all of your research questions
