Nemours Biomedical Research Biostatistics Core Statistics Course Session 4 Li Xie March 4, 2015
Outline
Recap: Pairwise analysis, with the example of the two-sample unpaired t-test
Today: More on t-tests; Introduction to correlation
Describing two variables, numerically
Possible scenarios:
1. Continuous and categorical (e.g., age by gender): stratum-specific descriptive statistics of the continuous variable (i.e., mean (SD) age among females/males). Session 3
2. Continuous and continuous (e.g., BMI by age): Pearson product-moment correlation coefficient (r), Spearman's rank correlation coefficient (rho). Today
3. Categorical and categorical (e.g., race by gender): odds ratio, etc. Session 5
Revisiting boxplot and density plot
Hypothesis test
The quality of inferences based on any statistic (parametric or nonparametric) is influenced by how well the data meet the assumptions of that statistic.
All statistics have assumptions.
Parametric statistics assume the data come from a population that follows a known probability distribution.
Hypothesis Test
By convention, the null hypothesis (H0) states no difference. E.g.: caffeine intake does not differ by blood type (A, B, O, AB).
Alternative simple hypotheses take 1 of 3 forms. E.g.:
A have higher intake than non-A (one-sided)
A have lower intake than non-A (one-sided)
A have different intake than non-A (two-sided)
Alternative composite hypothesis, an example: A have higher intake than O but lower intake than AB.
P-value: the probability, assuming H0 is true, of obtaining a test statistic at least as extreme as the one observed in the data set at hand.
Significance level: the probability threshold below which a p-value leads to rejection of H0.
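To make the p-value definition concrete, here is a minimal sketch using an exact permutation test on the difference in means. This is a swapped-in technique chosen because it needs no distribution tables; it is not the t-test discussed in this session. The function name `permutation_p_values` and the intake values are hypothetical and made up for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_values(group_a, group_b):
    """Exact permutation p-values for the difference in means (A minus B).

    Under H0 (no difference), every way of splitting the pooled data into
    groups of the original sizes is equally likely, so each p-value is the
    fraction of splits whose statistic is at least as extreme as the
    observed one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = mean(group_a) - mean(group_b)

    indices = range(len(pooled))
    diffs = []
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        diffs.append(mean(a) - mean(b))

    tol = 1e-12  # guards against floating-point ties
    n = len(diffs)
    p_higher = sum(d >= observed - tol for d in diffs) / n   # one-sided: A higher
    p_lower = sum(d <= observed + tol for d in diffs) / n    # one-sided: A lower
    p_two_sided = sum(abs(d) >= abs(observed) - tol for d in diffs) / n
    return p_higher, p_lower, p_two_sided


# Hypothetical caffeine intakes, invented for illustration only
type_a = [5.0, 6.0, 7.0]
non_a = [1.0, 2.0, 3.0]
p_higher, p_lower, p_two_sided = permutation_p_values(type_a, non_a)
print(p_higher, p_lower, p_two_sided)
```

The three returned values correspond to the three simple alternative hypotheses listed above; which one is reported depends on the alternative chosen before looking at the data.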
Hypothesis Test - Procedure
[Diagram labels: empirical distribution, test statistic, theoretical distribution, theoretical result, empirical inference. Formula labels: pooled standard deviation, variance of group 1.]
Conceptualization of the t test statistic
$t = \dfrac{\text{group 1 mean} - \text{group 2 mean}}{\text{combined variability of both groups}}$
Numerator: where most data points in group 1 are, minus where most data points in group 2 are.
Denominator: overall, how far the data points in both groups spread.
Assumptions: continuous variable, simple random sample, distribution of data has no major departure from normality.
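The conceptual ratio above can be sketched in code. This is a minimal illustration assuming the equal-variance (pooled standard deviation) form of the two-sample unpaired t statistic; the function name `pooled_t_statistic` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(group1, group2):
    """Two-sample unpaired t statistic using a pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    # Sample variances (n - 1 in the denominator)
    var1, var2 = variance(group1), variance(group2)
    # Pooled SD: weighted combination of the two group variances
    pooled_sd = sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    # Standard error of the difference between the two group means
    standard_error = pooled_sd * sqrt(1 / n1 + 1 / n2)
    return (mean(group1) - mean(group2)) / standard_error


# Made-up measurements for illustration only
group1 = [2.0, 4.0, 6.0]
group2 = [1.0, 2.0, 3.0]
t = pooled_t_statistic(group1, group2)
print(t)
```

The numerator is the difference in group means and the denominator scales that difference by the combined variability, matching the two parts of the ratio.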
The Importance of Standard Deviation
In all three cases, the difference between the population means is the same. With large variability of the data around their respective means (left), the observed difference between the two groups may well have arisen by chance. With small variability (right), the difference is estimated more precisely. For a given difference in means, the smaller the variability, the larger the magnitude of the t-value and therefore the smaller the p-value.
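A small numeric sketch of this point, assuming the pooled-variance form of the t statistic. The two hypothetical data sets below have the same difference in means (5 units) but very different spreads; all names and values are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def t_stat(g1, g2):
    """Pooled-variance two-sample t statistic (equal variances assumed)."""
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * variance(g1) + (n2 - 1) * variance(g2)) / (n1 + n2 - 2)
    return (mean(g1) - mean(g2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))


# Same mean difference (10 vs 15), large spread around each mean
wide_1, wide_2 = [0.0, 10.0, 20.0], [5.0, 15.0, 25.0]
# Same mean difference (10 vs 15), small spread around each mean
narrow_1, narrow_2 = [9.0, 10.0, 11.0], [14.0, 15.0, 16.0]

t_wide = t_stat(wide_1, wide_2)
t_narrow = t_stat(narrow_1, narrow_2)
print(abs(t_wide), abs(t_narrow))
```

Because only the denominator changes between the two cases, the tighter data give a t-value that is larger in magnitude.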
Relationship between 2 quantitative variables
A scatterplot carries 3 types of information about the relationship between 2 quantitative variables:
1. Linearity of relationship
2. Strength of relationship
3. Direction of relationship
Alternatively (to a scatterplot), such information could be conveyed numerically by simple correlation coefficients.
Correlation
Correlation is a measure of the quantitative relationship between variables.
Calculating a statistical correlation does NOT require any scientific basis for a relationship between X and Y.
Some simple, popular correlation coefficients:
Pearson product-moment correlation coefficient
Spearman's correlation coefficient
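As a rough sketch of how the two coefficients relate: Spearman's coefficient can be computed as Pearson's coefficient applied to the ranks of the data instead of the raw values. The helper names (`pearson`, `average_ranks`, `spearman`) and the data are hypothetical and only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up data: y increases with x, but not linearly
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
print(pearson(x, y), spearman(x, y))
```

In this example the relationship is perfectly monotone but curved, so the rank-based coefficient reaches 1 while the Pearson coefficient stays below 1.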
Pearson's Correlation Coefficient
A unitless measure of the LINEAR correlation between two variables X and Y, with $-1 \le \text{Pearson's corr} \le 1$.
Interpretation:
1: total positive linear correlation ("direct correlation")
0: no linear correlation
-1: total negative linear correlation ("inverse correlation")
$\text{Pearson's correlation} = \dfrac{\operatorname{cov}(x, y)}{\sigma_X \, \sigma_Y} = \dfrac{\text{covariance of X and Y}}{(\text{standard deviation of X})\,(\text{standard deviation of Y})}$
Conceptually: how x changes as y changes, divided by (variability of x) times (variability of y).
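The covariance-over-standard-deviations formula can be sketched directly in code. This minimal example uses sample (n - 1) estimates throughout; the function names and the age/BMI values are hypothetical, invented for illustration.

```python
from math import sqrt
from statistics import mean


def sample_covariance(x, y):
    """Sample covariance: how x and y vary together (n - 1 denominator)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def sample_sd(x):
    """Sample standard deviation, i.e. sqrt of the covariance of x with itself."""
    return sqrt(sample_covariance(x, x))


def pearson_r(x, y):
    """Pearson's correlation: cov(x, y) / (sd of x * sd of y)."""
    return sample_covariance(x, y) / (sample_sd(x) * sample_sd(y))


# Made-up values for illustration only
age_years = [20.0, 30.0, 40.0, 50.0, 60.0]
bmi = [21.0, 23.5, 24.0, 27.5, 28.0]

r_years = pearson_r(age_years, bmi)
# Changing the unit of age (years to months) leaves r unchanged: r is unitless
r_months = pearson_r([a * 12 for a in age_years], bmi)
print(r_years, r_months)
```

Rescaling either variable changes the covariance and the standard deviation by the same factor, which is why the ratio stays the same.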
Visualization
[Figure: example plots; Pearson's correlation = ?]
Pearson's Correlation in Excel
Then hit Enter
Assumptions of Pearson's Correlation
X and Y are bivariate normal
A reasonably linear relationship exists