Introduction to Statistics and R


Mayo-Illinois Computational Genomics Workshop (2018)

Ruoqing Zhu, Ph.D.
Department of Statistics, UIUC
rqzhu@illinois.edu

June 18, 2018

Abstract

This document is a supplementary file for the lecture slides. The .pdf file is created using R Markdown from a .Rmd source file and compiled through RStudio. Note that this will require installing several additional packages.

1 Using R and RStudio

Most of the basic R mathematical and statistical functions are self-explanatory. Note that lines with a # in front are comments, which will not be executed. Lines with ## in front are outputs.

    # Basic mathematical operations
    2 + 2
    ## [1] 4
    3*5
    ## [1] 15
    3^5
    ## [1] 243
    exp(2)
    ## [1] 7.389056
    log(3)
    ## [1] 1.098612
    log2(3)
    ## [1] 1.584963
    factorial(5)
    ## [1] 120

Data objects can be a complicated topic for people who have never used R before. The most common data objects are vector, matrix, list, and data.frame. Operations on vectors and matrices are fairly intuitive.

    # creating a vector
    c(1,2,3,4)
    ## [1] 1 2 3 4
    c("a", "b", "c")
    ## [1] "a" "b" "c"
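The other two common objects, list and data.frame, work in a similar way; here is a minimal sketch before we move on to the matrix examples below (the object names mylist and mydat are just for illustration).

    # a list can hold elements of different types and lengths
    mylist = list(num = c(1, 2, 3), char = c("a", "b"))
    mylist$num
    ## [1] 1 2 3

    # a data.frame is a table whose columns may have different types
    mydat = data.frame(id = c(1, 2, 3), group = c("a", "a", "b"))
    dim(mydat)
    ## [1] 3 2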

    # creating a matrix from a vector
    matrix(c(1,2,3,4), 2, 2)
    ##      [,1] [,2]
    ## [1,]    1    3
    ## [2,]    2    4

    x = c(1,1,1,0,0,0); y = c(1,0,1,0,1,0)
    cbind(x,y)
    ##      x y
    ## [1,] 1 1
    ## [2,] 1 0
    ## [3,] 1 1
    ## [4,] 0 0
    ## [5,] 0 1
    ## [6,] 0 0

    # matrix multiplication using '%*%'
    matrix(c(1,2,3,4), 2, 2) %*% t(cbind(x,y))
    ##      [,1] [,2] [,3] [,4] [,5] [,6]
    ## [1,]    4    1    4    0    3    0
    ## [2,]    6    2    6    0    4    0

Simple mathematical operations on vectors and matrices are usually element-wise. You can easily extract certain elements using the [] operator, similar to C-style array indexing.

    # some simple operations
    x[3]
    ## [1] 1
    x[2:5]
    ## [1] 1 1 0 0
    cbind(x,y)[1:2, ]
    ##      x y
    ## [1,] 1 1
    ## [2,] 1 0
    (x + y)^2
    ## [1] 4 1 4 0 1 0
    length(x)
    ## [1] 6
    dim(cbind(x,y))
    ## [1] 6 2

    # A warning will be issued when R detects something wrong. Results may still be produced.
    x + c(1,2,3,4)
    ## Warning in x + c(1, 2, 3, 4): longer object length is not a multiple of shorter object length
    ## [1] 2 3 4 4 1 2
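Elements can also be extracted with a logical vector inside []; a small illustration using the same x and y as above:

    # element-wise comparison returns a logical vector
    x == y
    ## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE

    # logical indexing: keep the elements of x where y equals 1
    x[y == 1]
    ## [1] 1 1 0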

If you want to see more information about a particular function or operator in R, the easiest way is to open its reference document. Put a question mark in front of the function name:

    # In a default R console window, this will open up a web browser.
    # In RStudio, this will be displayed in the Help window at the bottom-right panel.
    ?log2
    ?matrix

2 Basic Statistical Concepts

Now, let's start to work on some real datasets. Load the Prostate Cancer Data (prostate) from the ElemStatLearn package. R packages are collections of functions and data sets. They are developed and shared by the community and extend the existing base R functionality. ElemStatLearn is a package developed for the textbook The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The prostate dataset comes from a study by Stamey et al. (1989) that examined the prostate-specific antigen (PSA) and several clinical measures in 97 men. If you have never installed this package before, type the following code in your R console window.

    install.packages("ElemStatLearn")

Now, let's load this package and have a look at the dataset.

    library(ElemStatLearn)
    data(prostate)

    # we will remove the last column since it's not used in this analysis
    prostate = prostate[, -10]
    dim(prostate)
    ## [1] 97  9
    head(round(prostate, 3))
    ##   lcavol lweight age lbph svi lcp gleason pgg45 lpsa

The dataset contains 8 clinical variables and 1 outcome (lpsa). Among them, svi and gleason are categorical variables.

Name      Attribute
lcavol    log cancer volume
lweight   log prostate weight
age       age in years
lbph      log of benign prostatic hyperplasia amount
svi       seminal vesicle invasion
lcp       log of capsular penetration
gleason   Gleason score
pgg45     percent of Gleason scores 4 or 5
lpsa      log of prostate specific antigen
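Other quick ways to inspect a newly loaded dataset include str(), which lists the type and first values of every column, and the dataset's own help page (assuming the package documents it):

    # compact summary of the structure of each column
    str(prostate)

    # documentation page for the dataset, if the package provides one
    ?prostate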

We can first perform some basic descriptive statistics. The summary() function provides a quick overview of the data and each of the variables. It is particularly useful for continuous variables: the quantiles and the mean are reported for each one. For categorical variables, we can use table() to count the observations in each category.

    summary(prostate)
    ## (Min., 1st Qu., Median, Mean, 3rd Qu. and Max. are reported for every variable)

    # the table() function can be used for categorical variables
    table(prostate$svi)
    table(prostate$gleason)
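table() can also cross-tabulate two categorical variables, and prop.table() converts counts into proportions; a small sketch:

    # two-way table of svi by gleason
    table(prostate$svi, prostate$gleason)

    # proportions instead of counts
    prop.table(table(prostate$svi))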

    # mean and sd are standard statistical functions
    mean(prostate$lcavol)
    sd(prostate$lcavol)

    # count frequencies for categorical variables
    table(prostate$gleason)

    # this statement creates multiple plots in the same row
    par(mfrow=c(1,2))

    # we can view a single continuous variable using a histogram
    hist(prostate$lcavol)
    hist(prostate$lpsa)

[Figure: histograms of prostate$lcavol and prostate$lpsa, with frequency on the vertical axis]
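A continuous variable can also be compared across the levels of a categorical variable; for instance, a boxplot of lpsa within each svi group (a small illustrative sketch):

    # boxplot of the outcome within each svi group
    boxplot(lpsa ~ svi, data = prostate,
            xlab = "seminal vesicle invasion", ylab = "log PSA")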

For two random variables, we usually use the correlation coefficient to describe how related the two variables are.

    plot(prostate$lcavol, prostate$lpsa, pch = 19, cex = 0.7,
         xlab = "cancer volume", ylab = "prostate specific antigen")

[Figure: scatter plot of prostate specific antigen against cancer volume]

    cor(prostate$lcavol, prostate$lpsa)

    # the correlation matrix can be calculated for each pair of variables
    round(cor(prostate), 3)
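By default cor() computes the Pearson correlation. Rank-based alternatives are available through the method argument, and cor.test() gives a formal test of zero correlation; for example:

    # rank-based correlation coefficients
    cor(prostate$lcavol, prostate$lpsa, method = "spearman")
    cor(prostate$lcavol, prostate$lpsa, method = "kendall")

    # test of the null hypothesis that the correlation is zero
    cor.test(prostate$lcavol, prostate$lpsa)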

Some R packages provide really nice plots of the correlation matrix.

    library(corrplot)
    corrplot(cor(prostate), type = "upper", tl.col = "black", tl.srt = 35)

[Figure: correlation plot of all pairs of variables, upper triangle]

    pairs(~., data = prostate[, -c(5, 7)], pch = 19, cex = 0.5, col = "deepskyblue")

[Figure: pairwise scatter plots of lcavol, lweight, age, lbph, lcp, pgg45 and lpsa]

To demonstrate different correlation coefficients, we simulate some random data.

    # I will set a random seed so that the results can be replicated.
    set.seed(1)

    # generate 1000 independent normal observations
    n = 1000; X = rnorm(n)

    par(mfrow = c(2,3), oma = c(2,2,0,0) + 0.1, mar = c(2,1,2,1) + 0.1)

    # strong positive correlation
    plot(X, X + rnorm(n, 0, 0.5), pch = 19, cex = 0.5)
    title("Strong Positive Correlation (0.9)")

    # moderate positive correlation
    plot(X, X + rnorm(n, 0, 2), pch = 19, cex = 0.5)
    title("Moderate Positive Correlation (0.5)")

    # zero correlation due to symmetry
    plot(X, X^2 + rnorm(n, 0, 0.5), pch = 19, cex = 0.5)
    title("(Nearly) Zero Correlation")

    # zero correlation due to independence
    plot(X, rnorm(n), pch = 19, cex = 0.5)
    title("(Nearly) Zero Correlation")

    # moderate negative correlation
    plot(X, - X + rnorm(n, 0, 2), pch = 19, cex = 0.5)
    title("Moderate Negative Correlation (-0.5)")

    # strong negative correlation
    plot(X, - X + rnorm(n, 0, 0.5), pch = 19, cex = 0.5)
    title("Strong Negative Correlation (-0.9)")

[Figure: six scatter plots of the simulated pairs, with the panel titles given in the code above]
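The numbers in the titles are the approximate population correlations implied by the noise levels; you can check them empirically with cor() on a fresh draw of the noise, for example:

    # sample correlation for the first setting, approximately 0.9
    cor(X, X + rnorm(n, 0, 0.5))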

3 Hypothesis Testing

3.1 Example 1: t test

Let's perform a t test of the outcome lpsa between the two levels of the svi indicator.

    t.test(lpsa ~ svi, data = prostate)
    ##  Welch Two Sample t-test
    ##
    ## data:  lpsa by svi
    ## p-value = 7.879e-08
    ## alternative hypothesis: true difference in means is not equal to 0

The test shows a p-value less than 10^-4, which is considered highly significant. Note that this test can be carried out in several different ways depending on how your data are organized. For example, the following approach gives the same result.

    g1 = prostate$lpsa[prostate$svi == 1]
    g0 = prostate$lpsa[prostate$svi == 0]
    t.test(g0, g1)
    ##  Welch Two Sample t-test
    ##
    ## data:  g0 and g1
    ## p-value = 7.879e-08
    ## alternative hypothesis: true difference in means is not equal to 0
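If the normality assumption behind the t test is in doubt, a nonparametric alternative is the Wilcoxon rank-sum test, which uses the same formula interface; for example:

    # rank-based two-sample test of lpsa between the svi groups
    wilcox.test(lpsa ~ svi, data = prostate)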

3.2 Example 2: simple linear regression

An alternative approach for this question is to use a regression model, or ANOVA. We consider modeling the relationship as

    lpsa ~ β0 + β1 * svi

    fit = glm(lpsa ~ svi, data = prostate, family = "gaussian")
    summary(fit)

In the fitted model, both coefficients are highly significant: the intercept has a p-value below 2e-16, and svi has a p-value on the order of 1e-09, consistent with the t test above.
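Because the outcome is continuous and the family is gaussian, the same model can equivalently be fitted with lm(), and anova() then prints the classical ANOVA table; a quick sketch:

    # equivalent least-squares fit
    fit.lm = lm(lpsa ~ svi, data = prostate)
    summary(fit.lm)

    # ANOVA table for the same comparison
    anova(fit.lm)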

3.3 Example 3: multiple linear regression

We can further incorporate more variables in the regression model, for example

    lpsa ~ β0 + β1 * svi + β2 * lcavol + β3 * age

    fit = glm(lpsa ~ svi + lcavol + age, data = prostate, family = "gaussian")
    summary(fit)

In the fitted model, lcavol is the strongest predictor (p-value on the order of 1e-11), svi remains significant at the 0.01 level, and age is not significant.

3.4 Example 4: multiple testing

Consider a situation where two variables x and y are completely independent. However, as we repeatedly perform the test on newly generated samples, some of the results will turn out significant.

    pvalues = rep(NA, 1000)
    for (i in 1:1000)
    {
      x = rbinom(20, 1, 0.5)
      y = rnorm(20)
      pvalues[i] = t.test(y ~ x)$p.value
    }

    table(pvalues <= 0.05)

Of course, in this example we expect to see around 5% significant results; that is how the test is designed. But this can be a big issue in genomic studies, where we need to test thousands of genes one by one. One solution is the Bonferroni correction, which lowers the α level to account for the number of hypothesis tests being performed.
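In R, multiple-testing corrections are available through p.adjust(), which supports the Bonferroni adjustment as well as false discovery rate methods; a minimal sketch using the simulated p-values above:

    # adjusted p-values: Bonferroni controls the family-wise error rate,
    # Benjamini-Hochberg ("BH") controls the false discovery rate
    sum(p.adjust(pvalues, method = "bonferroni") <= 0.05)
    sum(p.adjust(pvalues, method = "BH") <= 0.05)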

3.5 Example 5: high-dimensional data

Consider a situation where the number of variables is very large. We will have a very unstable system: too many variables appear to be significant, and the parameter estimates are unreliable. When the number of variables is larger than the number of observations, the standard linear model cannot even be fitted.

    # 55 observations with 50 variables
    set.seed(1)
    n = 55; p = 50
    x = matrix(rnorm(n*p), n, p)

    # y is independent of x
    y = rnorm(n)
    allcoef = summary(lm(y~x))$coefficients

    # just the variables that are estimated to be significant
    allcoef[allcoef[,4] < 0.05, ]

Even though y is pure noise, several of the 50 variables are flagged as significant at the 0.05 level.

Some machine learning models can help in this situation, for example the lasso regression. It incorporates a penalty that focuses on the important variables (if there are any) and tries to eliminate variables that are pure noise. This is done using the glmnet package.

    library(glmnet)
    lasso.fit = cv.glmnet(x, y)

    # the lasso model estimates all coefficients to be 0 except the intercept term
    sum(as.numeric(coef(lasso.fit)) != 0)
    ## [1] 1

Some other machine learning models, such as random forests, can also be used to handle high-dimensional data and estimate the effect of each variable. For more details, please see the documentation of the R package randomForest.
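As a rough sketch of that last point, a random forest can be fitted to the same simulated data and its variable importance measures inspected; since y is independent of x, none of the importance scores should stand out:

    library(randomForest)

    # fit a random forest to the noise-only data
    rf.fit = randomForest(x, y, importance = TRUE)

    # permutation-based importance for each of the 50 variables
    head(importance(rf.fit))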
