The First Thing to Do When You Receive a Set of Data

Understand the goal of the study
- What are the objectives of the study?
- What would the person like to see from the data?

Understand the methodology
- How are the samples being collected?
- Is there any subjectivity in sample collection?
- Pay attention to nested designs and pseudo-replication.

After Understanding the Objectives and Methodology

Calculate some summary statistics to help you understand the nature of the data. Summary statistics can be calculated for most types of data.

mean()          mean of a vector
mean(, trim)    trimmed mean
median()        median of a vector
quantile()      sample quantiles at given probabilities
range()         minimum and maximum values
var()           variance of a vector, or covariance matrix of a matrix or data frame
cov()           covariance of two vectors or of a data frame
cor()           correlation coefficients of two vectors or of a data frame
mad()           median absolute deviation
stem()          stem-and-leaf plot
summary()       summary statistics of a data frame
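As a minimal sketch of the functions listed above, the following assumes a small made-up vector `x` (not data from any study) and the built-in `mtcars` data frame. It is only meant to show how the calls look.

```r
# Illustrative data (made up for this sketch): nine typical values and one outlier
x <- c(2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.5, 6.1, 6.7, 25.0)

mean(x)               # arithmetic mean, pulled upward by the outlier
mean(x, trim = 0.1)   # trimmed mean: drops 10% of the values at each end first
median(x)             # median, much less sensitive to the outlier
range(x)              # minimum and maximum
var(x)                # sample variance
mad(x)                # median absolute deviation (scaled by 1.4826 by default)

# Two-variable summaries on a built-in data frame
cov(mtcars$mpg, mtcars$wt)       # covariance of two vectors
cor(mtcars$mpg, mtcars$wt)       # correlation of two vectors
var(mtcars[, c("mpg", "wt")])    # covariance matrix of a data frame
```

Comparing `mean(x)` with `mean(x, trim = 0.1)` and `median(x)` is a quick way to see whether a few extreme values dominate the data.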

Examples of Summary Statistics

quantile()
The quantile() function takes two vectors as input. The first contains the observations, and the second contains the probabilities corresponding to the desired quantiles. The function returns the empirical quantiles of the first vector.

stem()
A stem-and-leaf plot shows the distribution of the values in a vector.
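A short sketch of both functions, assuming the built-in `faithful` data set as stand-in data (it is not the data used on the original slides):

```r
# Stand-in data: waiting times between eruptions from the built-in 'faithful' data
w <- faithful$waiting

# First argument: the observations; second argument: the probabilities
q <- quantile(w, probs = c(0.1, 0.5, 0.9))
q

# Without 'probs', quantile() returns the minimum, the quartiles, and the maximum
quantile(w)

# Stem-and-leaf display, printed as text to the console
stem(w)
```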

summary()
summary() is helpful for calculating basic statistics for each column of a data frame.

Distributional Test

Test whether one or more vectors conform to a certain distribution.

chisq.test()      Chi-squared goodness-of-fit test
ks.test()         Kolmogorov-Smirnov goodness-of-fit test
t.test()          One- or two-sample Student's t-test
var.test()        Test of equality of the variances of x and y
shapiro.test()    Shapiro-Wilk test of normality
wilcox.test()     One- and two-sample Wilcoxon rank sum and signed rank tests
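The calls below sketch how summary() and several of the tests in the table are typically invoked. The vectors `x` and `y` are simulated and the die-roll counts are made up, so the printed results only illustrate the output format.

```r
summary(iris)   # per-column summaries of a built-in data frame

set.seed(1)                          # arbitrary seed, for reproducibility
x <- rnorm(50, mean = 10, sd = 2)    # simulated sample 1
y <- rnorm(50, mean = 11, sd = 2)    # simulated sample 2

shapiro.test(x)        # H0: x is drawn from a normal distribution
t.test(x, mu = 10)     # one-sample t-test, H0: the mean of x is 10
t.test(x, y)           # two-sample t-test (Welch version by default)
var.test(x, y)         # F test, H0: x and y have equal variances
wilcox.test(x, y)      # two-sample Wilcoxon rank sum test

# Chi-squared goodness of fit: are 120 die rolls consistent with a fair die?
rolls <- c(18, 22, 21, 19, 24, 16)   # made-up counts for faces 1 to 6
gof <- chisq.test(rolls, p = rep(1/6, 6))
gof
```

Each test returns an object of class "htest", so components such as `gof$statistic` and `gof$p.value` can be extracted for later use.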

ks.test()
This is a versatile test that allows you to test:
- whether a data vector is drawn from a certain distribution
- whether two data vectors are drawn from the same distribution

Intention

Regression could be a full course by itself (or even many courses), so it is not the intention of this class to teach regression theory. I will just introduce some functions that allow you to perform regression.
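Before moving on to regression, the two uses of ks.test() listed above can be sketched as follows. Both samples are simulated, so this is an illustration under assumed data rather than an analysis.

```r
set.seed(42)        # arbitrary seed, for reproducibility
x <- rnorm(100)     # simulated standard normal sample
y <- runif(100)     # simulated uniform(0, 1) sample

# One-sample test: is x drawn from a normal distribution with mean 0 and sd 1?
one_sample <- ks.test(x, "pnorm", mean = 0, sd = 1)
one_sample

# Two-sample test: are x and y drawn from the same distribution?
two_sample <- ks.test(x, y)
two_sample
```

In the one-sample form, the second argument names a cumulative distribution function and any further arguments are passed to it as its parameters.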

Basic

The general structure of regression functions in R consists of a formula object and additional arguments. Formula objects play a very important role in statistical modeling in R: they specify the model to be fitted. The exact formulation of a formula object depends on the modeling function. However, the general form is

    response ~ expression

Sometimes the response can be omitted, and sometimes the expression is a collection of variables. The specification is quite flexible.
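A small sketch of formula objects, assuming variable names from the built-in `mtcars` data frame. It shows that a formula is an ordinary R object that can be stored and inspected before it is handed to a modeling function.

```r
# A two-sided formula: response on the left, expression on the right
f <- mpg ~ wt + hp
class(f)       # the object has class "formula"
all.vars(f)    # names of the variables the formula refers to

# A one-sided formula: the response is omitted
g <- ~ wt + hp

# A stored formula can be passed to a modeling function
fit <- lm(f, data = mtcars)
formula(fit)   # recover the formula from the fitted model
```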

Linear Regression

Linear regression, as it is usually known, has the following form

$$ y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon $$

where $\beta_0$ is the intercept, $\beta_1, \dots, \beta_p$ are the p regression coefficients, and $x_1, \dots, x_p$ are the p regression variables. The error term $\varepsilon$ has mean zero and is often modeled as normally distributed with some variance.

The multiple regression function in R is lm(formula, data, weights, subset, na.action), e.g., lm(y ~ x1 + x2 + x3 + xp, ...).

The operators in formula objects have different meanings:
- The : operator is used to model interaction terms in linear models.
- The * operator is shorthand notation for interaction; it includes the main effects together with all possible combinations of interactions up to order p.
- The ^ operator is used to generate interaction terms up to a certain order.
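To see what the operators expand to, the sketch below uses a small helper, `expand()`, which is a hypothetical name introduced here for illustration. It lists the terms that R generates from a formula. The variables `a`, `b`, and `c` are placeholders and need not exist.

```r
# Hypothetical helper: list the model terms a formula expands to
expand <- function(f) attr(terms(f), "term.labels")

expand(y ~ a + b + a:b)      # main effects plus an explicit interaction term
expand(y ~ a * b)            # same terms: a, b, and a:b
expand(y ~ a * b * c)        # all main effects and interactions up to order 3
expand(y ~ (a + b + c)^2)    # main effects and interactions up to order 2 only

# The same notation in an actual fit on the built-in mtcars data
fit <- lm(mpg ~ wt * hp, data = mtcars)
names(coef(fit))
```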

- The - operator is used to leave out terms in a formula. E.g., -1 removes the intercept from a regression formula.
- The function I() is used to suppress the special meaning of the operators in a linear regression model. For example, if you want to include a transformed x2 variable in your model, say multiplied by 2, a formula such as y ~ 2*x2 will not work, because * is read as a formula operator rather than as multiplication; write y ~ I(2*x2) instead.

After you have fitted a linear model, you will want to extract information about the results of the fit. R provides functions for extracting information on model fits and diagnostics.
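The following sketch shows I(), the -1 notation, and a few commonly used extractor functions. The extractors shown are standard ones from base R and are not necessarily the exact list from the original slide. The built-in `mtcars` data are used as a stand-in.

```r
# I() protects arithmetic inside a formula: 2 * hp is computed as a number
fit_I <- lm(mpg ~ wt + I(2 * hp), data = mtcars)
coef(fit_I)

# -1 removes the intercept
fit_noint <- lm(mpg ~ wt - 1, data = mtcars)
coef(fit_noint)

# Commonly used extractor functions for a fitted model
summary(fit_I)           # coefficient table, R-squared, F statistic
head(residuals(fit_I))   # residuals
head(fitted(fit_I))      # fitted values
anova(fit_I)             # sequential analysis-of-variance table
confint(fit_I)           # confidence intervals for the coefficients
predict(fit_I, newdata = data.frame(wt = 3, hp = 110))   # prediction for a new case
```

Calling plot() on a fitted lm object additionally produces the standard diagnostic plots, including a residuals-versus-fitted plot and a normal Q-Q plot.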

Suppose that after the diagnostics you are not satisfied with your model and would like to make changes. You could write out the whole model formulation again, but it is much easier to use the update() function in R to specify only the changes that you need to make to your original model. The ~.+Disp construction adds the Disp. variable to whatever model was used in generating the cars.lm object.

Generalized Linear Model (GLM)

Generalized linear models are used to fit responses that follow a suite of distributions other than the normal distribution; the common logistic regression is one example. The R function for fitting a GLM is glm(). The distribution to be fitted is chosen through the family argument of glm().
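A sketch of both ideas, assuming the built-in `mtcars` data as a stand-in for the car data behind the cars.lm object (that object's definition is not shown in the text, so the models here are illustrative only). The families documented in R's `?family` help page are binomial, gaussian, Gamma, inverse.gaussian, poisson, quasi, quasibinomial, and quasipoisson.

```r
# Start from a simple model and modify it with update()
fit  <- lm(mpg ~ wt, data = mtcars)
fit2 <- update(fit, ~ . + disp)     # "." stands for the terms already in the model
formula(fit2)                       # mpg ~ wt + disp
fit3 <- update(fit2, ~ . - wt)      # terms can be removed in the same way
formula(fit3)

# Logistic regression: binary response (am is 0/1), binomial family
logit_fit <- glm(am ~ wt, family = binomial, data = mtcars)
summary(logit_fit)

# Poisson regression: count response (number of carburetors)
pois_fit <- glm(carb ~ wt, family = poisson, data = mtcars)
coef(pois_fit)
```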

Non-linear Regression

The non-linear regression model specifies non-linear combinations of predictors in the model formulation. It generally has the form

$$ y = f(x, \theta) + \varepsilon $$

where f is a function that is non-linear in the parameters $\theta$. The R function to compute non-linear regression is nls().

Mixed-effects Modeling

For linear mixed-effects modeling (LMM), there are currently two common packages:
- lme4 package (Bates, Maechler and Bolker): lmer()
- nlme package (Pinheiro and Bates): lme()

For generalized linear mixed-effects modeling (GLMM):
- lme4 package (Bates, Maechler and Bolker): glmer()

For non-linear mixed-effects modeling (NLMM):
- nlme package (Pinheiro and Bates): nlme()
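As a hedged illustration of nls(), the sketch below fits an exponential decay curve, y = a * exp(-b * x), to simulated data. The model, the true parameter values, and the starting values are all assumptions made for this example and do not come from the original slides. The second part fits a random-intercept model with lme() on the Orthodont data shipped with nlme, and runs only if that package is installed.

```r
set.seed(123)                                   # arbitrary seed, for reproducibility
x <- seq(0, 10, length.out = 50)
y <- 5 * exp(-0.4 * x) + rnorm(50, sd = 0.1)    # simulated: true a = 5, true b = 0.4

# nls() needs the model formula and starting values for the parameters
nl_fit <- nls(y ~ a * exp(-b * x), start = list(a = 4, b = 0.3))
coef(nl_fit)

# Linear mixed-effects model: random intercept for each subject
if (requireNamespace("nlme", quietly = TRUE)) {
  mm <- nlme::lme(distance ~ age, random = ~ 1 | Subject, data = nlme::Orthodont)
  print(nlme::fixef(mm))
}
```

Unlike lm(), nls() is iterative, so poorly chosen starting values can prevent convergence.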

Design of Experiment

An experiment allows us to make inferences about causal effects between the response and the predictors because of the way the study is set up:
- by controlling the levels of the predictors
- by minimizing effects from unwanted external factors

An experiment follows a rigid design so that we can make inferences. The common way of analyzing experimental data is some variation of Analysis of Variance (ANOVA). The setup of ANOVA differs among experimental designs. However, it is important to note that ANOVA is essentially a linear regression.

One-way ANOVA

One-way ANOVA is used when the experiment consists of one factor, which could have multiple levels.

The null hypothesis, H0: there is no difference in the mean response among the levels of the factor

The example dataset is a set of 24 blood coagulation times. 24 animals were randomly assigned to four different diets, and the samples were taken in a random order (install the faraway package in R to obtain the data).

We will model the response of blood coagulation time to the different diets.
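The sketch below shows the shape of a one-way ANOVA fitted as a linear model. So that it runs without extra packages, it uses the built-in `PlantGrowth` data (one factor, `group`, with three levels) as a stand-in. With the faraway package installed, the corresponding call for the coagulation data would presumably be `lm(coag ~ diet, data = coagulation)`, assuming those column names.

```r
# Stand-in data: plant weights under a control and two treatments
str(PlantGrowth)

# One-way ANOVA expressed as a linear regression on a factor
fit <- lm(weight ~ group, data = PlantGrowth)

aov_table <- anova(fit)   # F test of H0: all group means are equal
aov_table

summary(fit)              # coefficients: baseline mean and differences from it
```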

What if we fit the linear model without the intercept?

Two-way ANOVA

An experimental design that involves two-way ANOVA has two factors. There can be three hypotheses:
- Null hypothesis 1, H0: there is no interaction
- Null hypothesis 2, H0: there is no effect from factor 1
- Null hypothesis 3, H0: there is no effect from factor 2
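For the no-intercept question raised for the one-way model, a quick way to see the effect is to compare the coefficients of the two fits. This sketch again uses the built-in `PlantGrowth` data as a stand-in for the coagulation data.

```r
with_int <- lm(weight ~ group, data = PlantGrowth)       # default: intercept included
no_int   <- lm(weight ~ group - 1, data = PlantGrowth)   # intercept removed

coef(with_int)   # first level's mean, then differences from that level
coef(no_int)     # one coefficient per level: the group means themselves

# The group means computed directly, for comparison
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_means

# Both parameterizations give identical fitted values
all.equal(fitted(with_int), fitted(no_int))
```

The two fits describe the same model, but the overall F test and R-squared reported by summary() differ, because without an intercept the reference model is "all means equal zero" rather than "all means equal".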

Example: 48 rats were allocated to 3 poisons (I, II, III) and 4 treatments (A, B, C, D). The response was survival time in tens of hours.

Fitting the two-way ANOVA and checking the fit

The response variable needs to be transformed because the Q-Q plot and the residuals show undesirable properties.

The interaction term is then removed, since it is not significant.
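The workflow described above (fit, check, transform, drop the interaction) might look like the sketch below. The data are simulated to mimic the 3 x 4 layout with 4 rats per cell and are not the real survival times. The reciprocal transformation is an assumption as well: it is a common choice for survival times, but the text does not say which transformation was used.

```r
set.seed(2024)   # arbitrary seed, for reproducibility

# Simulated stand-in for the design: 3 poisons x 4 treatments x 4 rats per cell
rats_sim <- expand.grid(rat = 1:4,
                        poison = c("I", "II", "III"),
                        treat = c("A", "B", "C", "D"))

# Made-up effects that are additive on the rate (1/time) scale
poison_eff <- c(I = 0, II = 0.5, III = 2)
treat_eff  <- c(A = 1.5, B = 0, C = 1, D = 0.3)
rate <- 2 + poison_eff[as.character(rats_sim$poison)] +
  treat_eff[as.character(rats_sim$treat)] + rnorm(48, sd = 0.3)
rats_sim$time <- unname(1 / rate)

# Two-way ANOVA with interaction on the raw response
fit_raw <- lm(time ~ poison * treat, data = rats_sim)
shapiro.test(residuals(fit_raw))   # numeric companion to inspecting a Q-Q plot

# Refit on the transformed response and test the interaction
fit_inv <- lm(1 / time ~ poison * treat, data = rats_sim)
anova(fit_inv)

# Drop the interaction term and compare the two nested models
fit_add <- update(fit_inv, . ~ . - poison:treat)
anova(fit_add, fit_inv)
```

In an interactive session, `qqnorm(residuals(fit_raw))` followed by `qqline(residuals(fit_raw))`, or simply `plot(fit_raw)`, gives the graphical checks the text refers to.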

Randomized Complete Block Design (RCBD)

Blocking is an effective method for removing unwanted and unknown variation (which could not be controlled). We arrange the experimental units into blocks in such a way that the intrablock variation is small but the interblock variation is large. A block design can be more effective than the Completely Randomized Design (CRD, which is the design used in the one-way and two-way ANOVA examples).

Example: we have 4 treatments and 20 patients available. We could divide the patients into 5 blocks of 4 patients each, such that the patients within a block have some relevant similarity. Then we would randomly assign the treatments within each block.
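The random assignment in the patient example can be sketched as follows. The treatment labels A to D and the patient numbering are placeholders assumed for illustration. The key step is that sample() permutes the four treatments separately within each block.

```r
set.seed(7)   # arbitrary seed so the assignment can be reproduced

treatments <- c("A", "B", "C", "D")   # placeholder treatment labels
n_blocks   <- 5

design <- data.frame(
  block     = rep(1:n_blocks, each = length(treatments)),
  patient   = 1:(n_blocks * length(treatments)),
  # an independent random permutation of the treatments for every block
  treatment = unlist(lapply(1:n_blocks, function(b) sample(treatments)))
)

design

# Check: every treatment appears exactly once in every block
table(design$block, design$treatment)
```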

Example: compare 4 processes (A, B, C, D) for the production of penicillin. The raw material, corn steep liquor, is quite variable and can only be made in blends sufficient for 4 runs. So we block on the blends.

The null hypothesis, H0: there is no difference between the processes.

Is there interaction between blocks and treatments?

Let us just assume that there is no interaction (actually, we are not able to carry out a test for interaction; do you know why?)
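One way to explore the question is to count degrees of freedom. The sketch below simulates a layout like the penicillin example, with one run per blend-process combination. The number of blends (5) and the yields are assumptions made for illustration; they are not the real data.

```r
set.seed(11)   # arbitrary seed, for reproducibility

# Simulated layout: 4 processes in each of 5 blends, one run per combination
pen_sim <- expand.grid(treat = c("A", "B", "C", "D"),
                       blend = paste0("Blend", 1:5))
pen_sim$yield <- 85 + rnorm(20, sd = 3)   # made-up yields with no true effects

# Additive RCBD model: block effect plus treatment effect
fit_add <- lm(yield ~ blend + treat, data = pen_sim)
anova(fit_add)
df.residual(fit_add)   # 20 - 1 - 4 - 3 = 12 degrees of freedom left for error

# Model with block-by-treatment interaction
fit_int <- lm(yield ~ blend * treat, data = pen_sim)
length(coef(fit_int))  # 20 coefficients for 20 observations
df.residual(fit_int)   # 0: nothing is left to estimate the error variance
```

With a single observation per block-treatment cell, the interaction model reproduces the data exactly, so there is no residual variation against which an interaction could be tested.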