The First Thing You Ever Do When You Receive a Set of Data Is

Understand the goal of the study. What are the objectives of the study? What would the person like to see from the data?

Understand the methodology. How are the samples being collected? Is there any subjectivity in sample collection? Pay attention to nested designs and pseudo-replication.

After Understanding the Objectives and Methodology

Calculate some summary statistics to help you understand the nature of the data. Summary statistics can usually be calculated for most types of data.

  mean()       mean of a vector
  mean(,trim)  trimmed mean
  median()     median of a vector
  quantile()   sample quantiles at given probabilities
  range()      minimum and maximum values
  var()        variance of a vector, or covariance matrix of a matrix or data frame
  cov()        covariance of two vectors, or covariance matrix of a data frame
  cor()        correlation coefficient of two vectors, or correlation matrix of a data frame
  mad()        median absolute deviation
  stem()       stem-and-leaf plot
  summary()    summary statistics of a data frame
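As a rough illustration of the functions listed above, the sketch below applies them to the built-in mtcars data frame. The choice of mtcars and of the mpg and wt columns is an assumption made for this example and does not come from the slides.

```r
# Illustrative data: two numeric columns of the built-in mtcars data frame
x <- mtcars$mpg
y <- mtcars$wt

mean(x)               # arithmetic mean
mean(x, trim = 0.1)   # trimmed mean: drops 10% of the observations from each tail
median(x)
range(x)              # vector holding the minimum and the maximum
var(x)                # variance of a vector
cov(x, y)             # covariance of two vectors
cor(x, y)             # correlation coefficient of two vectors
mad(x)                # median absolute deviation

# On a data frame, var() and cor() return matrices
var(mtcars[, c("mpg", "wt")])
cor(mtcars[, c("mpg", "wt")])
```

The trim argument is the fraction removed from each end, so trim = 0.1 discards 20% of the observations in total.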

Examples of Summary Statistics: quantile()

The quantile() function takes two vectors as input. The first contains the observations, and the second contains the probabilities at which the quantiles are wanted. The function returns the empirical quantiles of the first vector.

Examples of Summary Statistics: stem()

A stem-and-leaf plot shows the distribution of a vector.
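The two functions described above might be used as follows. This is only a sketch: the data (mtcars$mpg) and the probability vector are assumptions chosen for illustration.

```r
x <- mtcars$mpg                          # observations (illustrative choice)
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)    # probabilities of the wanted quantiles

q <- quantile(x, probs)   # empirical quantiles of x at those probabilities
q

stem(x)                   # prints a text stem-and-leaf plot to the console
```

With the default settings, the 0.5 quantile returned by quantile() agrees with median(x).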

Examples of Summary Statistics: summary()

summary() calculates basic statistics for each column of a data frame.

Distributional Test

Test whether a vector, or multiple vectors, conform to a certain distribution.

  chisq.test()    Chi-squared goodness-of-fit test
  ks.test()       Kolmogorov-Smirnov goodness-of-fit test
  t.test()        one- or two-sample Student's t test
  var.test()      test of equality of the variances of x and y
  shapiro.test()  Shapiro-Wilk test of normality
  wilcox.test()   one- and two-sample Wilcoxon rank sum and signed rank tests
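A minimal sketch of summary() and several of the tests in the table above. The simulated vectors, their parameters, and the counts passed to chisq.test() are assumptions made up for this example.

```r
# Column-wise summary of a data frame (illustrative columns of mtcars)
summary(mtcars[, c("mpg", "wt", "hp")])

# Simulated samples, used only for illustration
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)
y <- rnorm(50, mean = 11, sd = 2)

sw <- shapiro.test(x)     # Shapiro-Wilk test of normality
tt <- t.test(x, y)        # two-sample t test (Welch version by default)
vt <- var.test(x, y)      # F test of equal variances
wx <- wilcox.test(x, y)   # Wilcoxon rank sum test

# Goodness of fit of observed counts to equal probabilities
counts <- c(18, 22, 20)
cs <- chisq.test(counts)

# Each result is an "htest" object; the p-value is one of its components
c(shapiro = sw$p.value, t = tt$p.value, var = vt$p.value,
  wilcox = wx$p.value, chisq = cs$p.value)
```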

Distributional Test: ks.test()

This is a versatile test that allows you to test:
  whether a data vector is drawn from a certain distribution
  whether two data vectors are drawn from the same distribution

Intention

Regression can be a full course by itself (or even many courses), so it is not the intention of this class to teach you regression theory. I will just introduce some functions that allow you to perform regression.
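Before moving on to regression, the two uses of ks.test() described above can be sketched as follows. The simulated samples are assumptions for illustration only.

```r
set.seed(42)
x <- rnorm(100)   # sample from a standard normal
y <- runif(100)   # sample from a uniform on (0, 1)

# One-sample test: is x drawn from a N(0, 1) distribution?
one <- ks.test(x, "pnorm", mean = 0, sd = 1)

# Two-sample test: are x and y drawn from the same distribution?
two <- ks.test(x, y)

one$p.value
two$p.value
```

In the one-sample form, the second argument names a cumulative distribution function and the remaining arguments are passed on to it as its parameters.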

Basic

The general structure of the regression functions in R consists of a formula object and additional arguments. Formula objects play a very important role in statistical modeling in R: they specify the model to be fitted. The exact form of a formula object depends on the modeling function. However, the general form is

  response ~ expression

Sometimes the response can be omitted, and sometimes the expression is a collection of variables. The specification is quite flexible.
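A short sketch of formula objects as ordinary R values. The variable names (mpg, wt, hp) are illustrative assumptions; a formula can be created without the variables existing yet.

```r
f1 <- mpg ~ wt + hp   # two-sided formula: response ~ expression
f2 <- ~ wt + hp       # one-sided formula: the response is omitted

class(f1)             # "formula"
all.vars(f1)          # variable names used in the formula
length(f1)            # 3 parts: the ~ operator, the left side, the right side
length(f2)            # 2 parts: the ~ operator and the right side

# The same formula object can be passed to a modeling function
fit <- lm(f1, data = mtcars)
```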

Linear Regression

Linear regression, as we usually know it, has the following form

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$$

where $\beta_0$ is the intercept, $\beta_1, \ldots, \beta_p$ are the p regression coefficients, and $x_1, \ldots, x_p$ are the p regression variables. The error term $\varepsilon$ has mean zero and is often modeled as normally distributed with some variance. The multiple regression function in R is

  lm(formula, data, weights, subset, na.action)

E.g., lm(y ~ x1 + x2 + x3 + ... + xp, ...).

Linear Regression

The operators in formula objects have special meanings:
  :  models an interaction term in a linear model
  *  is shorthand notation for main effects plus interactions; it includes the listed terms and all of their possible interactions, up to order p for p terms
  ^  generates the terms and their interactions up to a given order
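One way to see what the operators above do is to ask R which terms a formula expands to. The sketch below uses placeholder variable names (y, a, b, c) and, for the fitted example, columns of mtcars; both are assumptions for illustration.

```r
# terms() expands a formula; "term.labels" lists the resulting model terms
attr(terms(y ~ a:b), "term.labels")            # "a:b"
attr(terms(y ~ a * b), "term.labels")          # "a" "b" "a:b"
three.way <- attr(terms(y ~ a * b * c), "term.labels")
two.way   <- attr(terms(y ~ (a + b + c)^2), "term.labels")

three.way   # main effects, two-way interactions, and a:b:c
two.way     # main effects and two-way interactions only

# A fitted example: main effects of wt and hp plus their interaction
fit <- lm(mpg ~ wt * hp, data = mtcars)
names(coef(fit))
```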

Linear Regression

The operators in formula objects have special meanings:
  -    leaves terms out of a formula. E.g., -1 removes the intercept from a regression formula.
  I()  suppresses the special meaning of the operators inside a formula. For example, if you want to include a transformed x2 variable in your model, say x2 multiplied by 2, writing the multiplication directly in the formula will not work, because * is read as a formula operator; the expression has to be wrapped in I().

Linear Regression

After you have fitted a linear model, you will want to extract information about the results of the fit. R provides functions that extract information on model fits and diagnostics.
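The sketch below illustrates -1, I(), and a handful of extractor functions. The data (mtcars) and the selection of extractors are assumptions for this example; the slide's own list of functions did not survive in the transcript, so this is not a reproduction of it.

```r
fit  <- lm(mpg ~ wt + hp, data = mtcars)
fit2 <- lm(mpg ~ wt + I(2 * hp), data = mtcars)  # I() makes * mean arithmetic
fit0 <- lm(mpg ~ wt + hp - 1, data = mtcars)     # -1 removes the intercept

coef(fit2)   # the coefficient of I(2 * hp) is half the coefficient of hp in fit
names(coef(fit0))

# Common extractor functions for a fitted lm object
coef(fit)              # estimated coefficients
summary(fit)           # coefficient table, R-squared, F statistic
anova(fit)             # sequential analysis of variance table
confint(fit)           # confidence intervals for the coefficients
head(residuals(fit))   # residuals
head(fitted(fit))      # fitted values
predict(fit, newdata = data.frame(wt = 3, hp = 120))
```

Calling plot(fit) in an interactive session draws the standard diagnostic plots.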

Linear Regression

Say that, after running the diagnostics, you are not satisfied with your model and would like to make changes. You could write out the whole model formulation again, but it is much easier to use the update() function in R to specify only the changes you need to make to your original model. The ~.+Disp construction adds the Disp. variable to whatever model was used to generate the cars.lm object.

Generalized Linear Model (GLM)

A generalized linear model is used to fit responses from a range of distributions other than the Normal distribution; logistic regression is a common example. The R function for fitting a GLM is glm(), and the distribution is chosen from the families that glm() supports.
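A hedged sketch of update() and glm(). The data set behind the slide's cars.lm object is not given in the transcript, so mtcars and its disp column stand in for it here; the object name cars.lm is kept only to mirror the text. The models themselves are illustrative assumptions.

```r
# Stand-in for the slide's cars.lm: a simple regression on mtcars
cars.lm <- lm(mpg ~ wt, data = mtcars)

# ". ~ . + disp" means: same response, same terms, plus disp
cars.lm2 <- update(cars.lm, . ~ . + disp)
formula(cars.lm2)

# Logistic regression: binary response am, binomial family (logit link by default)
logit.fit <- glm(am ~ wt, family = binomial, data = mtcars)
summary(logit.fit)

# Families documented in ?family: binomial, gaussian, Gamma, inverse.gaussian,
# poisson, quasi, quasibinomial, quasipoisson
pois.fit <- glm(carb ~ wt, family = poisson, data = mtcars)
```

update() refits the model, so the extractor functions for lm objects work on cars.lm2 as well.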

Non-linear Regression

A non-linear regression model specifies a non-linear combination of predictors and parameters in the model formulation. It generally has the form

$$y = f(x, \theta) + \varepsilon$$

where f is a non-linear function of the predictors x and the parameters $\theta$. The R function for non-linear regression is nls().

Mixed-effects Modeling

For linear mixed-effects modeling (LMM), there are currently two common packages:
  lme4 package (Bates, Maechler and Bolker): lmer()
  nlme package (Pinheiro and Bates): lme()
For generalized linear mixed-effects modeling (GLMM):
  lme4 package (Bates, Maechler and Bolker): glmer()
For non-linear mixed-effects modeling (NLMM):
  nlme package (Pinheiro and Bates): nlme()
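As a sketch of the nls() function mentioned above, the example below fits an exponential growth curve to simulated data. The model y = a * exp(b * x), the true parameter values, and the way starting values are obtained are all assumptions chosen for illustration; the slide's own example is not available in the transcript.

```r
# Simulated data from y = a * exp(b * x) + noise, with a = 2 and b = 0.15
set.seed(5)
x <- 1:20
y <- 2 * exp(0.15 * x) + rnorm(20, sd = 0.1)

# nls() needs starting values; a log-linear fit gives a reasonable first guess
lin <- lm(log(y) ~ x)
start <- list(a = exp(coef(lin)[[1]]), b = coef(lin)[[2]])

nls.fit <- nls(y ~ a * exp(b * x), start = start)
coef(nls.fit)
summary(nls.fit)
```

Unlike lm(), the formula passed to nls() is an ordinary mathematical expression, so * and ^ keep their arithmetic meaning there.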

Design of Experiment

An experiment allows us to make inferences about causal effects between the response and the predictors because of the way the study is set up:
  by controlling the levels of the predictors
  by minimizing effects from unwanted external factors
It follows a rigid design so that we can make inferences. The common way of analyzing experimental data is some variation of Analysis of Variance (ANOVA). The setup of ANOVA differs among experimental designs. However, it is important to note that ANOVA is essentially a linear regression.

One-way ANOVA

One-way ANOVA is used when the experiment consists of one factor, which can have multiple levels. The null hypothesis is that the mean response is the same at every level of the factor.
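The remark that ANOVA is essentially a linear regression can be checked numerically. The sketch below fits the same one-factor model with aov() and with lm(); the built-in InsectSprays data set is an assumption chosen for illustration.

```r
# One factor (spray) with six levels; response is an insect count
aov.fit <- aov(count ~ spray, data = InsectSprays)
lm.fit  <- lm(count ~ spray, data = InsectSprays)

summary(aov.fit)   # classical ANOVA table
anova(lm.fit)      # the same table, obtained from the regression fit

# The overall F statistic is identical in both fits
F.aov <- summary(aov.fit)[[1]][["F value"]][1]
F.lm  <- unname(summary(lm.fit)$fstatistic["value"])
all.equal(F.aov, F.lm)
```

Internally, the factor is turned into indicator (dummy) variables, which is why the two fits agree.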

One-way ANOVA

The example dataset is a set of 24 blood coagulation times. 24 animals were randomly assigned to four different diets, and the samples were taken in a random order (the data are in the faraway package for R).

One-way ANOVA

We will model the response of blood coagulation time to the different diets.
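A possible version of this analysis is sketched below. It assumes the data are the coagulation data frame of the faraway package, with columns coag and diet. If the package is not installed, the code falls back to a simulated stand-in with the same layout; the simulated values and the group sizes in that stand-in are assumptions and carry no scientific meaning.

```r
if (requireNamespace("faraway", quietly = TRUE)) {
  data(coagulation, package = "faraway")
} else {
  # Simulated stand-in: 24 animals on 4 diets (values are made up)
  set.seed(1)
  coagulation <- data.frame(
    coag = round(rnorm(24, mean = 64, sd = 3)),
    diet = factor(rep(c("A", "B", "C", "D"), times = c(4, 6, 6, 8)))
  )
}

# One-way ANOVA fitted as a linear model
coag.lm <- lm(coag ~ diet, data = coagulation)
anova(coag.lm)   # F test of "no difference between diets"
coef(coag.lm)    # intercept = mean of diet A; others = differences from A
```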

One-way ANOVA

What if we fit the linear model without the intercept?

Two-way ANOVA

An experimental design analyzed with two-way ANOVA has two factors. There can be three null hypotheses:
  H0 (1): there is no interaction
  H0 (2): there is no effect from factor 1
  H0 (3): there is no effect from factor 2
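On the earlier question of dropping the intercept from the one-way model, the sketch below shows what changes. It uses the built-in PlantGrowth data as an illustrative stand-in, which is an assumption and not the slide's data.

```r
with.int <- lm(weight ~ group, data = PlantGrowth)
no.int   <- lm(weight ~ group - 1, data = PlantGrowth)

coef(with.int)   # first-level mean, then differences from the first level
coef(no.int)     # one coefficient per level: the group means themselves

group.means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
all.equal(unname(coef(no.int)), as.vector(group.means))

# The fitted values are the same; only the parameterization differs
all.equal(fitted(with.int), fitted(no.int))
```

One caution: in the no-intercept fit, the overall F test reported by summary() tests whether all group means are zero, which is usually not the hypothesis of interest.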

Two-way ANOVA

Example: 48 rats were allocated to 3 poisons (I, II, III) and 4 treatments (A, B, C, D). The response was survival time in tens of hours.

Two-way ANOVA

Fitting the two-way ANOVA and checking the fit.
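A sketch of fitting and checking a two-way ANOVA with the layout described above. The real measurements are not in the transcript, so the data are simulated: the name rats.sim, the effect sizes, and the noise level are all assumptions for illustration.

```r
# Simulated stand-in: 3 poisons x 4 treatments x 4 replicates = 48 rats
set.seed(2)
rats.sim <- expand.grid(rep = 1:4,
                        treat = c("A", "B", "C", "D"),
                        poison = c("I", "II", "III"))
poison.eff <- c(I = 0, II = 0.5, III = 2)
treat.eff  <- c(A = 0, B = -1, C = -0.3, D = -0.8)
rate <- 2 + poison.eff[as.character(rats.sim$poison)] +
  treat.eff[as.character(rats.sim$treat)] + rnorm(48, sd = 0.2)
rats.sim$time <- unname(1 / rate)   # made-up survival times

# Two-way ANOVA with interaction
fit <- lm(time ~ poison * treat, data = rats.sim)
anova(fit)   # rows: poison, treat, poison:treat, Residuals

# Checking the fit: QQ plot and residuals against fitted values
if (interactive()) {
  qqnorm(residuals(fit)); qqline(residuals(fit))
  plot(fitted(fit), residuals(fit))
}
```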

Two-way ANOVA

The response variable needs to be transformed because the QQ plot and the residuals show undesirable properties.

Two-way ANOVA

The interaction term is removed because it is not significant.
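The two steps above might look like the sketch below. The transcript does not say which transformation the slides used; the reciprocal 1/time is an assumption here, and the data are again a simulated stand-in (rats.sim) with made-up effects.

```r
# Simulated stand-in with the same layout as the rat experiment
set.seed(2)
rats.sim <- expand.grid(rep = 1:4,
                        treat = c("A", "B", "C", "D"),
                        poison = c("I", "II", "III"))
poison.eff <- c(I = 0, II = 0.5, III = 2)
treat.eff  <- c(A = 0, B = -1, C = -0.3, D = -0.8)
rate <- 2 + poison.eff[as.character(rats.sim$poison)] +
  treat.eff[as.character(rats.sim$treat)] + rnorm(48, sd = 0.2)
rats.sim$time <- unname(1 / rate)

# Step 1: refit with a transformed response (assumed transformation: 1/time)
fit.inv <- lm(1 / time ~ poison * treat, data = rats.sim)
anova(fit.inv)

# Step 2: drop the interaction term with update()
fit.add <- update(fit.inv, . ~ . - poison:treat)
attr(terms(fit.add), "term.labels")

# F test comparing the additive model with the interaction model
anova(fit.add, fit.inv)
```

A Box-Cox analysis, for example with boxcox() from the MASS package, is one common way to choose such a transformation.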

Randomized Complete Block Design (RCBD)

Blocking is an effective method for removing unwanted and unknown variation that cannot be controlled. We arrange the experimental units into blocks in such a way that the intrablock variation is small but the interblock variation is large. A block design can be more effective than the Completely Randomized Design (CRD), which is the design behind the one-way and two-way ANOVA examples.

Randomized Complete Block Design (RCBD)

Example: we have 4 treatments and 20 patients available. We could divide the patients into 5 blocks of 4 patients each, such that the patients within a block have some relevant similarity. We would then randomly assign the treatments within each block.
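The randomization in this example can be sketched in a few lines. The treatment labels (T1 to T4), the patient numbering, and the assumption that patients are already grouped into blocks are hypothetical details added for illustration.

```r
set.seed(3)
treatments <- c("T1", "T2", "T3", "T4")

# 20 patients already grouped into 5 blocks of 4 similar patients
design <- data.frame(
  block   = rep(1:5, each = 4),
  patient = 1:20,
  # within each block, the 4 treatments are assigned in a fresh random order
  treatment = unlist(lapply(1:5, function(b) sample(treatments)))
)

design
table(design$block, design$treatment)   # every treatment once in every block
```

Each block receives every treatment exactly once, which is what makes the design "complete".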

Randomized Complete Block Design (RCBD)

Example: compare 4 processes (A, B, C, D) for the production of penicillin. The raw material, corn steep liquor, is quite variable and can only be made in blends sufficient for 4 runs. So we block on the blends. The null hypothesis is H0: there are no differences between the processes.

Randomized Complete Block Design (RCBD)

Is there interaction between blocks and treatments?
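A sketch of the additive RCBD analysis for a layout like the penicillin example. The transcript gives neither the data nor the number of blends, so the data frame pen.sim, the 5 blends, and all yield values are simulated assumptions.

```r
# Simulated stand-in: 4 processes run once in each of 5 blends (20 runs)
set.seed(4)
pen.sim <- expand.grid(treat = c("A", "B", "C", "D"),
                       blend = paste0("Blend", 1:5))
blend.eff <- rep(c(4, -2, 0, 1, -3), each = 4)   # made-up block effects
pen.sim$yield <- 85 + blend.eff + rnorm(20, sd = 2)

# Block (blend) enters first, then the treatment of interest
fit <- lm(yield ~ blend + treat, data = pen.sim)
anova(fit)   # the treat row tests H0: no differences between the processes
```

Putting blend in the model removes the blend-to-blend variation from the residual, so the comparison of processes is made within blends.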

Randomized Complete Block Design (RCBD)

Let's just assume that there is no interaction (actually, we are not able to carry out a test for interaction; do you know why?).
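One way to explore the question is to count degrees of freedom. The sketch below fits the interaction model to a simulated RCBD with one run per block and treatment combination; the data frame pen.sim and its values are assumptions for illustration.

```r
# Simulated stand-in: one observation per blend x treatment cell
set.seed(4)
pen.sim <- expand.grid(treat = c("A", "B", "C", "D"),
                       blend = paste0("Blend", 1:5))
pen.sim$yield <- 85 + rnorm(20, sd = 2)   # made-up yields

fit.int <- lm(yield ~ blend * treat, data = pen.sim)

length(coef(fit.int))   # number of parameters in the interaction model
nrow(pen.sim)           # number of observations
df.residual(fit.int)    # degrees of freedom left for estimating the error
```

With as many parameters as observations, the model reproduces the data exactly and no residual degrees of freedom remain, so there is no error variance against which an interaction F test could be computed.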