Examples and Limits of the GLM


Chapter 1

1.1 Motivation
1.2 A Review of Basic Statistical Ideas
1.3 GLM Definition
1.4 GLM Examples
1.5 Student Goals
1.6 Homework Exercises

1.1 Motivation

In this chapter we do three things. First, we describe the structure of the book and give suggestions on how a student might use it. Second, we review basic statistical ideas in order to introduce the definition of the General Linear Model (GLM). Third, and finally, we provide examples that demonstrate some of the value of the General Linear Model.

This book contains five modules:

I. Basic theory (chapters 1-3)
II. Multiple regression (chapters 4-6)
III. Model building and evaluation (chapters 7-11)
IV. ANOVA (chapters 12-15)
V. General topics, including ANCOVA and power (chapters 16-17)

The first module contains essentially all of the theory for the entire book. Hence the student should not be discouraged or misled by the initial emphasis on formulas. The remainder of the book centers on applications, driven by data and the need to answer particular scientific questions. The beauty and the value of the General Linear Model arise from its wide range of application. As one learns more, one has to remember less.

The exercises are an essential part of this book. The authors believe that effective learning requires completing all of them. Except for those in chapters 1 and 17, the exercises allow students to use data from their own field. Appendix D contains many alternate sets of data. All numeric examples in the text have been computed with simulated data in order to ensure knowledge of the underlying model. In contrast, all exercises use real data, collected in the course of scientific research. Only real data have the richness that creates the wonderful excitement and uncertainty inherent to the scientific process.

1.2 A Review of Basic Statistical Ideas

1.2.1 Variables

A statistical model describes relationships between variables. The scale of measurement (Stevens 1946, 1951) of a variable constrains and guides statistical analysis and statistical modeling. Four types of scales may be distinguished. Nominal scales have values that only name objects. For example, a person's blood type (A, B, O, and so on) provides a nominal scale. An ordinal scale gives order to objects. For example, the finishing position of runners in a 100 m dash only ranks the competitors. An interval scale has values for which the distances between them carry meaning. For example, measuring the decrease in light level during a solar eclipse provides an interval scale. A ratio scale has values whose ratios carry meaning. For example, a person's body weight measured in kilograms provides a ratio scale.

The interpretation of a zero value reflects the distinction between an interval and a ratio scale. For example, measuring temperature in degrees centigrade reflects an interval scale and an arbitrary zero point. In contrast, measuring temperature on the kelvin scale reflects a ratio scale with a true zero, indicating none of the property. Absolute zero (0 kelvin) corresponds to cessation of all molecular motion, while 0 centigrade merely corresponds to the freezing point of water. Velleman and Wilkinson (1993) provided useful discussion and a well-reasoned warning about taking a description of scale too seriously in choosing a data analysis. Certain other descriptors are often associated with the various types of variables: categorical (nominal), rank (ordinal), and continuous (interval or ratio).

In the context of statistical analysis, the value of the distinction between interval and ratio scales lies in the customary need to transform ratio scale variables. The error variance of a measurement of biological extent, such as mass, tends to be proportional to the size of the measurement. Consequently, as the range of values for a ratio scale increases, the likelihood of needing to transform the data in order to homogenize variance increases.

Building models may involve a number of types of variables. The response variable reflects the outcome to be modeled. Any variable used as input to the model provides a predictor. Such predictors may involve fixed values, like gender, and therefore may be referred to as fixed effects (or fixed predictors). In contrast, predictors with random values, such as familial likeness, are referred to as random effects (or random predictors). In many cases only the impact of such a random effect on the response can be observed. In some cases, such values are assumed to follow a Gaussian distribution with mean zero. Nuisance variables do not lie at the focus of the current research, yet they contribute to predicting the response. Synonyms for nuisance include control (a term often used in the behavioral sciences) and confounder (a term often used in epidemiology). An experiment involves random assignment to a predictor value (called the treatment level), while an observational study (for example, a survey) does not. The terms independent variable and dependent variable are reserved to describe the predictor and response variables for an experiment.

The practice of statistics begins with the definition of a population, defined as any set of interest. Any subset of a population forms a sample. Note that no standard of sampling quality is implied. Everyone in fifth grade in a single city provides a sample of all fifth graders in the nation, but perhaps a misleading one. Similarly, a parameter is any property of a population, while a statistic is any property of a sample.

The axiomatic mathematical system referred to as probability theory provides the tools needed for describing the likelihood of observing a particular sample (one event in the population) and an associated statistic. The validity of such calculations depends strongly on the accuracy of the assumptions about the population, the model. A careful construction of probability theory starts with a detailed consideration of sets and properties of collections of sets. A random variable maps sets (events) into real numbers, and thereby defines a set function.

As with any function, each input produces only one output, although more than one input may produce the same output. These definitions are intended to remind the reader of the feel of the theory. Many technical issues are ignored here for the sake of brevity.

1.2.2 Statistical Activities

One common notation for the study of probability uses Greek letters for parameters, uppercase Roman letters for random variables, and lowercase Roman letters for realizations of random variables (particular sample values). This convention conflicts with the need to distinguish between scalars, vectors, and matrices. The dominance of matrix expressions throughout this book means that the latter requirement takes precedence. Standard probability notation is used only where it does not conflict with matrix conventions. Hence the reader must often distinguish fixed from random, known from unknown, observed from unobserved, and observable from unobservable from the context of the discussion. When in doubt about a particular item, simply search backward in the text to discover where the concept was introduced, or consult the index.

Estimation provides a value, $\hat{\theta}$, the estimate, thought to be a good approximation of a parameter, $\theta$. Many optimality criteria for estimation have been proposed, including consistency ($\hat{\theta} \to \theta$ as $N \to \infty$), unbiasedness ($E(\hat{\theta}) = \theta$ for any $N$), least squares ($\hat{\theta}$ minimizes $\sum_{i=1}^{N} \hat{e}_i^2$), and maximum likelihood ($\hat{\theta}$ maximizes the probability of the sample). The choice of criterion and estimator depends upon the nature of the parameter and the purpose of the analysis. (A small worked illustration appears at the end of this subsection.)

In the context of current statistical practice, testing a hypothesis corresponds to specifying the probability of some guess (hypothesis) about a state of nature (a parameter). Hypothesis testing comprises one part of statistical inference, as does the construction of confidence intervals. Exploratory analysis includes any result that depends on the data at hand. Confirmatory analysis requires choosing the variables, model, and hypothesis test without knowledge of the data at hand. See Muller, Barton, and Benignus (1984) for further discussion of the distinction. Unlike probability theory, all of statistical theory and practice concerns either estimation or inference.

Recognition of the importance of personal and institutional values in such activities led to the study of decision theory, which blends mathematics, philosophy (especially ethics), economics, and psychology. As of this writing, the most common approaches to making decisions with statistical tools may be described as Neyman-Pearson, Bayesian, or Fisherian (each name refers to statisticians identified with the approach). The most popular, the Neyman-Pearson approach, is used throughout this book. Consult Dawid (1983), Fraser (1983), and Koch and Gillings (1983) for further discussion. Each approach has appealing properties and disturbing limitations. Ideally, more general methods will be developed that resolve the conflicts plaguing discussions of statistical practice. Until that time, the reader should recognize that the richness of real data usually provides many paths to the proper conclusion.

Statistics provides tools for the scientist. At the end of an analysis, the scientist must decide: Does this drug help? Will this airplane fly? The accumulation and quantification of information across studies remains at the heart of science, yet this practice has only recently gained the attention of statisticians.
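As a brief worked illustration of the estimation criteria listed above, consider the simplest case one might assume, $y_i = \mu + e_i$ for a sample of size $N$, with independent errors satisfying $E(e_i) = 0$ (a toy model chosen here purely for illustration). The sample mean $\bar{y}$ is then both unbiased for $\mu$ and the least squares estimate of $\mu$:

$$
E(\bar{y}) = \frac{1}{N}\sum_{i=1}^{N} E(y_i) = \mu ,
\qquad
\frac{d}{d\hat{\mu}} \sum_{i=1}^{N} (y_i - \hat{\mu})^2
  = -2 \sum_{i=1}^{N} (y_i - \hat{\mu}) = 0
  \;\Longrightarrow\; \hat{\mu} = \bar{y} .
$$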
1.2.3 Error Rates: Evaluating Tests

One traditional interpretation of the Neyman-Pearson approach considers four decision probabilities, as represented in Table 1.2.1. Each concerns either the null hypothesis, $H_0$, or the alternative, $H_A$.

TABLE 1.2.1 Traditional Approach (Not Used in This Text)

                             Decision
  Truth              H_0: No Effect                               H_A: Effect
  H_0: No Effect     Pr{Correct Negative} = (1 - α)               Pr{False Positive} = Pr{Type I error} = α
  H_A: Effect        Pr{False Negative} = Pr{Type II error} = β   Pr{Correct Positive} = (1 - β) = Power

This format is mostly ignored in this book. The concepts of type I and type II errors remain useful and in vogue. However, current practice in mathematical statistics involves a more general definition for power (Hogg and Craig 1995, 285). Statistical power is defined as the probability of rejecting the null hypothesis, whether or not the null is true. For the tests considered in this book, power under the null becomes α. This more general approach has philosophical and mathematical advantages. See chapter 17 for a detailed introduction to power analysis of linear models.

1.3 GLM Definition

The name General Linear Model (GLM) represents its features. General indicates the wide applicability of the model to problems of estimation and testing of hypotheses about parameters. Linear refers to the regression function, $E[y_i \mid \mathrm{row}_i(X)]$, being a linear function of the parameters. A model provides a description of the relationship between (one) response and (many) predictor variables. In this book we consider only models with a single response, although such univariate models may include many predictors. The model is a statistical model, expressing observable random variables as a function of unobservable constants and unobservable random variables.

The simplest model function has one predictor:

    $y_i = \beta_0 + x_i \beta_1 + e_i$.    (1.3.1)

The corresponding regression function takes the form

    $E(y_i) = \beta_0 + x_i \beta_1$.    (1.3.2)

Note that a function maps each input to one and only one output, although different inputs may lead to the same output. A function may be many to one, or one to one, but not one to many.

1.4 GLM Examples

It has been suggested that the distance from one outstretched fingertip to the other equals one's height. This corresponds to the simple statement

    Wingspan = Height,    (1.4.1)

which may be seen to be a special case of (1.3.2), with $\beta_0 = 0$, $\beta_1 = 1$, and no error. Consider

    $E(y_i) = x_i = 0 + x_i \cdot 1 = \beta_0 + x_i \beta_1$.    (1.4.2)
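Looking ahead to the computing emphasis of the book, a model such as (1.3.1) can be fit directly in SAS with PROC REG, and a TEST statement can examine the specific values $\beta_0 = 0$ and $\beta_1 = 1$ suggested by (1.4.1). The fragment below is only a sketch, not a program from the text: the data set WINGDATA and the variables WINGSPAN and HEIGHT are hypothetical names.

    proc reg data=wingdata;
       /* Fit model (1.3.1): wingspan = beta0 + beta1*height + error */
       model wingspan = height;
       /* Joint check of the values suggested by (1.4.1): beta0 = 0 and beta1 = 1 */
       test intercept = 0, height = 1;
    run;

Later chapters develop the estimation and testing machinery on which such a program relies.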

Allowing for an unobservable, additive random error, specific to the person, leads to a special case of model (1.3.1), namely

    $y_i = 0 + x_i \cdot 1 + e_i$.    (1.4.3)

An obvious question arises: how might one judge the validity of reducing model (1.3.1) to (1.4.3)? This corresponds to testing $H_0$: $\beta_0 = 0$ and $H_0$: $\beta_1 = 1$ simultaneously. An obvious start would be to collect observations (Wingspan and Height measurements for each person) and plot the measurement pairs.

The simple model described above may be generalized in many ways:

  Example 2: Add gender as a predictor (which allows distinct intercepts).
  Example 3: Allow different slopes for different genders.
  Example 4: Add age as a predictor.
  Example 5: Use only ethnic origin as a predictor.

The corresponding models may be described in the following ways (a sketch of how such models might be specified in SAS appears at the end of the chapter):

  Model for Example 2: analysis of covariance (ANCOVA).
  Model for Example 3: full model in every cell.
  Model for Example 4: multiple regression with two continuous predictors.
  Model for Example 5: one-way analysis of variance (ANOVA).

1.5 Student Goals

This book was written to help the student of linear models achieve two related goals. The first goal seems obvious: learn to analyze and interpret linear models of interval-scale responses. Equally important, the student should strive to understand the associated theory well enough to know when not to apply the methods.

The structure of the book helps to meet these goals. In order to provide a compact yet careful treatment of both practice and theory, essentially all proofs are omitted. The text of the book may thus serve as a convenient reference for a data analyst familiar with the GLM. However, in order to learn the material, the reader must also work all of the exercises (except for the few indicated as optional). Although numerous and time consuming, the exercises are condensed and sequenced so as to minimize the total amount of work. Consequently, skipping exercises sometimes makes later ones unnecessarily more difficult. Only data from actual scientific research have been used for the exercises, while the few examples in the text all use simulated data. Using simulated data ensures the validity of any claims about the examples. However, simulated data never create the challenge or provide the enjoyment of real data. The exercises are designed to allow substitution of data from the student's own research; likewise, the instructor can tailor the work to student needs.

1.6 Homework Exercises

1.1 Consider consulting with a nephrologist (a physician specializing in kidney disease) who asks for assistance in understanding a number of basic concepts in statistics. Throughout this homework, use examples and situations relevant to the person with whom you are consulting.

Give examples of the following types of measurement scales.

  1.1.1 Nominal
  1.1.2 Ordinal
  1.1.3 Interval
  1.1.4 Ratio

1.2 The NIDDK (U.S. National Institute of Diabetes and Digestive and Kidney Diseases) has an ongoing study of the natural history of diabetes. Adults were enrolled at a number of sites, scattered across the United States, with varying demographics. Particular interest centers on the impact of lifestyle and diet. Subjects are examined repeatedly over time, for many years.

  1.2.1 Describe the apparent target population of interest for the sponsors.
  1.2.2 What samples were taken?
  1.2.3 Describe one parameter that seemed of interest to the sponsors.
  1.2.4 What statistic might sensibly be used to estimate the parameter?

1.3 Consider a nephrologist studying the impact of adding a new drug to current therapy for the treatment of a glomerulopathy (a disease of the filter system in the kidney). The physician decides to measure creatinine level in the blood prior to treatment and again after the treatment, with the aim of reducing the creatinine level.

  1.3.1 In testing the hypothesis of equal decrement in both groups, describe the decision made (in terms of patient health) if a type I error has been made.
  1.3.2 Similarly, describe the decision made if a type II error has been made.
  1.3.3 Similarly, explain the power of the test.
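As noted in section 1.4, the model variations in Examples 2 through 5 can be specified compactly in SAS. The fragments below are only sketches, not programs from the text: the data set STUDY and the variables WINGSPAN, HEIGHT, GENDER, AGE, and ETHNIC are hypothetical names, and later chapters treat each model type in detail.

    proc glm data=study;
       class gender;
       /* Example 2: ANCOVA; distinct intercepts for gender, common slope for height */
       model wingspan = gender height;
    run;

    proc glm data=study;
       class gender;
       /* Example 3: full model in every cell; the interaction allows distinct slopes */
       model wingspan = gender height gender*height;
    run;

    proc reg data=study;
       /* Example 4: multiple regression with two continuous predictors */
       model wingspan = height age;
    run;

    proc glm data=study;
       class ethnic;
       /* Example 5: one-way ANOVA with ethnic origin as the only predictor */
       model wingspan = ethnic;
    run;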