Introduction to Statistics and R


Mayo-Illinois Computational Genomics Workshop (2018)

Ruoqing Zhu, Ph.D.
Department of Statistics, UIUC
rqzhu@illinois.edu

June 18, 2018

Abstract

This document is a supplementary file for the lecture slides. The .pdf file is created using R Markdown from a .Rmd source file and compiled through RStudio. Note that this will require installing several additional packages.

1 Using R and RStudio

Most of the basic R mathematical and statistical functions are self-explanatory. Note that lines with a # in front are comments, which will not be executed. Lines with ## in front are outputs.

    # Basic mathematical operations
    2 + 2
    ## [1] 4
    3*5
    ## [1] 15
    3^5
    ## [1] 243
    exp(2)
    ## [1] 7.389056
    log(3)
    ## [1] 1.098612
    log2(3)
    ## [1] 1.584963
    factorial(5)
    ## [1] 120

Data objects can be a complicated topic for people who have never used R before. The most common data objects are vector, matrix, list, and data.frame. Operations on vectors and matrices are fairly intuitive.

    # creating a vector
    c(1,2,3,4)
    ## [1] 1 2 3 4
    c("a", "b", "c")
    ## [1] "a" "b" "c"
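The other two common objects, list and data.frame, work in a similar way; here is a minimal sketch before we move on to the matrix examples below (the object names mylist and mydat are just for illustration).

    # a list can hold elements of different types and lengths
    mylist = list(num = c(1, 2, 3), char = c("a", "b"))
    mylist$num
    ## [1] 1 2 3

    # a data.frame is a table whose columns may have different types
    mydat = data.frame(id = c(1, 2, 3), group = c("a", "a", "b"))
    dim(mydat)
    ## [1] 3 2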

    # creating a matrix from a vector
    matrix(c(1,2,3,4), 2, 2)
    ##      [,1] [,2]
    ## [1,]    1    3
    ## [2,]    2    4

    x = c(1,1,1,0,0,0); y = c(1,0,1,0,1,0)
    cbind(x,y)
    ##      x y
    ## [1,] 1 1
    ## [2,] 1 0
    ## [3,] 1 1
    ## [4,] 0 0
    ## [5,] 0 1
    ## [6,] 0 0

    # matrix multiplication using '%*%'
    matrix(c(1,2,3,4), 2, 2) %*% t(cbind(x,y))
    ##      [,1] [,2] [,3] [,4] [,5] [,6]
    ## [1,]    4    1    4    0    3    0
    ## [2,]    6    2    6    0    4    0

Simple mathematical operations on vectors and matrices are usually element-wise. You can easily extract certain elements using the [] operator, similar to C-style array indexing.

    # some simple operations
    x[3]
    ## [1] 1
    x[2:5]
    ## [1] 1 1 0 0
    cbind(x,y)[1:2, ]
    ##      x y
    ## [1,] 1 1
    ## [2,] 1 0
    (x + y)^2
    ## [1] 4 1 4 0 1 0
    length(x)
    ## [1] 6
    dim(cbind(x,y))
    ## [1] 6 2

    # A warning will be issued when R detects something wrong. Results may still be produced.
    x + c(1,2,3,4)
    ## Warning in x + c(1, 2, 3, 4): longer object length is not a multiple of shorter object length
    ## [1] 2 3 4 4 1 2
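Elements can also be extracted with a logical vector inside []; a small illustration using the same x and y as above:

    # element-wise comparison returns a logical vector
    x == y
    ## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE

    # logical indexing: keep the elements of x where y equals 1
    x[y == 1]
    ## [1] 1 1 0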

If you want to see more information about a particular function or operator in R, the easiest way is to open its reference document. Put a question mark in front of the function name:

    # In a default R console window, this will open up a web browser.
    # In RStudio, this will be displayed in the Help window at the bottom-right panel.
    ?log2
    ?matrix

2 Basic Statistical Concepts

Now, let's start to work on some real datasets. Load the Prostate Cancer Data (prostate) from the ElemStatLearn package. R packages are collections of functions and data sets. They are developed and shared by the community and extend the existing base R functionality. ElemStatLearn is a package developed for the textbook The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The prostate dataset comes from a study by Stamey et al. (1989) that examined the prostate-specific antigen (PSA) and several clinical measures in 97 men. If you have never installed this package before, type the following code in your R console window.

    install.packages("ElemStatLearn")

Now, let's load this package and have a look at the dataset.

    library(ElemStatLearn)
    data(prostate)

    # we will remove the last column since it's not used in this analysis
    prostate = prostate[, -10]
    dim(prostate)
    ## [1] 97  9
    head(round(prostate, 3))
    ##   lcavol lweight age lbph svi lcp gleason pgg45 lpsa

The dataset contains 8 clinical variables and 1 outcome (lpsa). Among them, svi and gleason are categorical variables.

Name      Attribute
lcavol    log cancer volume
lweight   log prostate weight
age       age in years
lbph      log of benign prostatic hyperplasia amount
svi       seminal vesicle invasion
lcp       log of capsular penetration
gleason   Gleason score
pgg45     percent of Gleason scores 4 or 5
lpsa      log of prostate specific antigen
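Other quick ways to inspect a newly loaded dataset include str(), which lists the type and first values of every column, and the dataset's own help page (assuming the package documents it):

    # compact summary of the structure of each column
    str(prostate)

    # documentation page for the dataset, if the package provides one
    ?prostate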

We can first perform some basic descriptive statistics. The summary() function provides a quick overview of the data and each of the variables. It is particularly useful for continuous variables: the quantiles and the mean are reported for each one. For categorical variables, we can use table() to count the observations in each category.

    summary(prostate)
    ## (Min., 1st Qu., Median, Mean, 3rd Qu. and Max. are reported for every variable)

    # the table() function can be used for categorical variables
    table(prostate$svi)
    table(prostate$gleason)
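table() can also cross-tabulate two categorical variables, and prop.table() converts counts into proportions; a small sketch:

    # two-way table of svi by gleason
    table(prostate$svi, prostate$gleason)

    # proportions instead of counts
    prop.table(table(prostate$svi))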

    # mean and sd are standard statistical functions
    mean(prostate$lcavol)
    sd(prostate$lcavol)

    # count frequencies for categorical variables
    table(prostate$gleason)

    # this statement creates multiple plots in the same row
    par(mfrow=c(1,2))

    # we can view a single continuous variable using a histogram
    hist(prostate$lcavol)
    hist(prostate$lpsa)

[Figure: histograms of prostate$lcavol and prostate$lpsa, with frequency on the vertical axis]
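A continuous variable can also be compared across the levels of a categorical variable; for instance, a boxplot of lpsa within each svi group (a small illustrative sketch):

    # boxplot of the outcome within each svi group
    boxplot(lpsa ~ svi, data = prostate,
            xlab = "seminal vesicle invasion", ylab = "log PSA")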

For two random variables, we usually use the correlation coefficient to describe how related the two variables are.

    plot(prostate$lcavol, prostate$lpsa, pch = 19, cex = 0.7,
         xlab = "cancer volume", ylab = "prostate specific antigen")

[Figure: scatter plot of prostate specific antigen against cancer volume]

    cor(prostate$lcavol, prostate$lpsa)

    # the correlation matrix can be calculated for each pair of variables
    round(cor(prostate), 3)
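By default cor() computes the Pearson correlation. Rank-based alternatives are available through the method argument, and cor.test() gives a formal test of zero correlation; for example:

    # rank-based correlation coefficients
    cor(prostate$lcavol, prostate$lpsa, method = "spearman")
    cor(prostate$lcavol, prostate$lpsa, method = "kendall")

    # test of the null hypothesis that the correlation is zero
    cor.test(prostate$lcavol, prostate$lpsa)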

Some R packages provide really nice plots of the correlation matrix.

    library(corrplot)
    corrplot(cor(prostate), type = "upper", tl.col = "black", tl.srt = 35)

[Figure: correlation plot of all pairs of variables, upper triangle]

    pairs(~., data = prostate[, -c(5, 7)], pch = 19, cex = 0.5, col = "deepskyblue")

[Figure: pairwise scatter plots of lcavol, lweight, age, lbph, lcp, pgg45 and lpsa]

To demonstrate different correlation coefficients, we simulate some random data.

    # I will set a random seed so that the results can be replicated.
    set.seed(1)

    # generate 1000 independent normal observations
    n = 1000; X = rnorm(n)

    par(mfrow = c(2,3), oma = c(2,2,0,0) + 0.1, mar = c(2,1,2,1) + 0.1)

    # strong positive correlation
    plot(X, X + rnorm(n, 0, 0.5), pch = 19, cex = 0.5)
    title("Strong Positive Correlation (0.9)")

    # moderate positive correlation
    plot(X, X + rnorm(n, 0, 2), pch = 19, cex = 0.5)
    title("Moderate Positive Correlation (0.5)")

    # zero correlation due to symmetry
    plot(X, X^2 + rnorm(n, 0, 0.5), pch = 19, cex = 0.5)
    title("(Nearly) Zero Correlation")

    # zero correlation due to independence
    plot(X, rnorm(n), pch = 19, cex = 0.5)
    title("(Nearly) Zero Correlation")

    # moderate negative correlation
    plot(X, - X + rnorm(n, 0, 2), pch = 19, cex = 0.5)
    title("Moderate Negative Correlation (-0.5)")

    # strong negative correlation
    plot(X, - X + rnorm(n, 0, 0.5), pch = 19, cex = 0.5)
    title("Strong Negative Correlation (-0.9)")

[Figure: six scatter plots of the simulated pairs, with the panel titles given in the code above]
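The numbers in the titles are the approximate population correlations implied by the noise levels; you can check them empirically with cor() on a fresh draw of the noise, for example:

    # sample correlation for the first setting, approximately 0.9
    cor(X, X + rnorm(n, 0, 0.5))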

3 Hypothesis Testing

3.1 Example 1: t test

Let's perform a t test of the outcome lpsa between the two levels of the svi indicator.

    t.test(lpsa ~ svi, data = prostate)
    ##  Welch Two Sample t-test
    ##
    ## data:  lpsa by svi
    ## p-value = 7.879e-08
    ## alternative hypothesis: true difference in means is not equal to 0

The test shows a p-value less than 10^-4, which is considered highly significant. Note that this test can be carried out in several different ways depending on how your data are organized. For example, the following approach gives the same result.

    g1 = prostate$lpsa[prostate$svi == 1]
    g0 = prostate$lpsa[prostate$svi == 0]
    t.test(g0, g1)
    ##  Welch Two Sample t-test
    ##
    ## data:  g0 and g1
    ## p-value = 7.879e-08
    ## alternative hypothesis: true difference in means is not equal to 0
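If the normality assumption behind the t test is in doubt, a nonparametric alternative is the Wilcoxon rank-sum test, which uses the same formula interface; for example:

    # rank-based two-sample test of lpsa between the svi groups
    wilcox.test(lpsa ~ svi, data = prostate)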

3.2 Example 2: simple linear regression

An alternative approach for this question is to use a regression model, or ANOVA. We consider modeling the relationship as

    lpsa ~ β0 + β1 * svi

    fit = glm(lpsa ~ svi, data = prostate, family = "gaussian")
    summary(fit)

In the fitted model, both coefficients are highly significant: the intercept has a p-value below 2e-16, and svi has a p-value on the order of 1e-09, consistent with the t test above.
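Because the outcome is continuous and the family is gaussian, the same model can equivalently be fitted with lm(), and anova() then prints the classical ANOVA table; a quick sketch:

    # equivalent least-squares fit
    fit.lm = lm(lpsa ~ svi, data = prostate)
    summary(fit.lm)

    # ANOVA table for the same comparison
    anova(fit.lm)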

3.3 Example 3: multiple linear regression

We can further incorporate more variables in the regression model, for example

    lpsa ~ β0 + β1 * svi + β2 * lcavol + β3 * age

    fit = glm(lpsa ~ svi + lcavol + age, data = prostate, family = "gaussian")
    summary(fit)

In the fitted model, lcavol is the strongest predictor (p-value on the order of 1e-11), svi remains significant at the 0.01 level, and age is not significant.

3.4 Example 4: multiple testing

Consider a situation where two variables x and y are completely independent. However, as we repeatedly perform the test on newly generated samples, some of the results will turn out significant.

    pvalues = rep(NA, 1000)
    for (i in 1:1000)
    {
      x = rbinom(20, 1, 0.5)
      y = rnorm(20)
      pvalues[i] = t.test(y ~ x)$p.value
    }

    table(pvalues <= 0.05)

Of course, in this example we expect to see around 5% significant results; that is how the test is designed. But this can be a big issue in genomic studies, where we need to test thousands of genes one by one. One solution is the Bonferroni correction, which lowers the α level to account for the number of hypothesis tests being performed.
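In R, multiple-testing corrections are available through p.adjust(), which supports the Bonferroni adjustment as well as false discovery rate methods; a minimal sketch using the simulated p-values above:

    # adjusted p-values: Bonferroni controls the family-wise error rate,
    # Benjamini-Hochberg ("BH") controls the false discovery rate
    sum(p.adjust(pvalues, method = "bonferroni") <= 0.05)
    sum(p.adjust(pvalues, method = "BH") <= 0.05)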

3.5 Example 5: high-dimensional data

Consider a situation where the number of variables is very large. We will have a very unstable system: too many variables appear to be significant, and the parameter estimates are unreliable. When the number of variables is larger than the number of observations, the standard linear model cannot even be fitted.

    # 55 observations with 50 variables
    set.seed(1)
    n = 55; p = 50
    x = matrix(rnorm(n*p), n, p)

    # y is independent of x
    y = rnorm(n)
    allcoef = summary(lm(y~x))$coefficients

    # just the variables that are estimated to be significant
    allcoef[allcoef[,4] < 0.05, ]

Even though y is pure noise, several of the 50 variables are flagged as significant at the 0.05 level.

Some machine learning models can help in this situation, for example the lasso regression. It incorporates a penalty that focuses on the important variables (if there are any) and tries to eliminate variables that are pure noise. This is done using the glmnet package.

    library(glmnet)
    lasso.fit = cv.glmnet(x, y)

    # the lasso model estimates all coefficients to be 0 except the intercept term
    sum(as.numeric(coef(lasso.fit)) != 0)
    ## [1] 1

Some other machine learning models, such as random forests, can also be used to handle high-dimensional data and estimate the effect of each variable. For more details, please see the documentation of the R package randomForest.
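As a rough sketch of that last point, a random forest can be fitted to the same simulated data and its variable importance measures inspected; since y is independent of x, none of the importance scores should stand out:

    library(randomForest)

    # fit a random forest to the noise-only data
    rf.fit = randomForest(x, y, importance = TRUE)

    # permutation-based importance for each of the 50 variables
    head(importance(rf.fit))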
