pensim Package Example (Version 1.2.9)

Size: px
Start display at page:

Download "pensim Package Example (Version 1.2.9)"

Transcription

1 pensim Package Example (Version 1.2.9) Levi Waldron March 13, 2014 Contents 1 Introduction 1 2 Example data 2 3 Nested cross-validation Summarization and plotting Getting coefficients for model fit on all the data 4 5 Simulation of collinear high-dimensional data with survival or binary outcome Binary outcome example Survival outcome example Introduction This package acts as a wrapper to the penalized R package to add the following functionality to that package: ˆ repeated tuning on different data folds, with parallelization to take advantage of multiple processors ˆ two-dimensional tuning of the Elastic Net ˆ calculation of cross-validated risk scores through a nested cross-validation procedure, with parallelization to take advantage of multiple processors It also provides a function for simulation of collinear high-dimensional data with survival or binary response. This package was developed in support of the study by Waldron et al.[3]. This paper contains greater detail on proper application of the methods provided here. Please cite this paper when using the pensim package in your research, as well as the penalized package[2]. 1

2 2 Example data pensim provides example data from a microarray experiment investigating survival of cancer patients with lung adenocarcinomas (Beer et al., 2002)[1]. Load this data and do an initial pre-filter of genes with low IQR: > library(pensim) > data(beer.exprs) > data(beer.survival) > ##select just 100 genes to speed computation, just for the sake of example: > set.seed(1) > beer.exprs.sample <- beer.exprs[sample(1:nrow(beer.exprs), 100), ] > # > gene.quant <- apply(beer.exprs.sample, 1, quantile, probs=0.75) > dat.filt <- beer.exprs.sample[gene.quant>log2(100), ] > gene.iqr <- apply(dat.filt, 1, IQR) > dat.filt <- as.matrix(dat.filt[gene.iqr>0.5, ]) > dat.filt <- t(dat.filt) > dat.filt <- data.frame(dat.filt) > # > library(survival) > surv.obj <- Surv(beer.survival$os, beer.survival$status) Note that the expression data are in wide format, with one column per predictor (gene). It is recommended to put covariate data in a dataframe object, rather than a matrix. 3 Nested cross-validation Unbiased estimation of prediction accuracy involves two levels of cross-validation: an outer level for estimating prediction accuracy, and an inner level for model tuning. This process is simplified by the opt.nested.crossval function. It is recommended first to establish the arguments for the penalized regression by testing on the penalized package functions optl1 (for LASSO), optl2 (for Ridge) or cvl (for Elastic Net). Here we use LASSO. Setting maxlambda1=5 is not a generally recommended procedure, but is useful in this toy example to avoid converging on the null model. > library(penalized) > testfit <- optl1(response=surv.obj, + penalized=dat.filt, + fold=5, + maxlambda1=5, + positive=false, + standardize=true, + trace=false) 2

3 Now pass these arguments to opt.nested.crossval() for cross-validated calculation and assessment of risk scores, with the additional arguments: ˆ outerfold and nprocessors: number of folds for the outer level of crossvalidation, and the number of processors to use for the outer level of cross-validation (see?opt.nested.crossval) ˆ optfun and scaling: which optimization function to use (opt1d for LASSO or Ridge, opt2d for Elastic Net) - see?opt.splitval. ˆ setpen and nsim: setpen defines whether to do LASSO ( L1 ) or Ridge ( L2 ), nsim defines the number of times to repeat tuning (see?opt1d. opt2d has different required arguments.) > set.seed(1) > preds <- opt.nested.crossval(outerfold=5, nprocessors=1, #opt.nested.crossval arguments + optfun="opt1d", scaling=false, #opt.splitval arguments + setpen="l1", nsim=1, #opt1d arguments + response=surv.obj, #rest are penalized::optl1 arguments + penalized=dat.filt, + fold=5, + positive=false, + standardize=true, + trace=false) Ideally nsim would be 50, and outerfold and fold would be 10, but the values below speed computation 200x compared to these recommended values. Note that here we are using the standardize=true argument of optl1 rather than the scaling=true argument of opt.splitval. These two approaches to scaling are roughly equivalent, but the scaling approaches are not the same (scaling=true does z-score, standardize=true scales to unit central L2 norm), and results will not be identical. Also, using standardize=true scales variables but provides coeffients for the original scale, whereas using scaling=true scales variables in the training set then applies the same scales to the test set. 3.1 Summarization and plotting Cox fit on the continuous risk predictions: > coxfit.continuous <- coxph(surv.obj~preds) > summary(coxfit.continuous) Call: coxph(formula = surv.obj ~ preds) n= 86, number of events= 24 coef exp(coef) se(coef) z Pr(> z ) 3

4 preds exp(coef) exp(-coef) lower.95 upper.95 preds Concordance= (se = ) Rsquare= (max possible= ) Likelihood ratio test= 2.01 on 1 df, p= Wald test = 2.22 on 1 df, p= Score (logrank) test = 2.3 on 1 df, p= Dichotomize the cross-validated risk predictions at the median, for visualization: > preds.dichot <- preds > median(preds) Plot the ROC curve: 4 Getting coefficients for model fit on all the data Finally, we can get coefficients for the model fit on all the data, for future use. Note that nsim should ideally be greater than 1, to train the model using multiple foldings for cross-validation. The output of opt1d or opt2d will be a matrix with one row per simulation. The default behavior in opt.nested.crossval() is to take the simulation with highest cross-validated partial log likelihood (CVL), which is the recommended way to select a model from the multiple simulations. > beer.coefs <- opt1d(setpen="l1",nsim=1, + response=surv.obj, + penalized=dat.filt, + fold=5, + maxlambda1=5, + positive=false, + standardize=true, + trace=false) We can also include unpenalized covariates, if desired. Note that when keeping only one variable for a penalized or unpenalized covariate, indexing a dataframe like [1] instead of doing [, 1] preserves the variable name. With [, 1] the variable name gets converted to. > beer.coefs.unpen <- opt1d(setpen="l1", nsim=1, + response=surv.obj, + penalized=dat.filt[-1], # This is equivalent to dat.filt[,-1] + unpenalized=dat.filt[1], 4

5 > if(require(survivalroc)){ + nobs <- length(preds) + cutoff < preds.roc <- survivalroc(stime=beer.survival$os, status=beer.survival$status, + marker=preds, predict.time=cutoff, span = 0.01*nobs^(-0.20)) + plot(preds.roc$fp, preds.roc$tp, type="l", xlim=c(0, 1), ylim=c(0, 1), + xlab=paste( "FP", "\n", "AUC = ", round(preds.roc$auc, 3)), + lty=2, + ylab="tp", main="lasso predictions\n ROC curve at 12 months") + abline(0, 1) + } LASSO predictions ROC curve at 12 months TP FP AUC = Figure 1: ROC plot of cross-validated continuous risk predictions at 12 months. Note that the predictions are better if you don t randomly select 250 genes to start with! We only did this to ease the load on the CRAN checking servers. 5

6 + fold=5, + maxlambda1=5, + positive=false, + standardize=true, + trace=false) Note the non-zero first coefficient this time, due to it being unpenalized: > beer.coefs[1, 1:5] #example output with no unpenalized covariates L1 cvl U21931_at Y00285_s_at S78187_at > beer.coefs.unpen[1, 1:5] #example output with first covariate unpenalized L1 cvl U21931_at Y00285_s_at S78187_at Simulation of collinear high-dimensional data with survival or binary outcome The pensim also provides a convenient means to simulation high-dimensional expression data with (potentially censored) survival outcome or binary outcome which is dependent on specified covariates. 5.1 Binary outcome example First, generate the data. Here we simulate 20 variables. The first 15 (group a ) are uncorrelated, and have no association with outcome. The final five (group b ) have covariance of 0.8 to each other variable in that group. The response variable is associated with the first variable group b (firstonly=true) with a coefficient of 2. Binary outcomes for n s = 50 samples are simulated as a Bernoulli distribution with probability for patient s: p s = exp( βx s ) with β s,16 = 0.5 and all other β s,i equal to zero. The code for this simulation is as follows: (1) > set.seed(9) > x <- create.data(nvars=c(15, 5), 6

7 + cors=c(0, 0.8), + associations=c(0, 2), + firstonly=c(true, TRUE), + nsamples=50, + response="binary", + logisticintercept=0.5) Take a look at the simulated data: > summary(x) Length Class Mode summary 6 data.frame list associations 20 -none- numeric covariance 400 -none- numeric data 21 data.frame list > x$summary start end cors associations num firstonly a TRUE b TRUE A simple logistic model fails at variable selection in this case: > simplemodel <- glm(outcome ~., data=x$data, family=binomial) > summary(simplemodel) Call: glm(formula = outcome ~., family = binomial, data = x$data) Deviance Residuals: Min 1Q Median 3Q Max e e e e e-04 Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) a a a a a a a a a

8 a a a a a a b b b b b (Dispersion parameter for binomial family taken to be 1) Null deviance: e+01 on 49 degrees of freedom Residual deviance: e-07 on 29 degrees of freedom AIC: 42 Number of Fisher Scoring iterations: 25 But LASSO does a better job, selecting several of the collinear variables in the b group of variables which are associated with outcome: > lassofit <- opt1d(nsim=3, nprocessors=1, setpen="l1", penalized=x$data[1:20], + response=x$data[, "outcome"], trace=false, fold=10) > print(lassofit) L1 cvl (Intercept) a.1 a.2 [1,] [2,] [3,] a.3 a.4 a.5 a.6 a.7 a.8 a.9 [1,] [2,] [3,] a.10 a.11 a.12 a.13 a.14 a.15 b.1 [1,] [2,] [3,] b.2 b.3 b.4 b.5 [1,] [2,] [3,] And visualize the data as a heatmap: 8

9 > dat <- t(as.matrix(x$data[, -match("outcome", colnames(x$data))])) > heatmap(dat, ColSideColors=ifelse(x$data$outcome==0, "black", "white")) a.3 a.2 a.12 a.5 a.10 a.6 a.9 a.1 a.15 a.8 a.14 a.13 a.11 a.4 a.7 b.5 b.4 b.3 b.1 b Figure 2: Heatmap of simulated data with binary response. 9

10 5.2 Survival outcome example We simulate these data in the same way, but with response= timetoevent. Here censoring is uniform random between times 2 and 10, generating approximately 34% censoring: > set.seed(1) > x <- create.data(nvars=c(15, 5), + cors=c(0, 0.8), + associations=c(0, 0.5), + firstonly=c(true, TRUE), + nsamples=50, + censoring=c(2, 10), + response="timetoevent") How many events are censored? > sum(x$data$cens==0)/nrow(x$data) [1] 0.42 Kaplan-Meier plot of this simulated cohort: 10

11 > library(survival) > surv.obj <- Surv(x$data$time, x$data$cens) > plot(survfit(surv.obj~1), ylab="survival probability", xlab="time") Survival probability time Figure 3: Kaplan-Meier plot of survival of simulated cohort. 11

12 References [1] David G. Beer, Sharon L.R. Kardia, Chiang-Ching Huang, Thomas J. Giordano, Albert M. Levin, David E. Misek, Lin Lin, Guoan Chen, Tarek G. Gharib, Dafydd G. Thomas, Michelle L. Lizyness, Rork Kuick, Satoru Hayasaka, Jeremy M.G. Taylor, Mark D. Iannettoni, Mark B. Orringer, and Samir Hanash. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med, 8(8): , [2] Jelle J Goeman. L1 penalized estimation in the cox proportional hazards model. Biometrical Journal. Biometrische Zeitschrift, 52(1):70 84, February PMID: [3] Levi Waldron, Melania Pintilie, Ming-Sound Tsao, Frances A Shepherd, Curtis Huttenhower, and Igor Jurisica. Optimized application of penalized regression methods to diverse genomic data. Bioinformatics, 27(24): , PMID: > sessioninfo() R Under development (unstable) ( r65113) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] splines stats graphics grdevices utils [6] datasets methods base other attached packages: [1] survivalroc_1.0.3 penalized_ survival_ [4] pensim_1.2.9 loaded via a namespace (and not attached): [1] MASS_ parallel_3.1.0 tools_

CGEN(Case-control.GENetics) Package

CGEN(Case-control.GENetics) Package CGEN(Case-control.GENetics) Package October 30, 2018 > library(cgen) Example of snp.logistic Load the ovarian cancer data and print the first 5 rows. > data(xdata, package="cgen") > Xdata[1:5, ] id case.control

More information

samplesizelogisticcasecontrol Package

samplesizelogisticcasecontrol Package samplesizelogisticcasecontrol Package January 31, 2017 > library(samplesizelogisticcasecontrol) Random data generation functions Let X 1 and X 2 be two variables with a bivariate normal ditribution with

More information

Matched Pair Data. Stat 557 Heike Hofmann

Matched Pair Data. Stat 557 Heike Hofmann Matched Pair Data Stat 557 Heike Hofmann Outline Marginal Homogeneity - review Binary Response with covariates Ordinal response Symmetric Models Subject-specific vs Marginal Model conditional logistic

More information

Booklet of Code and Output for STAD29/STA 1007 Midterm Exam

Booklet of Code and Output for STAD29/STA 1007 Midterm Exam Booklet of Code and Output for STAD29/STA 1007 Midterm Exam List of Figures in this document by page: List of Figures 1 NBA attendance data........................ 2 2 Regression model for NBA attendances...............

More information

Booklet of Code and Output for STAD29/STA 1007 Midterm Exam

Booklet of Code and Output for STAD29/STA 1007 Midterm Exam Booklet of Code and Output for STAD29/STA 1007 Midterm Exam List of Figures in this document by page: List of Figures 1 Packages................................ 2 2 Hospital infection risk data (some).................

More information

β j = coefficient of x j in the model; β = ( β1, β2,

β j = coefficient of x j in the model; β = ( β1, β2, Regression Modeling of Survival Time Data Why regression models? Groups similar except for the treatment under study use the nonparametric methods discussed earlier. Groups differ in variables (covariates)

More information

Package penalized. February 21, 2018

Package penalized. February 21, 2018 Version 0.9-50 Date 2017-02-01 Package penalized February 21, 2018 Title L1 (Lasso and Fused Lasso) and L2 (Ridge) Penalized Estimation in GLMs and in the Cox Model Author Jelle Goeman, Rosa Meijer, Nimisha

More information

Survival Regression Models

Survival Regression Models Survival Regression Models David M. Rocke May 18, 2017 David M. Rocke Survival Regression Models May 18, 2017 1 / 32 Background on the Proportional Hazards Model The exponential distribution has constant

More information

cor(dataset$measurement1, dataset$measurement2, method= pearson ) cor.test(datavector1, datavector2, method= pearson )

cor(dataset$measurement1, dataset$measurement2, method= pearson ) cor.test(datavector1, datavector2, method= pearson ) Tutorial 7: Correlation and Regression Correlation Used to test whether two variables are linearly associated. A correlation coefficient (r) indicates the strength and direction of the association. A correlation

More information

Part [1.0] Measures of Classification Accuracy for the Prediction of Survival Times

Part [1.0] Measures of Classification Accuracy for the Prediction of Survival Times Part [1.0] Measures of Classification Accuracy for the Prediction of Survival Times Patrick J. Heagerty PhD Department of Biostatistics University of Washington 1 Biomarkers Review: Cox Regression Model

More information

Logistic Regression 21/05

Logistic Regression 21/05 Logistic Regression 21/05 Recall that we are trying to solve a classification problem in which features x i can be continuous or discrete (coded as 0/1) and the response y is discrete (0/1). Logistic regression

More information

Relative-risk regression and model diagnostics. 16 November, 2015

Relative-risk regression and model diagnostics. 16 November, 2015 Relative-risk regression and model diagnostics 16 November, 2015 Relative risk regression More general multiplicative intensity model: Intensity for individual i at time t is i(t) =Y i (t)r(x i, ; t) 0

More information

Multivariable Fractional Polynomials

Multivariable Fractional Polynomials Multivariable Fractional Polynomials Axel Benner September 7, 2015 Contents 1 Introduction 1 2 Inventory of functions 1 3 Usage in R 2 3.1 Model selection........................................ 3 4 Example

More information

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis Lecture 6: Logistic Regression Analysis Christopher S. Hollenbeak, PhD Jane R. Schubart, PhD The Outcomes Research Toolbox Review Homework 2 Overview Logistic regression model conceptually Logistic regression

More information

Linear Regression Models P8111

Linear Regression Models P8111 Linear Regression Models P8111 Lecture 25 Jeff Goldsmith April 26, 2016 1 of 37 Today s Lecture Logistic regression / GLMs Model framework Interpretation Estimation 2 of 37 Linear regression Course started

More information

Logistic Regressions. Stat 430

Logistic Regressions. Stat 430 Logistic Regressions Stat 430 Final Project Final Project is, again, team based You will decide on a project - only constraint is: you are supposed to use techniques for a solution that are related to

More information

Survival analysis in R

Survival analysis in R Survival analysis in R Niels Richard Hansen This note describes a few elementary aspects of practical analysis of survival data in R. For further information we refer to the book Introductory Statistics

More information

Logistic & Tobit Regression

Logistic & Tobit Regression Logistic & Tobit Regression Different Types of Regression Binary Regression (D) Logistic transformation + e P( y x) = 1 + e! " x! + " x " P( y x) % ln$ ' = ( + ) x # 1! P( y x) & logit of P(y x){ P(y

More information

Generalized linear models

Generalized linear models Generalized linear models Douglas Bates November 01, 2010 Contents 1 Definition 1 2 Links 2 3 Estimating parameters 5 4 Example 6 5 Model building 8 6 Conclusions 8 7 Summary 9 1 Generalized Linear Models

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent

More information

The NanoStringDiff package

The NanoStringDiff package The NanoStringDiff package Hong Wang 1, Chi Wang 2,3* 1 Department of Statistics, University of Kentucky,Lexington, KY; 2 Markey Cancer Center, University of Kentucky, Lexington, KY ; 3 Department of Biostatistics,

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Survival analysis in R

Survival analysis in R Survival analysis in R Niels Richard Hansen This note describes a few elementary aspects of practical analysis of survival data in R. For further information we refer to the book Introductory Statistics

More information

Multivariable Fractional Polynomials

Multivariable Fractional Polynomials Multivariable Fractional Polynomials Axel Benner May 17, 2007 Contents 1 Introduction 1 2 Inventory of functions 1 3 Usage in R 2 3.1 Model selection........................................ 3 4 Example

More information

Visualising very long data vectors with the Hilbert curve Description of the Bioconductor packages HilbertVis and HilbertVisGUI.

Visualising very long data vectors with the Hilbert curve Description of the Bioconductor packages HilbertVis and HilbertVisGUI. Visualising very long data vectors with the Hilbert curve Description of the Bioconductor packages HilbertVis and HilbertVisGUI. Simon Anders European Bioinformatics Institute, Hinxton, Cambridge, UK sanders@fs.tum.de

More information

Package covtest. R topics documented:

Package covtest. R topics documented: Package covtest February 19, 2015 Title Computes covariance test for adaptive linear modelling Version 1.02 Depends lars,glmnet,glmpath (>= 0.97),MASS Author Richard Lockhart, Jon Taylor, Ryan Tibshirani,

More information

Generalized Linear Models in R

Generalized Linear Models in R Generalized Linear Models in R NO ORDER Kenneth K. Lopiano, Garvesh Raskutti, Dan Yang last modified 28 4 2013 1 Outline 1. Background and preliminaries 2. Data manipulation and exercises 3. Data structures

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R 2nd Edition Brian S. Everitt and Torsten Hothorn CHAPTER 7 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, Colonic

More information

Introduction to pepxmltab

Introduction to pepxmltab Introduction to pepxmltab Xiaojing Wang October 30, 2018 Contents 1 Introduction 1 2 Convert pepxml to a tabular format 1 3 PSMs Filtering 4 4 Session Information 5 1 Introduction Mass spectrometry (MS)-based

More information

Generalized linear models for binary data. A better graphical exploratory data analysis. The simple linear logistic regression model

Generalized linear models for binary data. A better graphical exploratory data analysis. The simple linear logistic regression model Stat 3302 (Spring 2017) Peter F. Craigmile Simple linear logistic regression (part 1) [Dobson and Barnett, 2008, Sections 7.1 7.3] Generalized linear models for binary data Beetles dose-response example

More information

TMA 4275 Lifetime Analysis June 2004 Solution

TMA 4275 Lifetime Analysis June 2004 Solution TMA 4275 Lifetime Analysis June 2004 Solution Problem 1 a) Observation of the outcome is censored, if the time of the outcome is not known exactly and only the last time when it was observed being intact,

More information

The coxvc_1-1-1 package

The coxvc_1-1-1 package Appendix A The coxvc_1-1-1 package A.1 Introduction The coxvc_1-1-1 package is a set of functions for survival analysis that run under R2.1.1 [81]. This package contains a set of routines to fit Cox models

More information

Survival Analysis. STAT 526 Professor Olga Vitek

Survival Analysis. STAT 526 Professor Olga Vitek Survival Analysis STAT 526 Professor Olga Vitek May 4, 2011 9 Survival Data and Survival Functions Statistical analysis of time-to-event data Lifetime of machines and/or parts (called failure time analysis

More information

Lecture 12. Multivariate Survival Data Statistics Survival Analysis. Presented March 8, 2016

Lecture 12. Multivariate Survival Data Statistics Survival Analysis. Presented March 8, 2016 Statistics 255 - Survival Analysis Presented March 8, 2016 Dan Gillen Department of Statistics University of California, Irvine 12.1 Examples Clustered or correlated survival times Disease onset in family

More information

Disease Ontology Semantic and Enrichment analysis

Disease Ontology Semantic and Enrichment analysis Disease Ontology Semantic and Enrichment analysis Guangchuang Yu, Li-Gen Wang Jinan University, Guangzhou, China April 21, 2012 1 Introduction Disease Ontology (DO) provides an open source ontology for

More information

Poisson Regression. The Training Data

Poisson Regression. The Training Data The Training Data Poisson Regression Office workers at a large insurance company are randomly assigned to one of 3 computer use training programmes, and their number of calls to IT support during the following

More information

Quantitative genetic (animal) model example in R

Quantitative genetic (animal) model example in R Quantitative genetic (animal) model example in R Gregor Gorjanc gregor.gorjanc@bfro.uni-lj.si April 30, 2018 Introduction The following is just a quick introduction to quantitative genetic model, which

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

Fitting Cox Regression Models

Fitting Cox Regression Models Department of Psychology and Human Development Vanderbilt University GCM, 2010 1 Introduction 2 3 4 Introduction The Partial Likelihood Method Implications and Consequences of the Cox Approach 5 Introduction

More information

Duration of Unemployment - Analysis of Deviance Table for Nested Models

Duration of Unemployment - Analysis of Deviance Table for Nested Models Duration of Unemployment - Analysis of Deviance Table for Nested Models February 8, 2012 The data unemployment is included as a contingency table. The response is the duration of unemployment, gender and

More information

R Hints for Chapter 10

R Hints for Chapter 10 R Hints for Chapter 10 The multiple logistic regression model assumes that the success probability p for a binomial random variable depends on independent variables or design variables x 1, x 2,, x k.

More information

In contrast, parametric techniques (fitting exponential or Weibull, for example) are more focussed, can handle general covariates, but require

In contrast, parametric techniques (fitting exponential or Weibull, for example) are more focussed, can handle general covariates, but require Chapter 5 modelling Semi parametric We have considered parametric and nonparametric techniques for comparing survival distributions between different treatment groups. Nonparametric techniques, such as

More information

Introduction to Statistics and R

Introduction to Statistics and R Introduction to Statistics and R Mayo-Illinois Computational Genomics Workshop (2018) Ruoqing Zhu, Ph.D. Department of Statistics, UIUC rqzhu@illinois.edu June 18, 2018 Abstract This document is a supplimentary

More information

Other likelihoods. Patrick Breheny. April 25. Multinomial regression Robust regression Cox regression

Other likelihoods. Patrick Breheny. April 25. Multinomial regression Robust regression Cox regression Other likelihoods Patrick Breheny April 25 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/29 Introduction In principle, the idea of penalized regression can be extended to any sort of regression

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Introduction to logistic regression

Introduction to logistic regression Introduction to logistic regression Tuan V. Nguyen Professor and NHMRC Senior Research Fellow Garvan Institute of Medical Research University of New South Wales Sydney, Australia What we are going to learn

More information

Lecture 8 Stat D. Gillen

Lecture 8 Stat D. Gillen Statistics 255 - Survival Analysis Presented February 23, 2016 Dan Gillen Department of Statistics University of California, Irvine 8.1 Example of two ways to stratify Suppose a confounder C has 3 levels

More information

ssh tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm

ssh tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm Kedem, STAT 430 SAS Examples: Logistic Regression ==================================== ssh abc@glue.umd.edu, tap sas913, sas https://www.statlab.umd.edu/sasdoc/sashtml/onldoc.htm a. Logistic regression.

More information

Using R in 200D Luke Sonnet

Using R in 200D Luke Sonnet Using R in 200D Luke Sonnet Contents Working with data frames 1 Working with variables........................................... 1 Analyzing data............................................... 3 Random

More information

Leftovers. Morris. University Farm. University Farm. Morris. yield

Leftovers. Morris. University Farm. University Farm. Morris. yield Leftovers SI 544 Lada Adamic 1 Trellis graphics Trebi Wisconsin No. 38 No. 457 Glabron Peatland Velvet No. 475 Manchuria No. 462 Svansota Trebi Wisconsin No. 38 No. 457 Glabron Peatland Velvet No. 475

More information

Survival Prediction Under Dependent Censoring: A Copula-based Approach

Survival Prediction Under Dependent Censoring: A Copula-based Approach Survival Prediction Under Dependent Censoring: A Copula-based Approach Yi-Hau Chen Institute of Statistical Science, Academia Sinica 2013 AMMS, National Sun Yat-Sen University December 7 2013 Joint work

More information

STATS216v Introduction to Statistical Learning Stanford University, Summer Midterm Exam (Solutions) Duration: 1 hours

STATS216v Introduction to Statistical Learning Stanford University, Summer Midterm Exam (Solutions) Duration: 1 hours Instructions: STATS216v Introduction to Statistical Learning Stanford University, Summer 2017 Remember the university honor code. Midterm Exam (Solutions) Duration: 1 hours Write your name and SUNet ID

More information

Lecture 12: Effect modification, and confounding in logistic regression

Lecture 12: Effect modification, and confounding in logistic regression Lecture 12: Effect modification, and confounding in logistic regression Ani Manichaikul amanicha@jhsph.edu 4 May 2007 Today Categorical predictor create dummy variables just like for linear regression

More information

MODULE 6 LOGISTIC REGRESSION. Module Objectives:

MODULE 6 LOGISTIC REGRESSION. Module Objectives: MODULE 6 LOGISTIC REGRESSION Module Objectives: 1. 147 6.1. LOGIT TRANSFORMATION MODULE 6. LOGISTIC REGRESSION Logistic regression models are used when a researcher is investigating the relationship between

More information

Survival Analysis I (CHL5209H)

Survival Analysis I (CHL5209H) Survival Analysis Dalla Lana School of Public Health University of Toronto olli.saarela@utoronto.ca January 7, 2015 31-1 Literature Clayton D & Hills M (1993): Statistical Models in Epidemiology. Not really

More information

Lecture 11. Interval Censored and. Discrete-Time Data. Statistics Survival Analysis. Presented March 3, 2016

Lecture 11. Interval Censored and. Discrete-Time Data. Statistics Survival Analysis. Presented March 3, 2016 Statistics 255 - Survival Analysis Presented March 3, 2016 Motivating Dan Gillen Department of Statistics University of California, Irvine 11.1 First question: Are the data truly discrete? : Number of

More information

Modeling Overdispersion

Modeling Overdispersion James H. Steiger Department of Psychology and Human Development Vanderbilt University Regression Modeling, 2009 1 Introduction 2 Introduction In this lecture we discuss the problem of overdispersion in

More information

MAS3301 / MAS8311 Biostatistics Part II: Survival

MAS3301 / MAS8311 Biostatistics Part II: Survival MAS3301 / MAS8311 Biostatistics Part II: Survival M. Farrow School of Mathematics and Statistics Newcastle University Semester 2, 2009-10 1 13 The Cox proportional hazards model 13.1 Introduction In the

More information

Age 55 (x = 1) Age < 55 (x = 0)

Age 55 (x = 1) Age < 55 (x = 0) Logistic Regression with a Single Dichotomous Predictor EXAMPLE: Consider the data in the file CHDcsv Instead of examining the relationship between the continuous variable age and the presence or absence

More information

SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1

SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1 SPH 247 Statistical Analysis of Laboratory Data April 28, 2015 SPH 247 Statistics for Laboratory Data 1 Outline RNA-Seq for differential expression analysis Statistical methods for RNA-Seq: Structure and

More information

Regression so far... Lecture 21 - Logistic Regression. Odds. Recap of what you should know how to do... At this point we have covered: Sta102 / BME102

Regression so far... Lecture 21 - Logistic Regression. Odds. Recap of what you should know how to do... At this point we have covered: Sta102 / BME102 Background Regression so far... Lecture 21 - Sta102 / BME102 Colin Rundel November 18, 2014 At this point we have covered: Simple linear regression Relationship between numerical response and a numerical

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 7. Models with binary response II GLM (Spring, 2018) Lecture 7 1 / 13 Existence of estimates Lemma (Claudia Czado, München, 2004) The log-likelihood ln L(β) in logistic

More information

Homework 5 - Solution

Homework 5 - Solution STAT 526 - Spring 2011 Homework 5 - Solution Olga Vitek Each part of the problems 5 points 1. Agresti 10.1 (a) and (b). Let Patient Die Suicide Yes No sum Yes 1097 90 1187 No 203 435 638 sum 1300 525 1825

More information

Multistate models and recurrent event models

Multistate models and recurrent event models and recurrent event models Patrick Breheny December 6 Patrick Breheny University of Iowa Survival Data Analysis (BIOS:7210) 1 / 22 Introduction In this final lecture, we will briefly look at two other

More information

Exercise 5.4 Solution

Exercise 5.4 Solution Exercise 5.4 Solution Niels Richard Hansen University of Copenhagen May 7, 2010 1 5.4(a) > leukemia

More information

Logistic Regression - problem 6.14

Logistic Regression - problem 6.14 Logistic Regression - problem 6.14 Let x 1, x 2,, x m be given values of an input variable x and let Y 1,, Y m be independent binomial random variables whose distributions depend on the corresponding values

More information

REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520

REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520 REGRESSION ANALYSIS FOR TIME-TO-EVENT DATA THE PROPORTIONAL HAZARDS (COX) MODEL ST520 Department of Statistics North Carolina State University Presented by: Butch Tsiatis, Department of Statistics, NCSU

More information

9 Generalized Linear Models

9 Generalized Linear Models 9 Generalized Linear Models The Generalized Linear Model (GLM) is a model which has been built to include a wide range of different models you already know, e.g. ANOVA and multiple linear regression models

More information

Part III Measures of Classification Accuracy for the Prediction of Survival Times

Part III Measures of Classification Accuracy for the Prediction of Survival Times Part III Measures of Classification Accuracy for the Prediction of Survival Times Patrick J Heagerty PhD Department of Biostatistics University of Washington 102 ISCB 2010 Session Three Outline Examples

More information

R Output for Linear Models using functions lm(), gls() & glm()

R Output for Linear Models using functions lm(), gls() & glm() LM 04 lm(), gls() &glm() 1 R Output for Linear Models using functions lm(), gls() & glm() Different kinds of output related to linear models can be obtained in R using function lm() {stats} in the base

More information

Statistical aspects of prediction models with high-dimensional data

Statistical aspects of prediction models with high-dimensional data Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by

More information

Multistate models and recurrent event models

Multistate models and recurrent event models Multistate models Multistate models and recurrent event models Patrick Breheny December 10 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/22 Introduction Multistate models In this final lecture,

More information

Lecture 7. Proportional Hazards Model - Handling Ties and Survival Estimation Statistics Survival Analysis. Presented February 4, 2016

Lecture 7. Proportional Hazards Model - Handling Ties and Survival Estimation Statistics Survival Analysis. Presented February 4, 2016 Proportional Hazards Model - Handling Ties and Survival Estimation Statistics 255 - Survival Analysis Presented February 4, 2016 likelihood - Discrete Dan Gillen Department of Statistics University of

More information

Unit 5 Logistic Regression Practice Problems

Unit 5 Logistic Regression Practice Problems Unit 5 Logistic Regression Practice Problems SOLUTIONS R Users Source: Afifi A., Clark VA and May S. Computer Aided Multivariate Analysis, Fourth Edition. Boca Raton: Chapman and Hall, 2004. Exercises

More information

Lecture 7 Time-dependent Covariates in Cox Regression

Lecture 7 Time-dependent Covariates in Cox Regression Lecture 7 Time-dependent Covariates in Cox Regression So far, we ve been considering the following Cox PH model: λ(t Z) = λ 0 (t) exp(β Z) = λ 0 (t) exp( β j Z j ) where β j is the parameter for the the

More information

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data Ronald Heck Class Notes: Week 8 1 Class Notes: Week 8 Probit versus Logit Link Functions and Count Data This week we ll take up a couple of issues. The first is working with a probit link function. While

More information

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key Statistical Methods III Statistics 212 Problem Set 2 - Answer Key 1. (Analysis to be turned in and discussed on Tuesday, April 24th) The data for this problem are taken from long-term followup of 1423

More information

Analysing categorical data using logit models

Analysing categorical data using logit models Analysing categorical data using logit models Graeme Hutcheson, University of Manchester The lecture notes, exercises and data sets associated with this course are available for download from: www.research-training.net/manchester

More information

Truck prices - linear model? Truck prices - log transform of the response variable. Interpreting models with log transformation

Truck prices - linear model? Truck prices - log transform of the response variable. Interpreting models with log transformation Background Regression so far... Lecture 23 - Sta 111 Colin Rundel June 17, 2014 At this point we have covered: Simple linear regression Relationship between numerical response and a numerical or categorical

More information

PAPER 218 STATISTICAL LEARNING IN PRACTICE

PAPER 218 STATISTICAL LEARNING IN PRACTICE MATHEMATICAL TRIPOS Part III Thursday, 7 June, 2018 9:00 am to 12:00 pm PAPER 218 STATISTICAL LEARNING IN PRACTICE Attempt no more than FOUR questions. There are SIX questions in total. The questions carry

More information

Fast Regularization Paths via Coordinate Descent

Fast Regularization Paths via Coordinate Descent August 2008 Trevor Hastie, Stanford Statistics 1 Fast Regularization Paths via Coordinate Descent Trevor Hastie Stanford University joint work with Jerry Friedman and Rob Tibshirani. August 2008 Trevor

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

Regression Methods for Survey Data

Regression Methods for Survey Data Regression Methods for Survey Data Professor Ron Fricker! Naval Postgraduate School! Monterey, California! 3/26/13 Reading:! Lohr chapter 11! 1 Goals for this Lecture! Linear regression! Review of linear

More information

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/ Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/28.0018 Statistical Analysis in Ecology using R Linear Models/GLM Ing. Daniel Volařík, Ph.D. 13.

More information

STAT 526 Spring Midterm 1. Wednesday February 2, 2011

STAT 526 Spring Midterm 1. Wednesday February 2, 2011 STAT 526 Spring 2011 Midterm 1 Wednesday February 2, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples. Regression models Generalized linear models in R Dr Peter K Dunn http://www.usq.edu.au Department of Mathematics and Computing University of Southern Queensland ASC, July 00 The usual linear regression

More information

Week 7 Multiple factors. Ch , Some miscellaneous parts

Week 7 Multiple factors. Ch , Some miscellaneous parts Week 7 Multiple factors Ch. 18-19, Some miscellaneous parts Multiple Factors Most experiments will involve multiple factors, some of which will be nuisance variables Dealing with these factors requires

More information

Generalised linear models. Response variable can take a number of different formats

Generalised linear models. Response variable can take a number of different formats Generalised linear models Response variable can take a number of different formats Structure Limitations of linear models and GLM theory GLM for count data GLM for presence \ absence data GLM for proportion

More information

Neural networks (not in book)

Neural networks (not in book) (not in book) Another approach to classification is neural networks. were developed in the 1980s as a way to model how learning occurs in the brain. There was therefore wide interest in neural networks

More information

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46 A Generalized Linear Model for Binomial Response Data Copyright c 2017 Dan Nettleton (Iowa State University) Statistics 510 1 / 46 Now suppose that instead of a Bernoulli response, we have a binomial response

More information

High-dimensional regression

High-dimensional regression High-dimensional regression Advanced Methods for Data Analysis 36-402/36-608) Spring 2014 1 Back to linear regression 1.1 Shortcomings Suppose that we are given outcome measurements y 1,... y n R, and

More information

Advanced Quantitative Data Analysis

Advanced Quantitative Data Analysis Chapter 24 Advanced Quantitative Data Analysis Daniel Muijs Doing Regression Analysis in SPSS When we want to do regression analysis in SPSS, we have to go through the following steps: 1 As usual, we choose

More information

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps

More information

BMI 541/699 Lecture 22

BMI 541/699 Lecture 22 BMI 541/699 Lecture 22 Where we are: 1. Introduction and Experimental Design 2. Exploratory Data Analysis 3. Probability 4. T-based methods for continous variables 5. Power and sample size for t-based

More information

Introduction to the Generalized Linear Model: Logistic regression and Poisson regression

Introduction to the Generalized Linear Model: Logistic regression and Poisson regression Introduction to the Generalized Linear Model: Logistic regression and Poisson regression Statistical modelling: Theory and practice Gilles Guillot gigu@dtu.dk November 4, 2013 Gilles Guillot (gigu@dtu.dk)

More information

Interactions in Logistic Regression

Interactions in Logistic Regression Interactions in Logistic Regression > # UCBAdmissions is a 3-D table: Gender by Dept by Admit > # Same data in another format: > # One col for Yes counts, another for No counts. > Berkeley = read.table("http://www.utstat.toronto.edu/~brunner/312f12/

More information

Follow-up data with the Epi package

Follow-up data with the Epi package Follow-up data with the Epi package Summer 2014 Michael Hills Martyn Plummer Bendix Carstensen Retired Highgate, London International Agency for Research on Cancer, Lyon plummer@iarc.fr Steno Diabetes

More information