Machine Learning. Module 3-4: Regression and Survival Analysis Day 2, Asst. Prof. Dr. Santitham Prom-on

Machine Learning Module 3-4: Regression and Survival Analysis Day 2, 9.00-16.00 Asst. Prof. Dr. Santitham Prom-on Department of Computer Engineering, Faculty of Engineering, King Mongkut's University of Technology Thonburi

Module 3 Overview Linear regression Poisson regression Survival analysis

Simple linear regression Simple linear regression allows us to summarize and study relationships between two continuous (quantitative) variables: One variable, denoted x, is regarded as the predictor, explanatory, or independent variable. The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

Types of relationship Deterministic Statistical

Example of statistical relationships Height and weight Alcohol consumed and blood alcohol content Amount spent and number of purchases Number of exercise sessions per week and weight lost Credit line and average credit usage

Assessing statistical linear relationship Graphical Scatter plot Statistics Correlation

Correlation coefficient The correlation coefficient computed from the sample data measures the strength and direction of a linear relationship between two variables. The symbol for the sample correlation coefficient is r. The symbol for the population correlation coefficient is ρ. There are several types of correlation coefficients. The one explained in this section is called the Pearson product moment correlation coefficient (PPMC)

Pearson product moment correlation Given n pairs of observations (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) It is natural to speak of x and y having a positive relationship if large x's are paired with large y's and small x's with small y's On the contrary, if large x's are paired with small y's and small x's with large y's, then a negative relationship between the variables is implied

Pearson product moment correlation Consider the quantity s_xy = Σ_{i=1}^{n} (x_i - x̄)(y_i - ȳ) Then, if the relationship is strongly positive, an x_i above the mean will tend to be paired with a y_i above the mean, so that (x_i - x̄)(y_i - ȳ) > 0, and this product will also be positive whenever both x_i and y_i are below their means

Pearson product moment correlation To make this measure dimensionless, we divide as follows: r = s_xy / √(s_xx s_yy) = Σ_{i=1}^{n} (x_i - x̄)(y_i - ȳ) / √(Σ_{i=1}^{n} (x_i - x̄)² · Σ_{i=1}^{n} (y_i - ȳ)²) A more convenient form for this equation is r = [n Σ x_i y_i - (Σ x_i)(Σ y_i)] / √([n Σ x_i² - (Σ x_i)²][n Σ y_i² - (Σ y_i)²])
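As a quick sketch (with made-up data), both forms of the formula can be computed with numpy and checked against numpy's built-in `corrcoef`:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])
n = len(x)

# Definitional form: r = s_xy / sqrt(s_xx * s_yy)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
sxx = np.sum((x - x.mean()) ** 2)
syy = np.sum((y - y.mean()) ** 2)
r_def = sxy / np.sqrt(sxx * syy)

# Computational (sums of raw data) form of the same quantity
r_comp = (n * np.sum(x * y) - x.sum() * y.sum()) / np.sqrt(
    (n * np.sum(x ** 2) - x.sum() ** 2) * (n * np.sum(y ** 2) - y.sum() ** 2)
)
```

Both values agree with each other and with `np.corrcoef(x, y)[0, 1]`, since the two formulas are algebraically identical.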

Correlation coefficient and scatter plot The range of the correlation coefficient is from -1 to 1. If there is a strong positive linear relationship between the variables, the value of r will be close to 1. If there is a strong negative linear relationship, the value of r will be close to -1. When there is no linear relationship between the variables or only a weak one, the value of r will be close to 0

Strong correlation? A frequently asked question is: when can it be said that there is a strong correlation between variables, and when is the correlation weak? A reasonable rule of thumb is to say that the correlation is weak if 0 ≤ |r| ≤ 0.5, strong if 0.8 ≤ |r| ≤ 1, and moderate otherwise

Correlation and shape

Python: Loading data

Pairwise Correlation

Correlation and scatter plot Negative relationship, mpg vs wt

Correlation and scatter plot Positive relationship, hp vs disp

Correlation and scatter plot No relationship, qsec vs drat

Scatter and trend Since we are interested in summarizing the trend between two quantitative variables, the natural question arises: "what is the best fitting line?" A scatter plot shows the distribution of points and potentially the trend in the data Look at the figures on the next pages: which line do you think best summarizes the trend between height and weight?

Line equation y_i denotes the observed response for experimental unit i x_i denotes the predictor value for experimental unit i ŷ_i is the predicted response (or fitted value) for experimental unit i The equation for the best fitting line is: ŷ_i = b_0 + b_1 x_i

Regression parameters Given a value x_i, how do we interpret b_0 and b_1? b_1 tells us: if the value we saw for x was one unit bigger, how much would our prediction for y change? b_0 tells us: what would we predict for y if x = 0?

Residuals must be normally distributed with zero mean

Best line

Error In general, when we use ŷ_i = b_0 + b_1 x_i to predict the actual response y_i, we make a prediction error (or residual error) of size: e_i = y_i - ŷ_i A line that fits the data "best" will be one for which the n prediction errors, one for each observed data point, are as small as possible in some overall sense.

Method of least squares Choose the b's so that the sum of the squares of the errors, e_i, is minimized The error function is E = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i - (b_0 + b_1 x_i))²
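Setting the partial derivatives of E with respect to b_0 and b_1 to zero gives the closed-form least squares solution b_1 = s_xy / s_xx and b_0 = ȳ - b_1 x̄. A minimal numpy sketch with made-up data, cross-checked against numpy's own least-squares fit:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Minimizing E = sum((y_i - b0 - b1*x_i)^2) yields the closed form:
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Cross-check against numpy's degree-1 polynomial least-squares fit
b1_np, b0_np = np.polyfit(x, y, 1)
```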

Ordinary least squares solution The minimum of a function is the point where the slope is zero (Figure: the error E(α) plotted against the parameter α, with the minimum where the slope is zero)

Coefficient of determination (r 2 ) The coefficient of determination is a number that indicates the proportion of the variance in the dependent variable that is predictable from the independent variables

Coefficient of determination The coefficient of determination R² (or sometimes r²) is another measure of how well the least squares equation Y = α + βx performs as a predictor of y R² is computed as: R² = (SS_yy - SSE) / SS_yy = 1 - SSE / SS_yy R² measures the relative sizes of SS_yy and SSE. The smaller SSE, the more reliable the predictions obtained from the model.

Coefficient of determination SS_yy measures the deviation of the observations from their mean: SS_yy = Σ_i (y_i - ȳ)² SSE measures the deviation of the observations from their predicted values: SSE = Σ_i (y_i - Ŷ_i)²

Coefficient of determination The higher the R 2, the more useful the model R 2 takes on values between 0 and 1 Essentially, R 2 tells us how much better we can do in predicting y by using the model and computing Y than by just using the mean of y as a predictor. Note that when we use the model and compute Y the prediction depends on X because Y = α + βx. Thus, we act as if x contains information about y. If we just use the mean of y to predict y, then we are saying that x does not contribute information about y and thus our predictions of y do not depend on x.
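A short numpy sketch (made-up data) of how R² is computed from SS_yy and SSE, together with a check of the fact that in simple linear regression R² equals the squared Pearson correlation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit the least squares line
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

ss_yy = np.sum((y - y.mean()) ** 2)   # total variation around the mean
sse = np.sum((y - y_hat) ** 2)        # variation left after using the model
r2 = 1 - sse / ss_yy

r = np.corrcoef(x, y)[0, 1]           # for simple regression, r2 == r**2
```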

Python: Separate X and Y

Simple linear regression No intercept

Coefficients Estimates: the fitted values of the coefficients Standard errors: the estimated standard deviation of each coefficient estimate, i.e., how much the estimate would vary across repeated samples t-value: a measure of how many standard deviations our coefficient estimate is away from 0 Pr(>|t|): the probability of observing an estimate this far from 0 by chance alone
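These quantities follow the standard textbook formulas for simple regression: the standard error of the slope is s/√S_xx with s² = SSE/(n - 2), and the t-value is the estimate divided by its standard error. A sketch with made-up data (a near-perfect linear relationship, so the t-value comes out large):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.2, 3.8, 6.1, 7.9, 10.2, 11.8])
n = len(x)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

s = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # residual standard error
se_b1 = s / np.sqrt(sxx)                       # standard error of the slope
t_value = b1 / se_b1                           # distance of b1 from 0 in SEs
```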

Simple linear regression With intercept

Residual and R² The Residual Standard Error is the average amount that the response will deviate from the true regression line. In multiple regression settings, the R² will always increase as more variables are included in the model. That's why the adjusted R² is the preferred measure, as it adjusts for the number of variables considered.

F-statistic The F-statistic is a good indicator of whether there is a relationship between our predictors and the response variable. The further the F-statistic is from 1, the better.

R: prediction with linear regression

Root mean square error
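The root mean square error is the square root of the average squared prediction error; a minimal sketch with made-up observed values and predictions:

```python
import numpy as np

# Made-up observed values and model predictions
y_true = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
y_pred = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# RMSE: square root of the mean squared prediction error
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
```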

Multiple linear regression We move from the simple linear regression model with one predictor to the multiple linear regression model with two or more predictors. That is, we use the adjective "simple" to denote that our model has only one predictor, and we use the adjective "multiple" to indicate that our model has at least two predictors

Multiple regression Additive model, no interaction The multiple linear regression model structure is exactly the same as simple linear regression: f(x) = w_0 + w_1 x_1 + w_2 x_2 + ... Mathematically, the parameters are obtained by the least squares method
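A minimal numpy sketch of least squares with two predictors (synthetic, noiseless data, so the true weights are recovered exactly; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 3.0 * x2   # noiseless: true weights are (1, 2, -3)

# Design matrix with an intercept column; solve the least squares problem
X = np.column_stack([np.ones(n), x1, x2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```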

Multiple regression additive model

Interaction in multiple regression Adding interaction terms to a regression model can greatly expand understanding of the relationships among the variables in the model An interaction occurs when the effect of one variable on the outcome depends on another variable For example, drug and alcohol may interact and create an additional (adverse) effect Another example: credit card type and number of active months may interact and lead to different spending results
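An interaction can be modeled by including the product of two predictors as an extra column in the design matrix; a sketch on synthetic data (names and coefficients here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# The effect of x1 on y depends on x2 through the product term x1*x2
y = 1.0 + 2.0 * x1 - 3.0 * x2 + 1.5 * x1 * x2

# Add the interaction as an extra column and fit by least squares as before
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```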

Adding interaction term

Multiple regression multiplicative model (interaction)

Survival analysis

Logistic regression vs time In logistic regression, we were interested in studying how risk factors were associated with presence or absence of disease. Sometimes, we are interested in how a risk factor or treatment affects time to disease or some other event. In these cases, logistic regression is not appropriate.

Survival analysis Survival analysis is used to analyze data in which the time until the event is of interest. The response is often referred to as a failure time, survival time, or event time.

Example: Time until tumor recurrence Time until a machine part fails Time until mobile phone recharge Time until the next credit card usage

The survival time response Usually continuous May be incompletely determined for some subjects: for some we may know only that their survival time was at least equal to some time t, whereas for others we will know the exact time of the event Incompletely observed responses are censored Is always ≥ 0

Analysis issue If there is no censoring, standard linear regression procedures could be used. However, these may be inadequate because Time to event is restricted to be positive and has a skewed distribution. The probability of surviving past a certain point in time may be of more interest than the expected time of event. The hazard function, used for regression in survival analysis, can lend more insight into the failure mechanism than linear regression.

Censoring Censoring is present when we have some information about a subject's event time, but we don't know the exact event time. For the analysis methods we will discuss to be valid, the censoring mechanism must be independent of the survival mechanism.

Reasons censoring might occur A subject does not experience the event before the study ends A person is lost to follow-up during the study period A person withdraws from the study These are all examples of right-censoring.

Terminology and notation

Survival function The survival function S(t) gives the probability that a subject will survive past time t. As t ranges from 0 to ∞, the survival function has the following properties It is non-increasing At time t = 0, S(0) = 1. In other words, the probability of surviving past time 0 is 1 As t → ∞, S(t) → 0. As time goes to infinity, the survival curve goes to 0 In theory, the survival function is smooth. In practice, we observe events on a discrete time scale (days, weeks, etc.).

Survival data

Survival analysis data

Kaplan-Meier estimator
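The product-limit estimator itself is simple enough to sketch in plain numpy: at each observed event time, S(t) is multiplied by (1 - d_i/n_i), where d_i is the number of events and n_i the number still at risk. The data below are made up, and tied event times are assumed away for brevity:

```python
import numpy as np

# Made-up survival data: observed time and event indicator (1=event, 0=censored)
times = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
events = np.array([1, 0, 1, 1, 0])

order = np.argsort(times)
times, events = times[order], events[order]

surv = 1.0
km = {}                  # estimated S(t) just after each observed time
n_at_risk = len(times)
for t, e in zip(times, events):
    if e == 1:
        surv *= 1.0 - 1.0 / n_at_risk   # d_i = 1 here (no tied times)
    n_at_risk -= 1       # subject leaves the risk set (event or censored)
    km[t] = surv
```

Note that a censoring time shrinks the risk set but does not drop the curve.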

Multiple groups

Parametric survival functions The Kaplan-Meier estimator is a very useful tool for estimating survival functions. Sometimes, we may want to make more assumptions that allow us to model the data in more detail.

Benefit of using parametric survival functions By specifying a parametric form for S(t), we can easily compute selected quantiles of the distribution estimate the expected failure time derive a concise equation and smooth function for estimating S(t), H(t) and h(t) estimate S(t) more precisely than KM, assuming the parametric form is correct!
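For example, under an exponential model (an assumed, illustrative parametric form, not one named on the slides) S(t) = exp(-λt), the hazard is constant and quantiles and the expected failure time have closed forms:

```python
import numpy as np

lam = 0.5                  # assumed hazard rate, chosen for illustration

def S(t):                  # survival function S(t) = exp(-lam * t)
    return np.exp(-lam * t)

def h(t):                  # hazard function, constant for this model
    return lam

median = np.log(2) / lam   # solve S(t) = 0.5 for t
mean_t = 1.0 / lam         # expected failure time of the exponential
```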

Cox proportional hazards regression model The Cox PH model is a semiparametric model: it makes no assumptions about the form of h(t) (the nonparametric part of the model), but assumes a parametric form for the effect of the predictors on the hazard In most situations, we are more interested in the parameter estimates than in the shape of the hazard. The Cox PH model is well suited to this goal.

Survival regression While the KaplanMeierFitter above is useful, it only gives us an average view of the population. Often we have specific data at the individual level, either continuous or categorical, that we would like to use. For this, we turn to survival regression, specifically CoxPHFitter.
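For a single covariate and no tied event times, the Cox partial log-likelihood can be maximized with a few Newton-Raphson steps. This is a from-scratch sketch on made-up data, not the slides' CoxPHFitter example:

```python
import numpy as np

# Made-up data: time, event indicator (1=event, 0=censored), one covariate
t = np.array([2.0, 3.0, 5.0, 7.0, 11.0, 13.0])
d = np.array([1, 1, 0, 1, 1, 0])
x = np.array([0.5, 1.0, -0.3, 0.8, -1.2, 0.1])

def grad_hess(beta):
    """Score and curvature of the Cox partial log-likelihood (no ties)."""
    g, h = 0.0, 0.0
    for i in range(len(t)):
        if d[i] == 0:
            continue                       # censored subjects add no term
        risk = t >= t[i]                   # risk set at this event time
        w = np.exp(beta * x[risk])
        m1 = np.sum(w * x[risk]) / np.sum(w)       # risk-set weighted mean
        m2 = np.sum(w * x[risk] ** 2) / np.sum(w)  # weighted second moment
        g += x[i] - m1
        h -= m2 - m1 ** 2                  # minus the weighted variance
    return g, h

beta = 0.0
for _ in range(25):                        # Newton-Raphson iterations
    g, h = grad_hess(beta)
    beta -= g / h
```

At convergence the score is zero and the curvature is negative, confirming a maximum of the partial likelihood.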

Python: Load survival regression data

Survival regression

Survival regression result

Result

Thank you Questions?