1 / 23 Recap
HW due Thursday by 5 pm; next HW coming on Thursday
Logistic regression: Pr(G = k | X) is linear on the logit scale
Linear discriminant analysis: Pr(G = k | X) ∝ Pr(X | G = k) Pr(G = k)
Theory: LDA is more efficient if X really is normally distributed
Practice: LR is usually viewed as more robust, but HTF claim the prediction errors are very similar
Data: LR is more complicated with K > 2; use multinom (from the nnet package in the VR/MASS bundle)
The deviance of a generalized linear model (such as LR) is −2 log L(β̂) + constant
Comparing the deviances of two nested fits is a log-likelihood ratio test that the corresponding parameters are 0
AIC instead compares −2 log L(β̂) + 2p
See heart-dump.txt, which uses dummy variables for the inputs (board)
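
(Aside, a minimal sketch, not from the lecture: the quantities above can be reproduced in R on simulated data. MASS::lda() and glm(..., family=binomial) play the roles of LDA and LR; the data and object names below are made up.)

## Sketch: LR vs LDA on simulated two-class data, plus deviance and AIC.
library(MASS)                                   # lda()
set.seed(1)
n <- 200
g <- factor(rep(c("A", "B"), each = n/2))
x <- rnorm(n, mean = ifelse(g == "B", 1, 0))    # X | G is normal, so LDA assumptions hold

fit.lr  <- glm(g ~ x, family = binomial)        # Pr(G = B | X) linear on the logit scale
fit.lda <- lda(g ~ x)                           # models Pr(X | G = k) Pr(G = k)

deviance(fit.lr)                                # -2 log L(beta.hat) + constant
AIC(fit.lr)                                     # -2 log L(beta.hat) + 2p

## the two classifiers typically agree on most points here
table(lr  = ifelse(fitted(fit.lr) > 0.5, "B", "A"),
      lda = predict(fit.lda)$class)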

2 / 23 Separating hyperplanes (§4.5)
Assume two classes only; change notation so that y = ±1
Use linear combinations of the inputs to predict y:
  y = −1 if β_0 + x^T β < 0,   y = +1 if β_0 + x^T β > 0
Misclassification error: D(β, β_0) = −Σ_{i ∈ M} y_i(β_0 + x_i^T β), where M = {i : y_i(β_0 + x_i^T β) < 0}
Note that D(β, β_0) ≥ 0 and is proportional to the distances of the misclassified points from the boundary β_0 + x^T β = 0
Can show that an algorithm (the perceptron) to minimize D(β, β_0) exists and converges to a plane that separates y = +1 from y = −1, if such a plane exists
But it will cycle if no such plane exists, and it can be very slow if the gap is small

3 / 23 Also, if one separating plane exists, there are likely many (Figure 4.14)
The plane that defines the largest gap is defined to be the best
Can show that this plane solves
  min_{β_0, β} (1/2)‖β‖²   subject to   y_i(β_0 + x_i^T β) ≥ 1, i = 1, ..., N   (4.48)
See Figure 4.16
The points on the edges (margin) of the gap are called support points or support vectors; there are typically many fewer of these than original points
This is the basis for the development of Support Vector Machines (SVMs), more later
Sometimes we add features by using basis expansions; to be discussed first in the context of regression (Chapter 5)
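
(A hedged sketch of the algorithm referred to above: perceptron-style updates that decrease D(β, β_0) on simulated separable data. The step size and stopping rule are assumptions, not taken from the text.)

## Sketch: minimizing D(beta, beta0) = -sum_{i in M} y_i (beta0 + x_i' beta)
set.seed(2)
n <- 100
x <- cbind(rnorm(n), rnorm(n))
y <- ifelse(x[, 1] + x[, 2] > 0, 1, -1)          # classes separable by construction
x <- x + 0.3 * y                                 # widen the gap a little

beta0 <- 0; beta <- c(0, 0); rho <- 0.1          # rho = learning rate (assumed)
for (iter in 1:1000) {
  M <- which(y * (beta0 + x %*% beta) <= 0)      # currently misclassified points
  if (length(M) == 0) break                      # a separating plane has been found
  i <- M[sample(length(M), 1)]                   # update using one misclassified point
  beta0 <- beta0 + rho * y[i]
  beta  <- beta  + rho * y[i] * x[i, ]
}
c(beta0 = beta0, beta)                           # one separating plane (not unique)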

4 / 23 Flexible modelling using basis expansions (Chapter 5)
Linear regression: y = Xβ + ε, ε ∼ (0, σ²)
Smooth regression: y = f(X) + ε, with f(X) = E(Y | X) to be specified
Flexible linear modelling: f(X) = Σ_{m=1}^M β_m h_m(X)
This is called a linear basis expansion, and h_m is the mth basis function
For example, if X is one-dimensional: f(X) = β_0 + β_1 X + β_2 X², or f(X) = β_0 + β_1 sin(X) + β_2 cos(X), etc.
Simple linear regression has h_1(X) = 1, h_2(X) = X
Several other examples on p. 140
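
(A short hedged sketch of what "linear in the basis functions" means in practice: the h_m(X) simply become columns in an ordinary least squares fit. Simulated data; the sin/cos basis is just for illustration.)

## Sketch: a linear basis expansion fit by ordinary least squares.
set.seed(3)
x <- runif(100, 0, 2 * pi)
y <- 2 + sin(x) - 0.5 * cos(x) + rnorm(100, sd = 0.3)

## h1(X) = 1 (intercept), h2(X) = sin(X), h3(X) = cos(X)
fit <- lm(y ~ sin(x) + cos(x))
coef(fit)                          # estimates of beta_0, beta_1, beta_2

## equivalently, build the basis matrix explicitly and regress on it
H <- cbind(sin(x), cos(x))
coef(lm(y ~ H))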

5 / 23 Polynomial fits: h_j(x) = x^j, j = 0, ..., m
Fit using linear regression with design matrix X, where X_ij = h_j(x_i)
The justification is that any smooth function can be approximated by a polynomial expansion (Taylor series)
Can be difficult to fit numerically, as the correlation between columns can be large
May be useful locally, but less likely to work over the whole range of X (rafal.pdf)
Example: f(x) = x² sin(2x), ε ∼ N(0, 2), n = 200
It turns out to be easier to find a good set of basis functions if we only consider small segments of the X-space (local fits)
Need to be careful not to overfit, since each local fit uses only a fraction of the data
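
(A hedged sketch of the simulated example above; the x-range, the reading of N(0, 2) as variance 2, and the plotting details are assumptions, since rafal.pdf is not reproduced here.)

## Sketch: f(x) = x^2 sin(2x) with n = 200, fit by global polynomials.
set.seed(4)
n <- 200
x <- sort(runif(n, 0, 2 * pi))
f <- x^2 * sin(2 * x)
y <- f + rnorm(n, sd = sqrt(2))

plot(x, y, col = "grey")
lines(x, f, lwd = 2)                     # true curve
for (d in c(3, 6, 10)) {
  fit <- lm(y ~ poly(x, d))              # orthogonal polynomials ease the numerical problems
  lines(x, fitted(fit), lty = 2)         # higher degree tracks f better, but only up to a point
}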

6 / 23 Piecewise polynomials
Piecewise constant: h_1(X) = I(X < ξ_1), h_2(X) = I(ξ_1 ≤ X < ξ_2), h_3(X) = I(ξ_2 ≤ X); corresponds to fitting by local averaging
Similarly for piecewise linear fits (Figure 5.1), but with constraints to make the fit continuous at the break points: h_1(X) = 1, h_2(X) = X, h_3(X) = (X − ξ_1)_+, h_4(X) = (X − ξ_2)_+
Windows are defined by the knots ξ_1, ξ_2, ...
To fit a cubic polynomial in each window: e.g. a_1 + b_1 X + c_1 X² + d_1 X³ in the first window, a_2 + b_2 X + c_2 X² + d_2 X³ in the second, etc., with basis functions {1_{w1}, X_{w1}, X²_{w1}, X³_{w1}}, {1_{w2}, X_{w2}, X²_{w2}, X³_{w2}}, etc.
Now require f^(k)_{w1}(ξ_1) = f^(k)_{w2}(ξ_1) etc.; this puts constraints on the coefficients a, b, c, d
Cubic splines require a continuous function, first derivative, and second derivative at the knots (Figure 5.2)

7 / 23 The constraints on the derivatives can be incorporated into the basis:
  {1, X, X², X³, (X − ξ_1)³_+, ..., (X − ξ_K)³_+}  — the truncated power basis
Procedure:
  choose the number (K) and placement of the knots ξ_1, ..., ξ_K
  construct the X matrix using the truncated power basis set
  run linear regression, with ?? degrees of freedom
Example: heart failure data from Chapter 4 (462 observations, 10 covariates)
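
(A hedged sketch of the three steps above for a single predictor; the data and knot locations are made up, not the heart data.)

## Sketch: build the truncated power basis and fit it by linear regression.
set.seed(5)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

knots <- c(2.5, 5, 7.5)                             # step 1: choose K and the placement
tpb <- function(x, knots) {                         # step 2: construct the basis matrix
  X <- cbind(x, x^2, x^3)
  for (k in knots) X <- cbind(X, pmax(x - k, 0)^3)  # (x - xi_k)_+^3
  X
}
fit <- lm(y ~ tpb(x, knots))                        # step 3: run the linear regression
length(coef(fit))                                   # 1 (intercept) + 3 + K = 7 coefficients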

8 / 23
> bs.sbp <- bs(hr$sbp, df=4)    # the B-spline basis
> dim(bs.sbp)                   # this is the basis matrix
[1] 462   4
> bs.sbp[1:4,]
              1         2          3           4
[1,] 0.16968090 0.4511590 0.34950621 0.029653925
[2,] 0.35240669 0.4758851 0.17002102 0.001687183
[3,] 0.71090498 0.1642420 0.01087580 0.000000000
[4,] 0.09617733 0.3769808 0.44812470 0.078717201

9 / 23 The B-spline basis is hard to describe explicitly (see the Appendix to Ch. 5), but can be shown to be equivalent to the truncated power basis
In R, library(splines):
  bs(x, df=NULL, knots=NULL, degree=3, intercept=FALSE, Boundary.knots=range(x))
Must specify either df or knots. For the B-spline basis, # knots = df - degree (degree is usually 3: see ?bs). The knots are fixed, even if you use df (see R code)
Natural cubic splines have better endpoint behaviour (linear beyond the boundary knots) (§5.2.1):
  ns(x, df=NULL, knots=NULL, intercept=FALSE, Boundary.knots=range(x))
For natural cubic splines, # knots = df - 1
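
(The df/knots bookkeeping can be checked directly; a small sketch, using an arbitrary numeric vector.)

## Sketch: interior knots chosen by bs() and ns() when only df is given.
library(splines)
x <- mtcars$mpg                 # any numeric vector will do

attr(bs(x, df = 6), "knots")    # B-spline:       df - degree = 6 - 3 = 3 interior knots
attr(ns(x, df = 6), "knots")    # natural spline: df - 1      = 5 interior knots
dim(bs(x, df = 6))              # 32 rows (length of x), 6 columns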

10 / 23 Regression splines (p. 144) refers to using these basis matrices in a regression model.

> ns.hr.glm <- glm(chd ~ ns.sbp + ns.tobacco + ns.ldl + famhist + ns.obesity +
+                  ns.alcohol + ns.age, family=binomial, data=hr)
> summary(ns.hr.glm)

Call:
glm(formula = chd ~ ns.sbp + ns.tobacco + ns.ldl + famhist + ns.obesity +
    ns.alcohol + ns.age, family = binomial, data = hr)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.7245 -0.8265 -0.3884  0.8870  2.9588

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     -2.1158     2.4067  -0.879 0.379324
ns.sbp1         -1.4794     0.8440  -1.753 0.079641 .
ns.sbp2         -1.3197     0.7632  -1.729 0.083769 .
ns.sbp3         -3.7537     2.0230  -1.856 0.063520 .
ns.sbp4          1.3973     1.0037   1.392 0.163884
ns.tobacco1      0.6495     0.4586   1.416 0.156691
ns.tobacco2      0.4181     0.9031   0.463 0.643397
ns.tobacco3      3.3626     1.1873   2.832 0.004625 **
ns.tobacco4      3.8534     2.3769   1.621 0.104976
ns.ldl1          1.8688     1.3266   1.409 0.158933
ns.ldl2          1.7217     1.0320   1.668 0.095248 .
ns.ldl3          4.5209     2.9986   1.508 0.131643
ns.ldl4          3.3454     1.4523   2.304 0.021249 *
famhistpresent   1.0787     0.2389   4.515 6.34e-06 ***
ns.obesity1     -3.1058     1.7187  -1.807 0.070748 .
ns.obesity2     -2.3753     1.2042  -1.972 0.048555 *
ns.obesity3     -5.0541     3.8205  -1.323 0.185871

11 / 23 The individual coefficients don't mean anything on their own; we need to evaluate groups of coefficients. We can do this with successive likelihood ratio tests, by hand, e.g.

> glm(hr$chd ~ ns.sbp + ns.tobacco + ns.ldl + hr$famhist + ns.obesity + ns.age, family=binomial)

Call:  glm(formula = hr$chd ~ ns.sbp + ns.tobacco + ns.ldl + hr$famhist + ns.obesity + ns.age, family = binomial)

Coefficients:
(Intercept)      ns.sbp1      ns.sbp2
  -2.265534    -1.474172    -1.351182
    ns.sbp3      ns.sbp4  ns.tobacco1
  -3.729348     1.381701     0.654109
ns.tobacco2  ns.tobacco3  ns.tobacco4
   0.392582     3.335170     3.845611
    ns.ldl1      ns.ldl2      ns.ldl3
   1.921215     1.783272     4.623680
... stuff omitted

Degrees of Freedom: 461 Total (i.e. Null);  440 Residual
Null Deviance:     596.1
Residual Deviance: 458.1    AIC: 502.1

12 / 23
> update(.Last.value, . ~ . - ns.sbp)

Call:  glm(formula = hr$chd ~ ns.tobacco + ns.ldl + hr$famhist + ns.obesity + ns.age, family = binomial)

Coefficients:
      (Intercept)       ns.tobacco1       ns.tobacco2
         -3.91758           0.61696           0.46188
      ns.tobacco3       ns.tobacco4           ns.ldl1
          3.51363           3.82464           1.70945
          ns.ldl2           ns.ldl3           ns.ldl4
          1.70659           4.19515           2.90793
hr$famhistpresent       ns.obesity1       ns.obesity2
          0.99053          -2.93143          -2.32793
      ns.obesity3       ns.obesity4           ns.age1
         -4.87074          -0.01103           2.52772
          ns.age2           ns.age3           ns.age4
          3.12963           7.34899           1.53433

Degrees of Freedom: 461 Total (i.e. Null);  444 Residual
Null Deviance:     596.1
Residual Deviance: 467.2    AIC: 503.2

> 467.2 - 458.1
[1] 9.1
> pchisq(9.1, df=4)
[1] 0.941352
> 1 - .Last.value
[1] 0.05864798          # compare Table 5.1

See Figure 5.4
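
(The same comparison can also be done with anova(); a hedged sketch on simulated data, since the heart data are not reloaded here. The variable names and simulation settings are made up.)

## Sketch: likelihood ratio test for dropping a 4-df spline term via anova().
library(splines)
set.seed(6)
n   <- 462
sbp <- rnorm(n, 130, 20); age <- rnorm(n, 45, 10)
chd <- rbinom(n, 1, plogis(-4 + 0.08 * age))        # sbp has no real effect here

full  <- glm(chd ~ ns(sbp, 4) + ns(age, 4), family = binomial)
small <- update(full, . ~ . - ns(sbp, 4))
anova(small, full, test = "Chisq")                  # same idea as the 467.2 - 458.1 step above
pchisq(deviance(small) - deviance(full), df = 4, lower.tail = FALSE)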

13 / 23 The function step does all this for you:

> step(ns.hr.glm)
Start:  AIC=509.63
chd ~ ns(sbp, 4) + ns(tobacco, 4) + ns(ldl, 4) + famhist + ns(obesity, 4) +
    ns(alcohol, 4) + ns(age, 4)

                  Df Deviance    AIC
- ns(alcohol, 4)   4   458.09 502.09
- ns(obesity, 4)   4   465.41 509.41
<none>                 457.63 509.63
- ns(sbp, 4)       4   466.77 510.77
- ns(tobacco, 4)   4   469.61 513.61
- ns(ldl, 4)       4   470.90 514.90
- ns(age, 4)       4   480.37 524.37
- famhist          1   478.76 528.76

Step:  AIC=502.09
chd ~ ns(sbp, 4) + ns(tobacco, 4) + ns(ldl, 4) + famhist + ns(obesity, 4) + ns(age, 4)

                  Df Deviance    AIC
<none>                 458.09 502.09
- ns(obesity, 4)   4   466.24 502.24
- ns(sbp, 4)       4   467.16 503.16
- ns(tobacco, 4)   4   470.48 506.48
- ns(ldl, 4)       4   472.39 508.39
- ns(age, 4)       4   481.86 517.86
- famhist          1   479.44 521.44

Call:  glm(formula = chd ~ ns(sbp, 4) + ns(tobacco, 4) + ns(ldl, 4) + famhist + ns(obesity, 4) + ns(age, 4), family = binomial, data = hr)

See Table 5.1

14 / 23 Notes
1. The degrees of freedom fitted are the number of columns in the basis matrix (+1 for the intercept). This can also be computed as the trace of the hat matrix, which can be extracted from lm.
2. There is something analogous for glm, because glms are fitted using iteratively reweighted least squares.
3. Skip §5.2.3 and §5.3 (for now at least).
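
(A hedged sketch of note 1 on simulated data: for a regression spline fitted by lm, the trace of the hat matrix equals the number of basis columns plus one for the intercept.)

## Sketch: effective degrees of freedom as the trace of the hat matrix.
library(splines)
set.seed(7)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.2)

fit <- lm(y ~ ns(x, df = 5))
sum(hatvalues(fit))             # trace of the hat matrix = 5 basis columns + 1 intercept = 6
sum(influence(fit)$hat)         # the same quantity, extracted via influence()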

15 / 23 Smoothing splines (§5.4): ridge regression applied to natural cubic splines
Recall y = f(X) + ε; f(X) = Σ_{m=1}^M β_m h_m(X)
Change of notation: f(X) = Σ_{j=1}^N θ_j N_j(X), with knots at each distinct x value
  min_θ (y − Nθ)^T(y − Nθ) + λ θ^T Ω_N θ   (5.11)
where N_ij = N_j(x_i) and {Ω_N}_jk = ∫ N″_j(t) N″_k(t) dt
Note the use of N_j for the natural-spline basis functions
Solution: θ̂ = (N^T N + λ Ω_N)^{−1} N^T y

16 / 23 Smoothing splines (§5.4)
Solution: θ̂ = (N^T N + λ Ω_N)^{−1} N^T y
This solves the variational problem
  argmin_f Σ_i (y_i − f(x_i))² + λ ∫_a^b {f″(t)}² dt
The solution is a natural cubic spline with knots at each x_i
How many parameters have been fit? It can be shown that the solution to the smoothing spline problem gives fitted values of the form ŷ = S_λ y

17 / 23 Smoothing splines (§5.4)
By analogy with ordinary regression, define the effective degrees of freedom (EDF) as trace(S_λ)
In the smoothing case it can be shown that ŷ_smooth = N(N^T N + λ Ω_N)^{−1} N^T y, where N is the basis matrix, so S_λ = N(N^T N + λ Ω_N)^{−1} N^T. See pp. 153–155 for details on S_λ, the smoothing matrix.
How to choose λ?
a) Decide on the df to be used up, e.g. smooth.spline(x, y, df=6); note that increasing df means less bias and more variance.
b) Automatic selection by cross-validation (Figure 5.9)
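
(A hedged sketch of ŷ = S_λ y and the EDF: with the smoothing parameter held fixed, the columns of S_λ can be recovered by smoothing the coordinate vectors, and trace(S_λ) matches the df reported by smooth.spline. The simulated data are made up.)

## Sketch: smoothing splines are linear smoothers; build S_lambda column by column.
set.seed(8)
n <- 50
x <- sort(runif(n))
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

fit <- smooth.spline(x, y, df = 6)
S <- sapply(1:n, function(j) {
  e <- numeric(n); e[j] <- 1                        # j-th coordinate vector
  predict(smooth.spline(x, e, spar = fit$spar), x)$y
})
max(abs(S %*% y - predict(fit, x)$y))               # ~ 0: fitted values are S %*% y
sum(diag(S))                                        # ~ 6: EDF = trace(S_lambda)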

18 / 23 A smoothing spline version of logistic regression is outlined in §5.6, but we'll wait till we discuss generalized additive models.
An example from the R help file for smooth.spline:

> data(cars)
> attach(cars)
> plot(speed, dist, main = "data(cars) & smoothing splines")
> cars.spl <- smooth.spline(speed, dist)
> (cars.spl)
Call:
smooth.spline(x = speed, y = dist)

Smoothing Parameter  spar= 0.7801305  lambda= 0.1112206
Equivalent Degrees of Freedom (Df): 2.635278
Penalized Criterion: 4337.638
GCV: 244.1044
> lines(cars.spl, col = "blue")
> lines(smooth.spline(speed, dist, df=10), lty=2, c
> legend(5, 120, c(paste("default [C.V.] => df =", rou
+        "s( *, df = 10)"), col = c("blue"
+        bg= bisque )

19–22 / 23 [Figures: "data(cars) & smoothing splines" — dist versus speed, built up over four slides; final panel legend: "default [C.V.] => df = 2.6" (solid) and "s( *, df = 10)" (dashed)]

23 / 23 Multidimensional splines (§5.7)
So far we are considering just one X at a time
For regression splines we replace each X by the new columns of its basis matrix
For smoothing splines we get a univariate regression
It is possible to construct smoothing splines for two or more inputs simultaneously, but the computational difficulty increases rapidly; these are called thin plate splines
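
(A hedged pointer for the last item: in R, thin plate regression splines for two or more inputs are available through the mgcv package; these are a low-rank approximation to the full thin plate spline, which sidesteps some of the computational difficulty. A minimal sketch on made-up data, assuming mgcv's defaults.)

## Sketch: a two-input smooth via thin plate regression splines in mgcv.
library(mgcv)
set.seed(9)
n  <- 300
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) * cos(2 * pi * x2) + rnorm(n, sd = 0.3)

fit <- gam(y ~ s(x1, x2, bs = "tp"), method = "REML")   # isotropic thin plate smooth
sum(fit$edf)                                            # total effective degrees of freedom
vis.gam(fit, plot.type = "contour")                     # contour plot of the fitted surface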