STA 450/4000 S: January 2005


STA 450/4000 S: January 2005

Notes
- Friday tutorial on R programming
- Reminder: office hours M, F
- R: The book Modern Applied Statistics with S by Venables and Ripley is very useful. Make sure you have the MASS library available when using R or S-PLUS (in R, type library(MASS)). All the code in the 4th edition of the book is available in a file called scripts, in the MASS subdirectory of the R library. On CQUEST this is in /usr/lib/R/library.
- Undergraduate Summer Research Awards (USRA): see the Statistics office, SS 608; application due Feb 8

Logistic regression (§4.4): likelihood methods

- log-likelihood: $\ell(\beta) = \sum_{i=1}^N \left\{ y_i \beta^T x_i - \log(1 + e^{\beta^T x_i}) \right\}$
- maximum likelihood estimate of $\beta$: solve $\partial\ell(\beta)/\partial\beta = 0$, i.e. $\sum_{i=1}^N y_i x_{ij} = \sum_{i=1}^N p_i(\hat\beta)\, x_{ij}$, $j = 1, \dots, p$
- Fisher information: $-\,\partial^2\ell(\beta)/\partial\beta\,\partial\beta^T = \sum_{i=1}^N x_i x_i^T\, p_i(1 - p_i)$
- fitting: use an iteratively reweighted least squares algorithm; equivalent to Newton-Raphson; p. 99
- asymptotics: $\hat\beta \;\dot\sim\; N\!\left(\beta, \{-\ell''(\hat\beta)\}^{-1}\right)$
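
The iteratively reweighted least squares fit described above can be sketched directly in R; the following is a minimal illustration (the function and variable names are my own, not from the notes), iterating the weighted least-squares update until the coefficients stop changing. In practice use glm(..., family = binomial).

# Minimal IRLS (Newton-Raphson) sketch for logistic regression.
irls_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
  # X: n x (p+1) model matrix including the intercept column; y: 0/1 response
  beta <- rep(0, ncol(X))
  for (it in 1:maxit) {
    eta <- drop(X %*% beta)            # linear predictor
    p   <- 1 / (1 + exp(-eta))         # fitted probabilities p_i(beta)
    W   <- p * (1 - p)                 # IRLS weights
    z   <- eta + (y - p) / W           # adjusted (working) response
    beta_new <- solve(crossprod(X, W * X), crossprod(X, W * z))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  drop(beta)
}
# The result should agree with coef(glm(y ~ X - 1, family = binomial)).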

Logistic regression (§4.4): inference

- component-wise: $\hat\beta_j \;\dot\sim\; N(\beta_j, \hat\sigma_j^2)$, with $\hat\sigma_j^2 = [\{-\ell''(\hat\beta)\}^{-1}]_{jj}$; gives a t-test (z-test) for each component
- $2\{\ell(\hat\beta) - \ell(\beta_j, \hat\beta_{(j)})\} \;\dot\sim\; \chi^2_{\dim\beta_j}$, where $\hat\beta_{(j)}$ maximizes the log-likelihood with $\beta_j$ held fixed; in particular for each component we get a $\chi^2_1$, or equivalently $\mathrm{sign}(\hat\beta_j - \beta_j)\,\sqrt{2\{\ell(\hat\beta) - \ell(\beta_j, \hat\beta_{(j)})\}} \;\dot\sim\; N(0, 1)$
- to compare models $M_0 \subset M_1$ we can use this twice to get $2\{\ell_{M_1}(\hat\beta) - \ell_{M_0}(\tilde\beta_q)\} \;\dot\sim\; \chi^2_{p-q}$, which provides a test of the adequacy of $M_0$
- the LHS is the difference in (residual) deviances; analogous to sums of squares in regression
- See Ch. 4 of the course text, and the algorithm on p. 99 of HTF. (See Figure 4.) (See R code)
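
As a concrete illustration of the deviance comparison, a small sketch (assuming the heart-disease data frame hr that is read in by the R code later in these notes; the particular nested formulas are arbitrary):

# Compare nested logistic models M0 in M1 via the difference in residual deviances.
m0 <- glm(chd ~ tobacco + ldl,       family = binomial, data = hr)
m1 <- glm(chd ~ tobacco + ldl + age, family = binomial, data = hr)
dev.diff <- deviance(m0) - deviance(m1)        # 2{ l_M1(betahat) - l_M0(betatilde) }
df.diff  <- df.residual(m0) - df.residual(m1)  # p - q
pchisq(dev.diff, df = df.diff, lower.tail = FALSE)   # chi-squared p-value: adequacy of M0
anova(m0, m1, test = "Chisq")                        # the same test, computed by R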

Logistic regression (§4.4): extensions

- $E(y_i) = p_i$, $\mathrm{var}(y_i) = p_i(1 - p_i)$ under the Bernoulli model
- often the model is generalized to allow $\mathrm{var}(y_i) = \phi\, p_i(1 - p_i)$; called over-dispersion
- most software provides an estimate of $\phi$ based on residuals
- if $y_i \sim \mathrm{Binom}(n_i, p_i)$ the same model applies: $E(y_i) = n_i p_i$ and $\mathrm{var}(y_i) = n_i p_i(1 - p_i)$ under the Binomial
- model selection uses a $C_p$-like criterion called AIC
- in S-PLUS or R, use glm to fit logistic regression, stepAIC for model selection
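
A brief sketch of these facilities (using the hr data frame defined in the code that follows; family = quasibinomial is one standard way R exposes the over-dispersion parameter $\phi$):

library(MASS)   # provides stepAIC
fit  <- glm(chd ~ sbp + tobacco + ldl + famhist + obesity + alcohol + age,
            family = binomial, data = hr)
# Same mean model, but an estimated dispersion phi is used for standard errors:
fitq <- update(fit, family = quasibinomial)
summary(fitq)$dispersion      # estimate of phi based on residuals
stepAIC(fit, trace = FALSE)   # AIC-based model selection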

Logistic regression (§4.4)

> hr <- read.table("heart.data", header = T)
> dim(hr)
[] 46
> hr <- data.frame(hr)
> pairs(hr[:0], pch = , bg = c("red","green")[codes(factor(hr$chd))])
> glm(chd ~ sbp + tobacco + ldl + famhist + obesity + alcohol + age,
+     family = binomial, data = hr)

Call:  glm(formula = chd ~ sbp + tobacco + ldl + famhist + obesity +
    alcohol + age, family = binomial, data = hr)

Coefficients:
   (Intercept)             sbp         tobacco             ldl
     -4.90787       0.0057608         0.07957         0.84770
famhistpresent         obesity         alcohol             age
        0.990       -0.045467       0.0006058         0.04544

Degrees of Freedom: 46 Total (i.e. Null);  454 Residual
Null Deviance:      596.
Residual Deviance: 48.     AIC: 499.

> hr.glm <- .Last.value
> coef(hr.glm)
   (Intercept)            sbp        tobacco            ldl famhistpresent
   -4.9078750     0.005760899      0.0795750    0.847709867         0.9904
       obesity        alcohol            age
  -0.045466980    0.000605845     0.04544469

Logistic regression (§4.4)

> summary(hr.glm)

Call:  glm(formula = chd ~ sbp + tobacco + ldl + famhist + obesity +
    alcohol + age, family = binomial, data = hr)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
  -.757   -0.879   -0.455     0.99      .44

Coefficients:
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -4.90787    0.960686    -4.98    .7e-05  ***
sbp              0.0057608  0.005650      .04    0.0577
tobacco          0.07957    0.06876       .07    0.009   **
ldl              0.84770    0.0576        .4     0.007   **
famhistpresent   0.990      0.468        4.86    .84e-05 ***
obesity         -0.045467   0.0905       -.89    0.440
alcohol          0.0006058  0.0044490     0.6    0.8968
age              0.04544    0.0095       4.99    .68e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 596.  on 46  degrees of freedom
Residual deviance: 48.7  on 454 degrees of freedom
AIC: 499.7

Number of Fisher Scoring iterations:

Logistic regression (§4.4)

> library(MASS)
> hr.step <- stepAIC(hr.glm, trace = F)
> anova(hr.step)
Analysis of Deviance Table

Model: binomial, link: logit

Response: chd

Terms added sequentially (first to last)

         Df Deviance Resid. Df Resid. Dev
NULL                        46      596.
tobacco   1     4.46       460     554.65
ldl       1      .         459       5.5
famhist   1     4.8        458     507.4
age       1      .80       457     485.44

> coef(hr.step)
   (Intercept)        tobacco            ldl famhistpresent            age
      -4.0899      0.0806979       0.675745       0.940600      0.0440574

Logistic regression (§4.4)

> summary(hr.step)

Call:  glm(formula = chd ~ tobacco + ldl + famhist + age, family = binomial,
    data = hr)

Deviance Residuals:
     Min       1Q   Median       3Q      Max
  -.7559   -0.86   -0.4546    0.9457    .490

Coefficients:
                Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -4.080     0.49457     -8.50    < 2e-16  ***
tobacco          0.080698  0.05484       .67    0.0054   **
ldl              0.67574   0.05409      .098    0.0095   **
famhistpresent   0.94060   0.66         4.50    .e-05    ***
age              0.04406   0.009696     4.54    5.58e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 596.   on 46  degrees of freedom
Residual deviance: 485.44 on 457 degrees of freedom
AIC: 495.44

Number of Fisher Scoring iterations:

Logistic regression (§4.4): interpretation of coefficients

- e.g. tobacco (measured in kg): coefficient = 0.081
- $\mathrm{logit}\{p_i(\beta)\} = \beta^T x_i$; an increase of one unit in $x_{ij}$, say, leads to an increase in $\mathrm{logit}\,p_i$ of 0.081, i.e. an increase in the odds $p_i/(1 - p_i)$ by a factor of $\exp(0.081) = 1.084$
- estimated s.e. 0.026, so $\mathrm{logit}\,p_i$ changes by $0.081 \pm 2(0.026)$; $(\exp(0.081 - 2 \times 0.026),\ \exp(0.081 + 2 \times 0.026))$ is $(1.03, 1.14)$
- similarly for age: $\hat\beta_j = 0.044$; increased odds of 1.045 for a 1-year increase
- prediction of new values: assign to class 1 or 0 according as $\hat p > (<)\ 0.5$
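
These odds-ratio calculations can be reproduced from the fitted object; a small sketch (assuming the hr.step fit from the preceding slides):

exp(coef(hr.step))["tobacco"]      # odds multiplier per kg of tobacco, about 1.084
se <- sqrt(diag(vcov(hr.step)))["tobacco"]
exp(coef(hr.step)["tobacco"] + c(-2, 2) * se)    # approximate 95% interval for the odds ratio
table(predict(hr.step, type = "response") > 0.5, hr$chd)   # classify to 1 when phat > 0.5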

Logistic regression (§4.4): generalizing to K classes

1. multivariate logistic regression: model
   $\log\dfrac{\mathrm{pr}(y = 1 \mid x)}{\mathrm{pr}(y = K \mid x)}, \;\dots,\; \log\dfrac{\mathrm{pr}(y = K-1 \mid x)}{\mathrm{pr}(y = K \mid x)}$
2. impose an ordering (polytomous regression: MASS p.??? and polr)
3. multinomial distribution (related to neural networks: MASS p.??? and ??)
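
A hedged sketch of the two R routines mentioned above (multinom from the nnet package for the unordered multinomial model, polr from MASS for the ordered case); the iris and housing data sets are used purely as stand-ins:

library(MASS)    # polr: proportional-odds (ordered) logistic regression
library(nnet)    # multinom: multinomial logistic regression
# Unordered K-class response:
fit.multi <- multinom(Species ~ ., data = iris, trace = FALSE)
head(predict(fit.multi, type = "probs"))   # estimated pr(y = k | x) for each class
# Ordered response (housing$Sat is an ordered factor, shipped with MASS):
fit.polr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(fit.polr)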

Discriminant analysis (§4.3)

- $y \in \{1, 2, \dots, K\}$, and $f_c(x) = f(x \mid y = c)$ = density of $x$ in class $c$
- Bayes' Theorem: $\mathrm{pr}(y = c \mid x) = \dfrac{f(x \mid y = c)\,\pi_c}{f(x)}$, $c = 1, \dots, K$
- associated classification rule: assign a new observation to class $c$ if $p(y = c \mid x) > p(y = k \mid x)$ for all $k \ne c$ (maximize the posterior probability)

Discriminant analysis (§4.3)

$x \mid y = k \sim N_p(\mu_k, \Sigma_k)$

$p(y = k \mid x) \propto \pi_k (2\pi)^{-p/2} |\Sigma_k|^{-1/2} \exp\{-\tfrac12 (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\}$

which is maximized by maximizing the log:

$\max_k \{\log \pi_k - \tfrac12 \log|\Sigma_k| - \tfrac12 (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\}$

If we further assume $\Sigma_k = \Sigma$, then

$\max_k \{\log \pi_k - \tfrac12 (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\}$
$= \max_k \{\log \pi_k - \tfrac12 (x^T \Sigma^{-1} x - x^T \Sigma^{-1} \mu_k - \mu_k^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} \mu_k)\}$
$= \max_k \{\log \pi_k + x^T \Sigma^{-1} \mu_k - \tfrac12 \mu_k^T \Sigma^{-1} \mu_k\}$

Discriminant analysis (§4.3)

Procedure: compute
$\delta_k(x) = \log \pi_k + x^T \Sigma^{-1} \mu_k - \tfrac12 \mu_k^T \Sigma^{-1} \mu_k$
and classify observation $x$ to class $c$ if $\delta_c(x)$ is largest (see Figure 4.5, left).

Estimate the unknown parameters $\pi_k$, $\mu_k$, $\Sigma$:
$\hat\pi_k = N_k / N$
$\hat\mu_k = \sum_{i: y_i = k} x_i / N_k$
$\hat\Sigma = \sum_{k=1}^K \sum_{i: y_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K)$
(see Figure 4.5, right)
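
The plug-in rule above is short enough to code directly; a minimal sketch (the helper name lda_deltas is mine, not from the notes), which can be checked against MASS::lda:

# Plug-in LDA: estimate pi_k, mu_k and the pooled Sigma, then evaluate delta_k(xnew).
lda_deltas <- function(X, y, xnew) {
  X <- as.matrix(X)
  classes <- sort(unique(y))
  N <- nrow(X); K <- length(classes)
  Sigma <- matrix(0, ncol(X), ncol(X))
  mus <- vector("list", K); pis <- numeric(K)
  for (k in seq_len(K)) {
    Xk <- X[y == classes[k], , drop = FALSE]
    pis[k]   <- nrow(Xk) / N                            # pi_k hat
    mus[[k]] <- colMeans(Xk)                            # mu_k hat
    Sigma <- Sigma + crossprod(sweep(Xk, 2, mus[[k]]))  # sum of (x_i - mu_k)(x_i - mu_k)^T
  }
  Sigma <- Sigma / (N - K)                              # pooled covariance estimate
  Sinv  <- solve(Sigma)
  sapply(seq_len(K), function(k)
    log(pis[k]) + xnew %*% Sinv %*% mus[[k]] - 0.5 * t(mus[[k]]) %*% Sinv %*% mus[[k]])
}
# Classify xnew to the class whose delta_k is largest: which.max(lda_deltas(X, y, xnew)).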

Discriminant analysis (§4.3)

Special case: 2 classes. Choose class 2 if
$\log \hat\pi_2 + x^T \hat\Sigma^{-1} \hat\mu_2 - \tfrac12 \hat\mu_2^T \hat\Sigma^{-1} \hat\mu_2 > \log \hat\pi_1 + x^T \hat\Sigma^{-1} \hat\mu_1 - \tfrac12 \hat\mu_1^T \hat\Sigma^{-1} \hat\mu_1$,
i.e.
$x^T \hat\Sigma^{-1} (\hat\mu_2 - \hat\mu_1) > \tfrac12 \hat\mu_2^T \hat\Sigma^{-1} \hat\mu_2 - \tfrac12 \hat\mu_1^T \hat\Sigma^{-1} \hat\mu_1 + \log(N_1/N) - \log(N_2/N)$

- Note: it is common to specify $\pi_k = 1/K$ in advance rather than estimating it from the data
- If the $\Sigma_k$ are not all equal, the discriminant function $\delta_k(x)$ defines a quadratic boundary; see Figure 4.6, left
- An alternative is to augment the original set of features with quadratic terms and use linear discriminant functions; see Figure 4.6, right
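
The last point (linear discriminants on an augmented, quadratic feature set) can be tried in a few lines; a sketch using MASS::lda on made-up two-dimensional data (the data and names here are illustrative only):

library(MASS)
set.seed(1)
# Toy data: two classes whose true boundary is curved.
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- factor(ifelse(x1^2 + x2 > 0.5, 1, 2))
d  <- data.frame(y, x1, x2)
lda.lin  <- lda(y ~ x1 + x2, data = d)                        # linear boundary in (x1, x2)
lda.quad <- lda(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1*x2),   # linear in augmented features,
                data = d)                                     # quadratic in the original ones
mean(predict(lda.quad)$class == d$y)    # training accuracy of the augmented fit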

Discriminant analysis (§4.3)

Another description of LDA (§4.3.2, §4.3.3): Let
W = within-class covariance matrix ($\hat\Sigma$)
B = between-class covariance matrix

Find $a^T X$ such that $a^T B a$ is maximized and $a^T W a$ is minimized, i.e.
$\max_a \dfrac{a^T B a}{a^T W a}$,
equivalently $\max_a a^T B a$ subject to $a^T W a = 1$.

The solution $a_1$, say, is the eigenvector of $W^{-1} B$ corresponding to the largest eigenvalue. This determines a line in $R^p$. Continue, finding $a_2$, orthogonal (with respect to $W$) to $a_1$, which is the eigenvector corresponding to the second largest eigenvalue, and so on.
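
This eigen-decomposition view is easy to sketch in R (the variable names are mine; MASS::lda reports the same directions, up to scaling, as its coefficients of linear discriminants). The iris data stand in here for a generic grouped data set:

# Canonical variates from the eigenvectors of W^{-1} B.
X <- as.matrix(iris[, 1:4]); y <- iris$Species
mu <- colMeans(X)
p  <- ncol(X)
W <- matrix(0, p, p); B <- matrix(0, p, p)
for (k in levels(y)) {
  Xk  <- X[y == k, ]
  muk <- colMeans(Xk)
  W <- W + crossprod(sweep(Xk, 2, muk))        # within-class scatter
  B <- B + nrow(Xk) * tcrossprod(muk - mu)     # between-class scatter
}
ev <- eigen(solve(W) %*% B)
a  <- Re(ev$vectors[, 1:2])   # first two linear discriminants (canonical variates)
Z  <- X %*% a                 # discriminant scores; plot(Z, col = y) shows the groups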

Discriminant analysis (§4.3)

- There are at most min(p, K − 1) positive eigenvalues. These eigenvectors are the linear discriminants, also called canonical variates.
- This technique can be useful for visualization of the groups. Figure 4. shows the first two canonical variables for a data set with 10 classes.
- (§4.3.3) write $\hat\Sigma = U D U^T$, where $U^T U = I$ and $D$ is diagonal (see p.87 for $\hat\Sigma$); let $X^* = D^{-1/2} U^T X$, so that $\hat\Sigma^* = I$
- the classification rule is to choose $k$ if $\hat\mu^*_k$ is closest (closest class centroid)
- this only needs the $K$ points $\hat\mu^*_k$, and the (K − 1)-dimensional subspace they span, since the remaining directions are orthogonal (in the $X^*$ space)
- if $K = 3$ we can plot the first two variates (cf. the wine data)
- See p.9, Figures 4.4 and 4.8 (the algorithm on p.9 finds the best subspaces in order, as described on the previous slide)

Discriminant analysis (§4.3): notes

- §4.2 considers linear regression of a 0/1 variable on several inputs (odd from a statistical point of view)
- how to choose between logistic regression and discriminant analysis? they give the same classification error on the heart data (is this a coincidence?)
- logistic regression and its generalizations to K classes don't assume any distribution for the inputs
- discriminant analysis is more efficient if the assumed distribution is correct
- warning: in §4.3, x and x_i are p-vectors, and we estimate $\beta_0$ and $\beta$, the latter a p-vector; in §4.4 they are (p + 1)-vectors with first element equal to 1, and $\beta$ is (p + 1)-dimensional

R code for the wine data

> library(MASS)
> wine.lda <- lda(class ~ alcohol + malic + ash + alcil + mag + totphen +
+     flav + nonflav + proanth + col + hue + dil + proline, data = wine)
> wine.lda
Call:
lda.formula(class ~ alcohol + malic + ash + alcil + mag + totphen + flav +
    nonflav + proanth + col + hue + dil + proline, data = wine)

Prior probabilities of groups:
       1         2         3
  0.4607  0.988764   0.69669

Group means:
   alcohol   malic     ash    alcil     mag  totphen     flav  nonflav
1  .74475   .00678  .45559   7.079   06.90   .84069   .9879   0.90000
2  .787     .9676   .44789   0.80    94.549  .5887    .080845 0.666
3  .575     .750    .4708    .4667   99.5    .678750  0.78458 0.447500
   proanth     col      hue     dil  proline
1  .899      5.5805   .0609   .57797 5.79
2  .608      .08660   .05687  .7855  59.5070
3  .554      7.9650   0.68708 .6854  69.8958

Coefficients of linear discriminants:
                 LD1           LD2
alcohol   -0.409978    0.87790699
malic      0.6554596   0.057975
ash       -0.6907556   .458497486
alcil      0.54797889 -0.46807654
mag       -0.006496   -0.000467565
totphen    0.6805068  -0.087
flav      -.6695      -0.49998054
nonflav   -.49588440  -.6095795
proanth    0.40968    -0.070875776
col        0.5505570   0.506865
hue       -0.880607   -.55644987

R code for the wine data

[Figure: the wine data plotted on the first two linear discriminants, LD1 vs LD2.]
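
The plot on this slide can be reproduced from the fitted object along these lines (a sketch; the exact plotting options used for the original figure are not recoverable):

wine.ld <- predict(wine.lda)$x     # scores on the linear discriminants
plot(wine.ld[, 1], wine.ld[, 2], xlab = "LD1", ylab = "LD2",
     col = as.numeric(factor(wine$class)), pch = 20)
# equivalently, plot(wine.lda) draws the group-labelled scatterplot directly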

Separating hyperplanes (§4.5)

- assume two classes only; change notation so that $y = \pm 1$
- use linear combinations of the inputs to predict $y$:
  $\hat y = \begin{cases} -1 & \text{if } \beta_0 + x^T\beta < 0 \\ +1 & \text{if } \beta_0 + x^T\beta > 0 \end{cases}$
- misclassification error $D(\beta, \beta_0) = -\sum_{i \in M} y_i(\beta_0 + x_i^T\beta)$, where $M = \{i : y_i(\beta_0 + x_i^T\beta) < 0\}$
- note that $D(\beta, \beta_0) > 0$ and is proportional to the size of $\beta_0 + x_i^T\beta$ at the misclassified points
- one can show that an algorithm (the perceptron algorithm of §4.5.1) to minimize $D(\beta, \beta_0)$ exists and converges to a plane that separates the $y = +1$ points from the $y = -1$ points, if such a plane exists
- but it will cycle if no such plane exists, and will be very slow if the gap is small
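
A minimal perceptron-style sketch of this minimization (illustration only; the function name, step size and stopping rule are my own choices, not from the notes):

# Perceptron updates: cycle through misclassified points, nudging (beta0, beta).
perceptron <- function(X, y, rate = 1, maxpass = 100) {
  # X: n x p inputs; y: labels coded +1 / -1
  beta0 <- 0; beta <- rep(0, ncol(X))
  for (pass in 1:maxpass) {
    wrong <- which(y * (beta0 + X %*% beta) <= 0)   # currently misclassified points
    if (length(wrong) == 0) break                   # separating plane found
    i <- wrong[1]
    beta0 <- beta0 + rate * y[i]
    beta  <- beta  + rate * y[i] * X[i, ]
  }
  list(beta0 = beta0, beta = beta, converged = (length(wrong) == 0))
}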

Separating hyperplanes (§4.5)

- also, if one separating plane exists there are likely many (Figure 4.)
- the plane that defines the largest gap is defined to be best
- one can show that this plane solves
  $\min_{\beta_0, \beta} \|\beta\| \quad \text{subject to } y_i(\beta_0 + x_i^T\beta) \ge 1, \; i = 1, \dots, N \qquad (4.44)$
  (see Figure 4.5)
- the points on the edges (margin) of the gap are called support points or support vectors; there are typically many fewer of these than original points
- this is the basis for the development of Support Vector Machines (SVMs); more later
- sometimes we add features by using basis expansions; to be discussed first in the context of regression (Chapter 5)
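
For a preview of how this optimization is used in practice, a hedged sketch with the svm function from the e1071 package (not part of these notes; kernlab::ksvm would work similarly), fitting a linear maximum-margin classifier to made-up separable data:

library(e1071)
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = -2), ncol = 2),
           matrix(rnorm(40, mean =  2), ncol = 2))
y <- factor(rep(c(-1, 1), each = 20))
fit <- svm(X, y, kernel = "linear", cost = 100, scale = FALSE)  # large cost ~ hard margin
fit$index                  # which training points are the support vectors
table(predict(fit, X), y)  # training classification table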