Lecture 4: Newton's method and gradient descent

Outline: Newton's method; functional iteration; fitting linear regression; fitting logistic regression. Prof. Yao Xie, ISyE 6416, Computational Statistics, Georgia Tech.

Newton's method for finding a root of a function: solve g(x) = 0. Iterative method:

    x_n = x_{n−1} − g(x_{n−1}) / g'(x_{n−1}).
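A minimal sketch of this iteration in Python (the helper name newton_root and the test function g(x) = x² − 2 are illustrative choices, not from the lecture):

    # Newton's method for solving g(x) = 0 (sketch).
    def newton_root(g, dg, x0, tol=1e-10, max_iter=50):
        # Iterate x_n = x_{n-1} - g(x_{n-1}) / g'(x_{n-1}) until |g(x)| < tol.
        x = x0
        for _ in range(max_iter):
            x = x - g(x) / dg(x)
            if abs(g(x)) < tol:
                break
        return x

    # Illustrative use: the positive root of g(x) = x^2 - 2 is sqrt(2).
    root = newton_root(lambda x: x**2 - 2, lambda x: 2 * x, x0=1.0)
    print(root)  # about 1.414213562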

Quadratic convergence of Newton's method. Let e_n = x_n − x*. Newton's method has quadratic convergence:

    lim_{n→∞} e_n / e_{n−1}² = (1/2) f''(x*),

where f is the iteration function. By contrast, linear convergence means

    lim_{n→∞} e_n / e_{n−1} = f'(x*),  with 0 < |f'(x*)| < 1.
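A quick numerical illustration of the quadratic rate (assumed example, not from the lecture): for g(x) = x² − 2 with root x* = √2, the ratio e_n / e_{n−1}² settles near a constant.

    import math

    # Empirical check of quadratic convergence of Newton's method on g(x) = x^2 - 2.
    g = lambda x: x**2 - 2
    dg = lambda x: 2 * x
    x_star = math.sqrt(2)

    x = 1.0
    for n in range(1, 5):
        x_new = x - g(x) / dg(x)
        e_prev, e_new = x - x_star, x_new - x_star
        print(n, e_new / e_prev**2)  # approaches a constant (about 0.354 here)
        x = x_new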

Finding the maximum. Finding the maximum of a function f(x): max_{x ∈ D} f(x). First-order condition: ∇f(x) = 0.

Newton's method for finding the maximum of a function g: R^d → R, with x ∈ R^d, x = [x_1, ..., x_d]^T. Newton's method for finding the maximum:

    x_n = x_{n−1} − [H(x_{n−1})]^{−1} ∇g(x_{n−1}),

where ∇g is the gradient vector and H is the Hessian matrix:

    [∇g]_i = ∂g(x)/∂[x]_i,   [H(x)]_{ij} = ∂²g(x)/∂[x]_i ∂[x]_j.
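A sketch of the multivariate Newton update in Python (NumPy; the quadratic test objective is an illustrative choice, not from the lecture):

    import numpy as np

    # Newton's method for maximizing g: R^d -> R (sketch).
    # grad and hess are callables returning the gradient vector and Hessian matrix.
    def newton_maximize(grad, hess, x0, n_iter=20):
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            # x_n = x_{n-1} - H(x_{n-1})^{-1} grad g(x_{n-1})
            x = x - np.linalg.solve(hess(x), grad(x))
        return x

    # Illustrative use: g(x) = -(x1 - 1)^2 - 2*(x2 + 3)^2 has its maximum at (1, -3).
    grad = lambda x: np.array([-2 * (x[0] - 1), -4 * (x[1] + 3)])
    hess = lambda x: np.array([[-2.0, 0.0], [0.0, -4.0]])
    print(newton_maximize(grad, hess, x0=[0.0, 0.0]))  # ~ [ 1. -3.]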

Gradient descent. Gradient descent for finding the maximum of a function (i.e., gradient ascent) iterates

    x_n = x_{n−1} + μ ∇g(x_{n−1}),

where μ is the step size. Gradient descent can be viewed as Newton's method with the Hessian approximated by a scaled negative identity, H(x_{n−1}) = −(1/μ) I.
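For comparison, the same illustrative objective as above can be maximized with the gradient step alone (a sketch; the step size μ = 0.1 is an arbitrary choice):

    import numpy as np

    # Gradient ascent x_n = x_{n-1} + mu * grad g(x_{n-1}) (sketch).
    def gradient_ascent(grad, x0, mu=0.1, n_iter=1000):
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            x = x + mu * grad(x)
        return x

    # Same illustrative objective as above: maximum at (1, -3).
    grad = lambda x: np.array([-2 * (x[0] - 1), -4 * (x[1] + 3)])
    print(gradient_ascent(grad, x0=[0.0, 0.0]))  # ~ [ 1. -3.]

Convergence here requires a small enough step size; Newton's method adapts the step through the Hessian, at the cost of solving a linear system per iteration.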

Maximum likelihood. θ: parameter, x: data. Log-likelihood function: l(θ | x) = log f(x | θ), and

    θ̂_ML = arg max_θ l(θ | x).

We drop the dependence on x and write l(θ), but remember that l(θ) is a function of the data x. Maximize the log-likelihood function by setting dl(θ)/dθ = 0.
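A standard worked example (Bernoulli model, included here for illustration; it is not part of the slides): if x_1, ..., x_n are independent Bernoulli(θ) observations, then l(θ) = (Σ_i x_i) log θ + (n − Σ_i x_i) log(1 − θ). Setting dl(θ)/dθ = (Σ_i x_i)/θ − (n − Σ_i x_i)/(1 − θ) = 0 gives θ̂_ML = (1/n) Σ_i x_i, the sample mean.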

Linear regression. n observations, 1 predictor variable, 1 response variable. Find parameters a ∈ R, b ∈ R to minimize the mean squared error:

    (â, b̂) = arg min_{a,b} (1/n) Σ_{i=1}^n (y_i − a x_i − b)².

The solution is

    â = (S_xy − x̄ ȳ) / (S_xx − x̄²),   b̂ = ȳ − â x̄,

where S_xy = (1/n) Σ_i x_i y_i, S_xx = (1/n) Σ_i x_i², and x̄, ȳ are the sample means.
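The closed form is straightforward to evaluate; a short NumPy sketch on simulated data (the simulated coefficients a = 2, b = 1 are illustrative choices):

    import numpy as np

    # Closed-form least-squares fit of y ~ a*x + b on simulated data (sketch).
    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)

    S_xx, S_xy = np.mean(x * x), np.mean(x * y)
    a_hat = (S_xy - x.mean() * y.mean()) / (S_xx - x.mean() ** 2)
    b_hat = y.mean() - a_hat * x.mean()
    print(a_hat, b_hat)  # close to the true values a = 2, b = 1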

Multiple linear regression. n observations, p predictors, 1 response variable. Find parameters a ∈ R^p, b ∈ R to minimize the mean squared error:

    (â, b̂) = arg min_{a,b} (1/n) Σ_{i=1}^n ( y_i − Σ_{j=1}^p a_j x_{ij} − b )².

The slope vector satisfies â = Σ_xx^{−1} Σ_xy, where Σ_xx and Σ_xy are the sample covariance matrices of the predictors and of the predictors with the response, respectively. This is difficult when p > n and when p is large; use an iterative method (gradient descent, etc.), as sketched below.
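A sketch of gradient descent on the least-squares objective (simulated data; the step size, iteration count, and data-generating coefficients are arbitrary illustrative choices):

    import numpy as np

    # Gradient descent for multiple linear regression (sketch).
    rng = np.random.default_rng(0)
    n, p = 200, 5
    X = rng.normal(size=(n, p))
    a_true = np.arange(1, p + 1, dtype=float)
    y = X @ a_true + 0.5 + 0.1 * rng.normal(size=n)

    a, b = np.zeros(p), 0.0
    mu = 0.1                        # step size
    for _ in range(2000):
        r = y - X @ a - b           # residuals
        a = a + mu * (X.T @ r) / n  # negative-gradient direction of the MSE (up to a factor of 2)
        b = b + mu * r.mean()
    print(a, b)                     # close to a_true = (1, ..., 5) and b = 0.5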

Prostate cancer. The data for this example come from a study by Stamey et al. (1989). They examined the correlation between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The variables are log cancer volume (lcavol), log prostate weight (lweight), age, log of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log of capsular penetration (lcp), Gleason score (gleason), and percent of Gleason scores 4 or 5 (pgg45). The correlation matrix of the predictors given in Table 3.1 shows many strong correlations. Figure 1.1 (page 3) of Chapter 1 is a scatterplot matrix showing every pairwise plot between the variables. We see that svi is a binary variable, and gleason is an ordered categorical variable. We see, for example, that both lcavol and lcp show a strong relationship with the response lpsa, and with each other. We need to fit the effects jointly to untangle the relationships between the predictors and the response.

TABLE 3.1. Correlations of predictors in the prostate cancer data.

               lcavol  lweight    age   lbph    svi    lcp  gleason
    lweight     0.300
    age         0.286    0.317
    lbph        0.063    0.437  0.287
    svi         0.593    0.181  0.129  0.139
    lcp         0.692    0.157  0.173  0.089  0.671
    gleason     0.426    0.024  0.366  0.033  0.307  0.476
    pgg45       0.483    0.074  0.276  0.030  0.481  0.663    0.757

TABLE 3.2. Linear model fit to the prostate cancer data. The Z score is the coefficient divided by its standard error (3.12). Roughly, a Z score larger than two in absolute value is significantly nonzero at the p = 0.05 level.

    Term        Coefficient  Std. Error  Z Score
    Intercept          2.46        0.09    27.60
    lcavol             0.68        0.13     5.37
    lweight            0.26        0.10     2.75
    age                0.14        0.10     1.40
    lbph               0.21        0.10     2.06
    svi                0.31        0.12     2.47
    lcp                0.29        0.15     1.87
    gleason            0.02        0.15     0.15
    pgg45              0.27        0.15     1.74

Newton's method for fitting the logistic regression model. n observations (x_i, y_i), i = 1, ..., n; parameters a ∈ R^p, b ∈ R; predictor x_i ∈ R^p, label y_i ∈ {0, 1}.

    p(y_i = 1 | x_i) = h(x_i; a, b) = 1 / (1 + e^{−a^T x_i − b}).
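A sketch of the Newton iteration for this model on simulated data (the data-generating coefficients are illustrative choices; the gradient and Hessian below are those of the logistic log-likelihood):

    import numpy as np

    # Newton's method for logistic regression (sketch); the intercept b is handled
    # by appending a column of ones to the design matrix.
    rng = np.random.default_rng(0)
    n, p = 500, 2
    X = rng.normal(size=(n, p))
    a_true, b_true = np.array([1.0, -2.0]), 0.5
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ a_true + b_true))))

    Z = np.column_stack([X, np.ones(n)])             # last coefficient plays the role of b
    theta = np.zeros(p + 1)
    for _ in range(25):
        h = 1.0 / (1.0 + np.exp(-(Z @ theta)))       # fitted probabilities h(x_i; a, b)
        grad = Z.T @ (y - h)                         # gradient of the log-likelihood
        hess = -(Z.T * (h * (1.0 - h))) @ Z          # Hessian of the log-likelihood
        theta = theta - np.linalg.solve(hess, grad)  # Newton step
    print(theta)                                     # roughly [1.0, -2.0, 0.5]

Each Newton step here amounts to a weighted least-squares solve, which is why this procedure is also known as iteratively reweighted least squares (IRLS).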

South African Heart Disease. Here we present an analysis of binary data to illustrate the traditional statistical use of the logistic regression model. The data in Figure 4.12 are a subset of the Coronary Risk-Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa (Rousseauw et al., 1983). The aim of the study was to establish the intensity of ischemic heart disease risk factors in that high-incidence region. The data represent white males between 15 and 64, and the response variable is the presence or absence of myocardial infarction (MI) at the time of the survey (the overall prevalence of MI was 5.1% in this region). There are 160 cases in our data set, and a sample of 302 controls. These data are described in more detail in Hastie and Tibshirani (1987). We fit a logistic regression model by maximum likelihood; the fitted model is discussed in the following slides.

Example: South African heart disease data (from ESL Section 4.4.2). There are n = 462 individuals broken up into 160 cases (those who have coronary heart disease) and 302 controls (those who don't). There are p = 7 variables measured on each individual:
- sbp (systolic blood pressure)
- tobacco (lifetime tobacco consumption in kg)
- ldl (low density lipoprotein cholesterol)
- famhist (family history of heart disease, present or absent)
- obesity
- alcohol
- age

[Figure: pairs plot of the predictors; red points are cases, green points are controls.]

Fitted logistic regression model: the Z score is the coefficient divided by its standard error; the associated test for significance is called the Wald test. Just as in linear regression, correlated variables can cause problems with interpretation. E.g., sbp and obesity are not significant, and obesity has a negative sign! (Marginally, these are both significant and have positive signs.)

After repeatedly dropping the least significant variable and refitting: this procedure was stopped when all remaining variables were significant. E.g., interpretation of the tobacco coefficient: increasing the tobacco usage over the course of one's lifetime by 1 kg (and keeping all other variables fixed) multiplies the estimated odds of coronary heart disease by exp(0.081) ≈ 1.084, or in other words, increases the odds by 8.4%.

Functional iteration. To find a root of the equation g(x) = 0, introduce f(x) = g(x) + x, so that g(x) = 0 if and only if f(x) = x. In many examples, the iterates x_n = f(x_{n−1}) converge to x* = f(x*); x* is called a fixed point of f(x). Newton's method is x_n = f(x_{n−1}) with f(x) = x − g(x)/g'(x).
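A generic sketch of the fixed-point iteration in Python (the test function cos(x) is an illustrative choice, not the example used later in the lecture):

    import math

    # Functional (fixed-point) iteration x_n = f(x_{n-1}) (sketch).
    def fixed_point(f, x0, tol=1e-10, max_iter=1000):
        x = x0
        for _ in range(max_iter):
            x_new = f(x)
            if abs(x_new - x) < tol:
                return x_new
            x = x_new
        return x

    # Illustrative use: f(x) = cos(x) is a contraction near its fixed point,
    # so the iterates converge to x* with cos(x*) = x* (about 0.7391).
    print(fixed_point(math.cos, x0=1.0))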

Convergence. Theorem. Suppose the function f(x) defined on a closed interval I satisfies the conditions:
1. f(x) ∈ I whenever x ∈ I;
2. |f(y) − f(x)| ≤ λ |y − x| for any two points x and y in I.
Then, provided the Lipschitz constant λ is in [0, 1), f(x) has a unique fixed point x* ∈ I, and x_n = f(x_{n−1}) converges to x* regardless of the starting point x_0 ∈ I. Furthermore, we have

    |x_n − x*| ≤ (λ^n / (1 − λ)) |x_1 − x_0|.

Proof:

First consider

    |x_{k+1} − x_k| = |f(x_k) − f(x_{k−1})| ≤ λ |x_k − x_{k−1}| ≤ ... ≤ λ^k |x_1 − x_0|.

Then for any m > n,

    |x_m − x_n| ≤ Σ_{k=n}^{m−1} |x_{k+1} − x_k| ≤ Σ_{k=n}^{m−1} λ^k |x_1 − x_0| ≤ (λ^n / (1 − λ)) |x_1 − x_0|.

So {x_n} forms a Cauchy sequence. Since I ⊂ R is closed and bounded, it is compact.

By Theorem 3.11(b) in Rudin¹, if X is a compact metric space and {x_n} is a Cauchy sequence in X, then {x_n} converges to some point of X. Hence x_n → x' as n → ∞, with x' ∈ I. Now, since f is continuous (it is in fact Lipschitz continuous), this means f(x') = x', so x' is a fixed point. The fixed point is unique: if x* and y* were two fixed points, then |x* − y*| = |f(x*) − f(y*)| ≤ λ |x* − y*|, which forces x* = y* since λ < 1. Therefore x' = x*. To prove the "furthermore" part, since |x_m − x_n| ≤ (λ^n / (1 − λ)) |x_1 − x_0|, we can send m → ∞ and obtain |x* − x_n| ≤ (λ^n / (1 − λ)) |x_1 − x_0|.

¹ Walter Rudin, Principles of Mathematical Analysis.

Example. g(x) = sin((x + e^{−1}) / π); find x such that g(x) = 0. Take f(x) = g(x) + x, so f'(x) = (1/π) cos((x + e^{−1}) / π) + 1. We have |f'(x)| < 1 on [−e^{−1}, π²/2 − e^{−1}] = [−0.3679, 4.5669], so we can apply functional iteration to g(x).

Let I = [−0.3, 3]; then f(x) ∈ I whenever x ∈ I, and λ < 1 for f(x) on this range.

[Figure: plot of f(x) over the interval I.]