Optimization. September 4, 2018


Optimization problem

An optimization problem is the problem of finding the best solution for an objective function. Optimization methods play an important role in statistics, for example, in finding the maximum likelihood estimate (MLE). Unconstrained vs. constrained optimization problems: whether there are constraints on the solution space. Most algorithms are based on iterative procedures. We will spend the next few lectures on several optimization methods in the context of statistics:
- Newton-Raphson, Fisher scoring, etc.
- EM and MM.
- Hidden Markov models.
- Linear and quadratic programming.

Review: Newton-Raphson (NR) method

Goal: find the root of the equation f(θ) = 0.

Approach:
1. Choose an initial value θ^(0) as the starting point.
2. By Taylor expansion at θ^(0), f(θ) ≈ f(θ^(0)) + f'(θ^(0))(θ − θ^(0)). Setting f(θ) = 0 gives an update of the parameter: θ^(1) = θ^(0) − f(θ^(0))/f'(θ^(0)).
3. Repeat the update until convergence: θ^(k+1) = θ^(k) − f(θ^(k))/f'(θ^(k)).
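The update above fits in a few lines of code; a minimal Python sketch (the function, starting point, and tolerance are illustrative choices, not from the notes):

```python
# Minimal Newton-Raphson root finder; names and tolerances are illustrative.
def newton_raphson(f, fprime, theta0, tol=1e-10, kmax=100):
    theta = theta0
    for k in range(kmax):
        step = f(theta) / fprime(theta)  # f(theta^(k)) / f'(theta^(k))
        theta -= step                    # theta^(k+1) = theta^(k) - step
        if abs(step) < tol:
            break
    return theta

# Root of f(theta) = theta^2 - 2, starting from theta^(0) = 2:
root = newton_raphson(lambda t: t**2 - 2, lambda t: 2 * t, 2.0)
```

Here convergence is declared when the update step itself becomes negligible, one common stopping rule among several.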

NR method convergence rate

Quadratic convergence: let θ* denote the solution. Then

  lim_{k→∞} |θ^(k+1) − θ*| / |θ^(k) − θ*|² = c  (rate c > 0, order 2).

The number of significant digits nearly doubles at each step (in a neighborhood of θ*).

Proof: by Taylor expansion (to second order) at θ^(k),

  0 = f(θ*) = f(θ^(k)) + f'(θ^(k))(θ* − θ^(k)) + (1/2) f''(ξ^(k))(θ* − θ^(k))²,  ξ^(k) ∈ [θ*, θ^(k)].

Dividing the equation by f'(θ^(k)) gives

  f(θ^(k))/f'(θ^(k)) + (θ* − θ^(k)) = −[f''(ξ^(k)) / (2 f'(θ^(k)))] (θ* − θ^(k))².

The definition θ^(k+1) = θ^(k) − f(θ^(k))/f'(θ^(k)) then gives

  θ^(k+1) − θ* = [f''(ξ^(k)) / (2 f'(θ^(k)))] (θ* − θ^(k))².

What conditions are needed?
- f'(θ^(k)) ≠ 0 in a neighborhood of θ*.
- f''(ξ^(k)) is bounded.
- The starting point is sufficiently close to the root θ*.
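The digit-doubling behavior is easy to see numerically; a small sketch on the toy function f(θ) = θ² − 2 (an illustrative example, not from the notes), tracking the error |θ^(k) − θ*|:

```python
import math

# Track the Newton-Raphson error on f(theta) = theta^2 - 2 (toy example).
theta, root = 2.0, math.sqrt(2.0)
errors = []
for _ in range(4):
    theta -= (theta**2 - 2) / (2 * theta)
    errors.append(abs(theta - root))

# errors[k+1] / errors[k]^2 approaches c = f''(root) / (2 f'(root)) = 1/(2*sqrt(2))
c_hat = errors[3] / errors[2]**2
```

The errors shrink roughly as the square of the previous error, and the empirical ratio c_hat settles near the theoretical constant f''(θ*)/(2 f'(θ*)).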

Review: maximum likelihood

Here is a list of definitions related to maximum likelihood estimation:
- Parameter: θ, a p-vector.
- Data: X.
- Log-likelihood: l(θ) = log Pr(X | θ).
- Score function: ∇l(θ) = (∂l/∂θ_1, ..., ∂l/∂θ_p).
- Hessian matrix: ∇²l(θ) = {∂²l/∂θ_i ∂θ_j}_{i,j=1,...,p}.
- Fisher information: I(θ) = −E ∇²l(θ) = E[∇l(θ) {∇l(θ)}ᵀ].
- Observed information: −∇²l(θ̂).

When θ* is a local maximum of l, ∇l(θ*) = 0 and ∇²l(θ*) is negative definite.

Application of the NR method to MLE: when θ is a scalar

Maximum likelihood estimation (MLE): θ̂ = arg max_θ l(θ).

Approach: find θ̂ such that l'(θ̂) = 0. If a closed-form solution of l'(θ̂) = 0 is difficult to obtain, one can use the NR method (replace f by l'). The NR update for solving the MLE is:

  θ^(k+1) = θ^(k) − l'(θ^(k)) / l''(θ^(k)).

What can go wrong?

- Bad starting point.
- May not converge to the global maximum.
- Saddle point: ∇l(θ̂) = 0, but ∇²l(θ̂) is neither negative definite nor positive definite (a stationary point but not a local extremum; checking the Hessian of the likelihood detects this).

[Figures: starting point and local extremum; saddle points of l(θ) = θ³ and l(θ₁, θ₂) = θ₁² − θ₂².]

Generalization to higher dimensions: when θ is a vector

General algorithm:
1. (Starting point) Pick a starting point θ^(0) and let k = 0.
2. (Iteration) Determine the direction d^(k) (a p-vector) and the step size α^(k) (a scalar), and calculate θ^(k+1) = θ^(k) + α^(k) d^(k), such that l(θ^(k+1)) > l(θ^(k)).
3. (Stopping criteria) Stop the iteration if
   |l(θ^(k+1)) − l(θ^(k))| / (|l(θ^(k))| + ε₁) < ε₂, or
   |θ^(k+1)_j − θ^(k)_j| / (|θ^(k)_j| + ε₁) < ε₂ for j = 1, ..., p,
   for precisions such as ε₁ = 10⁻⁴ and ε₂ = 10⁻⁶. Otherwise go to step 2.

Key: determine the direction and the step size.

Generalization to higher dimensions (continued)

Determining the direction (general framework, details later): we generally pick d^(k) = R⁻¹ ∇l(θ^(k)), where R is a positive definite matrix.

Choosing a step size (given the direction):
- Step halving: to find α^(k) such that l(θ^(k+1)) > l(θ^(k)), start at a large value of α^(k) and halve it until l(θ^(k+1)) > l(θ^(k)). Simple and robust, but relatively slow.
- Line search: to find α^(k) = arg max_α l(θ^(k) + α d^(k)), approximate l(θ^(k) + α d^(k)) by polynomial interpolation and find the α^(k) maximizing the polynomial. Fast.
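Step halving is simple to implement; a hedged Python sketch (the names l, theta, d, the starting α, and the toy objective are all illustrative choices):

```python
# Step halving: shrink alpha until the objective increases (illustrative names).
def step_halving(l, theta, d, alpha=1.0, min_alpha=1e-9):
    base = l(theta)
    while alpha > min_alpha:
        candidate = [t + alpha * di for t, di in zip(theta, d)]
        if l(candidate) > base:      # found l(theta^(k+1)) > l(theta^(k))
            return alpha
        alpha /= 2.0                 # halve and try again
    return 0.0                       # no increase found

# Toy concave objective with maximum at (0, 0), ascending along d = (-1, -1):
l = lambda th: -(th[0]**2 + th[1]**2)
alpha = step_halving(l, [1.0, 1.0], [-1.0, -1.0], alpha=4.0)
```

Starting from α = 4, the first two candidates overshoot, and halving twice lands on α = 1, which increases l.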

Polynomial interpolation

Given a set of p + 1 data points from the function f(α) ≡ l(θ^(k) + α d^(k)), we can find a unique polynomial of degree p that goes through the p + 1 data points. (For a quadratic approximation, we only need 3 data points.)
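For the quadratic case, three evaluations of f(α) determine a parabola whose vertex gives the trial step size; a sketch below, where the evaluation points 0, a, 2a are an assumed (convenient) choice:

```python
# Quadratic interpolation line search: fit a parabola through f(0), f(a), f(2a)
# and return its maximizer (evaluation points 0, a, 2a are an assumed choice).
def quad_interp_alpha(f, a=1.0):
    f0, f1, f2 = f(0.0), f(a), f(2 * a)
    denom = f0 - 2 * f1 + f2           # second difference; < 0 when concave
    if denom >= 0:                     # parabola not concave: fall back to best point
        return max((0.0, a, 2 * a), key=f)
    # vertex of the parabola through (0, f0), (a, f1), (2a, f2)
    return a * (3 * f0 - 4 * f1 + f2) / (2 * denom)

# Exact for a concave quadratic: f(alpha) = -(alpha - 1.5)^2 peaks at 1.5
alpha = quad_interp_alpha(lambda al: -(al - 1.5)**2)
```

When f is itself a concave quadratic, the interpolated vertex is the exact maximizer; otherwise it is only an approximation and is typically refined or safeguarded.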

Survey of basic methods

1. Steepest ascent: R = I, the identity matrix.
- d^(k) = ∇l(θ^(k))
- α^(k) = arg max_α l(θ^(k) + α ∇l(θ^(k))), or a small fixed number
- θ^(k+1) = θ^(k) + α^(k) ∇l(θ^(k))

Why is ∇l(θ^(k)) the steepest ascent direction? By Taylor expansion at θ^(k),

  l(θ^(k) + Δ) − l(θ^(k)) = Δᵀ ∇l(θ^(k)) + o(‖Δ‖).

By the Cauchy-Schwarz inequality,

  Δᵀ ∇l(θ^(k)) ≤ ‖Δ‖ ‖∇l(θ^(k))‖.

The equality holds at Δ = α ∇l(θ^(k)). So when Δ = α ∇l(θ^(k)), l(θ^(k) + Δ) increases the most.

- Easy to implement; only requires the first derivative (gradient/score).
- Guarantees an increase at each step no matter where you start.
- Converges slowly. The directions of two consecutive steps are orthogonal, so the algorithm zigzags toward the maximum.

Steepest ascent (continued)

When α^(k) is chosen as arg max_α l(θ^(k) + α ∇l(θ^(k))), the directions of two consecutive steps are orthogonal, i.e., [∇l(θ^(k))]ᵀ ∇l(θ^(k+1)) = 0.

Proof: by the definition of α^(k) and θ^(k+1),

  0 = ∂l(θ^(k) + α ∇l(θ^(k)))/∂α |_{α=α^(k)} = ∇l(θ^(k) + α^(k) ∇l(θ^(k)))ᵀ ∇l(θ^(k)) = ∇l(θ^(k+1))ᵀ ∇l(θ^(k)).

Example: steepest ascent

Maximize the function f(x) = 6x − x³.

[Figure: plot of f(x) = 6x − x³ for x in [−2, 2].]

Example: Steepest Ascent (cont.) 13/34 fun0 <- functon(x) return(- xˆ3 + 6*x) grd0 <- functon(x) return(- 3*xˆ2 + 6) # target functon # gradent # Steepest Ascent Algorthm Steepest_Ascent <- functon(x, fun=fun0, grd=grd0, step=0.01, kmax=1000, tol1=1e-6, tol2=1e-4) { dff <- 2*x # use a large value to get nto the followng "whle" loop k <- 0 # count teraton whle ( all(abs(dff) > tol1*(abs(x)+tol2) ) & k <= kmax) # stop crtera { g_x <- grd(x) # calculate gradent usng x dff <- step * g_x # calculate the dfference used n the stop crtera x <- x + dff # update x k <- k + 1 # update teraton } f_x = fun(x) } return(lst(teraton=k, x=x, f_x=f_x, g_x=g_x))

Example: Steepest Ascent (cont.) 14/34 > Steepest_Ascent(x=2, step=0.01) $teraton [1] 117 $x [1] 1.414228 $f_x [1] 5.656854 $g_x [1] -0.0001380379 > Steepest_Ascent(x=1, step=-0.01) $teraton [1] 159 $x [1] -1.414199 $f_x [1] -5.656854 $g_x [1] 0.0001370128

For large datasets

The data log-likelihood is usually a sum over n observations: l(θ) = Σ_{i=1}^n l(x_i; θ). When n is large, this poses a computational burden. One can implement a stochastic version of the algorithm: stochastic gradient descent (SGD). (Note: gradient descent is just steepest descent.)
- Simple SGD algorithm: replace the gradient ∇l(θ) by the gradient computed from a single observation, ∇l(x_i; θ), where x_i is randomly sampled.
- Mini-batch SGD algorithm: compute the gradient based on a small number of observations.

Advantages of SGD:
- Evaluates the gradient at one (or a few) observations, so it requires less memory.
- Better at escaping local minima (the gradient is noisy).

Disadvantage of SGD: slower convergence.
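A minimal mini-batch SGD sketch on a toy squared-error objective whose minimizer is the sample mean (the loss, batch size, learning rate, and step count are illustrative choices, not from the notes):

```python
import random

# Mini-batch SGD on l(theta) = sum_i (x_i - theta)^2; the minimizer is mean(xs).
def sgd_mean(xs, theta=0.0, lr=0.1, batch=4, steps=200, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        sample = rng.sample(xs, batch)                       # random mini-batch
        grad = sum(2 * (theta - x) for x in sample) / batch  # noisy gradient
        theta -= lr * grad                                   # descent step
    return theta

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
theta_hat = sgd_mean(xs)   # wanders near the sample mean 3.5
```

With a fixed learning rate the iterate never settles exactly; it fluctuates around the minimizer, which is the noisy behavior (and slower convergence) the slide mentions.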

Survey of basic methods (continued)

2. Newton-Raphson: R = −∇²l(θ^(k)) = observed information.
- d^(k) = [−∇²l(θ^(k))]⁻¹ ∇l(θ^(k))
- θ^(k+1) = θ^(k) + [−∇²l(θ^(k))]⁻¹ ∇l(θ^(k))
- α^(k) = 1 for all k
- Fast, quadratic convergence.
- Needs very good starting points.

Theorem: if R is positive definite, the equation system R d^(k) = ∇l(θ^(k)) has a unique solution for the direction d^(k), and the direction ensures ascent of l(θ).

Proof: when R is positive definite, it is invertible, so we have a unique solution d^(k) = R⁻¹ ∇l(θ^(k)). Let θ^(k+1) = θ^(k) + α d^(k) = θ^(k) + α R⁻¹ ∇l(θ^(k)). By Taylor expansion,

  l(θ^(k+1)) ≈ l(θ^(k)) + α ∇l(θ^(k))ᵀ R⁻¹ ∇l(θ^(k)).

The positive definite matrix R (hence R⁻¹) ensures that l(θ^(k+1)) > l(θ^(k)) for sufficiently small positive α.
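One multivariate NR iteration amounts to solving R d = ∇l with R = −∇²l and taking α = 1; a sketch on an assumed quadratic toy log-likelihood l(θ) = −(θ₁ − 1)² − 2(θ₂ + 3)², for which a single step reaches the maximizer exactly:

```python
import numpy as np

# One Newton-Raphson step: solve (-Hessian) d = gradient, then theta + d.
def nr_step(theta, grad, neg_hess):
    d = np.linalg.solve(neg_hess(theta), grad(theta))  # R d = grad(l)
    return theta + d                                   # alpha^(k) = 1

# Toy quadratic log-likelihood l = -(t1 - 1)^2 - 2*(t2 + 3)^2 (illustrative).
grad = lambda th: np.array([-2.0 * (th[0] - 1.0), -4.0 * (th[1] + 3.0)])
neg_hess = lambda th: np.diag([2.0, 4.0])
theta1 = nr_step(np.array([0.0, 0.0]), grad, neg_hess)  # lands at (1, -3)
```

For an exactly quadratic l the Taylor expansion is exact, which is why one NR step suffices here; for general l this is only the local behavior near the maximum.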

Newton-Raphson vs. steepest ascent

- Newton-Raphson converges much faster than steepest ascent (gradient descent).
- NR requires computing the second derivative, which can be difficult and computationally expensive. In contrast, gradient descent requires only the first derivative, which is easy to compute.
- For poorly behaved (non-convex) objective functions, gradient-based methods are often more stable.
- Gradient-based methods (especially SGD) are widely used in modern machine learning.

Example: Newton Raphson 18/34 fun0 <- functon(x) return(- xˆ3 + 6*x) grd0 <- functon(x) return(- 3*xˆ2 + 6) hes0 <- functon(x) return(- 6*x) # target functon # gradent # Hessan # Newton-Raphson Algorthm Newton_Raphson <- functon(x, fun=fun0, grd=grd0, hes=hes0, kmax=1000, tol1=1e-6, tol2=1e-4) { dff <- 2*x k <- 0 whle ( all(abs(dff) > tol1*(abs(x)+tol2) ) & k <= kmax) { g_x <- grd(x) h_x <- hes(x) # calculate the second dervatve (Hessan) dff <- -g_x/h_x # calculate the dfference used by the stop crtera x <- x + dff k <- k + 1 } f_x = fun(x) } return(lst(teraton=k, x=x, f_x=f_x, g_x=g_x, h_x=h_x))

Example: Newton Raphson 19/34 > Newton_Raphson(x=2) $teraton [1] 5 $x [1] 1.414214 $f_x [1] 5.656854 $g_x [1] -1.353229e-11 $h_x [1] -8.485281 > Newton_Raphson(x=1) $teraton [1] 5 $x [1] 1.414214 $f_x [1] 5.656854 $g_x [1] -1.353229e-11 $h_x [1] -8.485281

Survey of basic methods (continued)

3. Modifications of Newton-Raphson.

Fisher scoring: replace −∇²l(θ) with E[−∇²l(θ)].
- E[−∇²l(θ)] = E[∇l(θ) ∇l(θ)ᵀ] is always positive (semi-)definite, which stabilizes the algorithm.
- E[−∇²l(θ)] can have a simpler form than ∇²l(θ).
- Newton-Raphson and Fisher scoring are equivalent for parameter estimation in GLMs with a canonical link.

Quasi-Newton: aka variable metric methods or secant methods.
- Approximate ∇²l(θ) in a way that avoids calculating the Hessian and its inverse.
- Has convergence properties similar to Newton's method.

Fisher scoring: example

In a Poisson regression model with n subjects:
- Responses: Y_i ~ Poisson(λ_i), i.e., Pr(Y_i) = (Y_i!)⁻¹ λ_i^{Y_i} e^{−λ_i}.
- We know that λ_i = E(Y_i | X_i). We relate the mean of Y_i to X_i by g(λ_i) = X_i β. Taking derivatives of both sides, g'(λ_i) ∂λ_i/∂β = X_i, so ∂λ_i/∂β = X_i / g'(λ_i).
- Log-likelihood: l(β) = Σ_{i=1}^n (Y_i log λ_i − λ_i), where the λ_i satisfy g(λ_i) = X_i β.
- Maximum likelihood estimation: β̂ = arg max_β l(β).

Newton-Raphson needs (writing X_i as a scalar covariate)

  ∇l(β) = Σ_i (Y_i/λ_i − 1) ∂λ_i/∂β = Σ_i (Y_i/λ_i − 1) X_i / g'(λ_i)

  ∇²l(β) = −Σ_i (Y_i/λ_i²) X_i²/g'(λ_i)² − Σ_i (Y_i/λ_i − 1) g''(λ_i) X_i²/g'(λ_i)³
         = −Σ_i X_i²/(λ_i g'(λ_i)²) − Σ_i (Y_i/λ_i − 1) X_i²/(λ_i g'(λ_i)²) − Σ_i (Y_i/λ_i − 1) g''(λ_i) X_i²/g'(λ_i)³

Fisher scoring: example (continued)

Fisher scoring needs ∇l(β) and

  E[−∇²l(β)] = Σ_i X_i² / (λ_i g'(λ_i)²),

which is −∇²l(β) without the extra terms (the terms involving Y_i/λ_i − 1 have expectation zero).

With the canonical link for Poisson regression, g(λ_i) = log λ_i, we have g'(λ_i) = λ_i⁻¹ and g''(λ_i) = −λ_i⁻². The extra terms then equal zero (check this!), and we conclude that Newton-Raphson and Fisher scoring are equivalent.

Quasi-Newton

1. Davidon-Fletcher-Powell (DFP) QNR algorithm. Let Δl^(k) = ∇l(θ^(k)) − ∇l(θ^(k−1)) and Δθ^(k) = θ^(k) − θ^(k−1). Approximate the negative Hessian by

  G^(k+1) = G^(k) + Δθ^(k)(Δθ^(k))ᵀ / ((Δθ^(k))ᵀ Δl^(k)) − G^(k) Δl^(k)(Δl^(k))ᵀ G^(k) / ((Δl^(k))ᵀ G^(k) Δl^(k)).

Use the starting matrix G^(0) = I.

Theorem: if the starting matrix G^(0) is positive definite, the above formula ensures that every G^(k) during the iteration is positive definite.

Nonlinear regression models

Data: (x_i, y_i) for i = 1, ..., n.

Notation and assumptions:
- Model: y_i = h(x_i, β) + ε_i, where ε_i iid ~ N(0, σ²) and h(·) is known.
- Residual: e_i(β) = y_i − h(x_i, β).
- Jacobian: {J(β)}_{ij} = ∂h(x_i, β)/∂β_j = −∂e_i(β)/∂β_j, an n × p matrix.
- Goal: obtain the MLE β̂ = arg min_β S(β), where S(β) = Σ_i {y_i − h(x_i, β)}² = [e(β)]ᵀ e(β).

We could use the previously discussed Newton-Raphson algorithm:
- Gradient: g_j(β) = ∂S(β)/∂β_j = 2 Σ_i e_i(β) ∂e_i(β)/∂β_j, i.e., g(β) = −2 J(β)ᵀ e(β).
- Hessian: H_{jr}(β) = ∂²S(β)/∂β_j ∂β_r = 2 Σ_i { e_i(β) ∂²e_i(β)/∂β_j ∂β_r + (∂e_i(β)/∂β_j)(∂e_i(β)/∂β_r) }.

Problem: the Hessian can be hard to obtain.

Gauss-Newton algorithm

Recall that in linear regression models we minimize S(β) = Σ_i {y_i − x_iᵀβ}². Because S(β) is a quadratic function, it is easy to get the MLE: β̂ = (Σ_i x_i x_iᵀ)⁻¹ Σ_i x_i y_i.

Now in nonlinear regression models, we want to minimize S(β) = Σ_i {y_i − h(x_i, β)}².

Idea: approximate h(x_i, β) by a linear function, iteratively at β^(k). Given β^(k), by Taylor expansion of h(x_i, β) at β^(k), S(β) becomes

  S(β) ≈ Σ_i { y_i − h(x_i, β^(k)) − (β − β^(k))ᵀ ∂h(x_i, β^(k))/∂β }².

Gauss-Newton algorithm (cont.)

1. Find a good starting point β^(0).
2. At step k + 1:
   (a) Form e(β^(k)) and J(β^(k)).
   (b) Use a standard linear regression routine to obtain δ^(k) = [J(β^(k))ᵀ J(β^(k))]⁻¹ J(β^(k))ᵀ e(β^(k)).
   (c) Obtain the new estimate β^(k+1) = β^(k) + δ^(k).

Remarks:
- No need to compute the Hessian matrix.
- Needs good starting values.
- Requires J(β^(k))ᵀ J(β^(k)) to be invertible.
- This is not a general optimization method; it is only applicable to least-squares problems.
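The steps above can be sketched for an assumed exponential model h(x, β) = β₀ e^{β₁x}; the model, data, and starting values are illustrative choices, not from the notes:

```python
import numpy as np

# Gauss-Newton for h(x, beta) = b0 * exp(b1 * x); J is the Jacobian of h.
def gauss_newton(x, y, beta, steps=50):
    for _ in range(steps):
        b0, b1 = beta
        h = b0 * np.exp(b1 * x)
        e = y - h                                        # residuals e(beta)
        J = np.column_stack([np.exp(b1 * x),             # dh/db0
                             b0 * x * np.exp(b1 * x)])   # dh/db1
        delta = np.linalg.solve(J.T @ J, J.T @ e)        # linear LS step
        beta = beta + delta                              # beta^(k+1)
    return beta

# Noise-free data generated from beta = (2, 0.5); Gauss-Newton should recover it.
x = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(0.5 * x)
beta_hat = gauss_newton(x, y, beta=np.array([1.5, 0.6]))
```

Step (b) is exactly a weighted-free least-squares fit of the residuals on the Jacobian columns, which is why a standard linear regression routine suffices.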

Example: generalized linear models (GLM)

Data: (y_i, x_i) for i = 1, ..., n.

Notation and assumptions:
- Mean: E(y | x) = µ.
- Link g: g(µ) = xᵀβ.
- Variance function V: Var(y | x) = φ V(µ).
- Log-likelihood (exponential family): l(θ, φ; y) = {yθ − b(θ)}/a(φ) + c(y, φ).

We obtain:
- Score function: ∂l/∂θ = {y − b'(θ)}/a(φ).
- Observed information: −∂²l/∂θ² = b''(θ)/a(φ).
- Mean (in terms of θ): E(y | x) = a(φ) E(∂l/∂θ) + b'(θ) = b'(θ).
- Variance (in terms of θ, φ): Var(y | x) = E{y − b'(θ)}² = a(φ)² E[(∂l/∂θ)²] = −a(φ)² E(∂²l/∂θ²) = b''(θ) a(φ).
- Canonical link: g such that g(µ) = θ, i.e., g⁻¹ = b'.
- Generally we have a(φ) = φ/w, in which case φ drops out of the following.

GLM 28/34 Model Normal Posson Bnomal Gamma φ σ 2 1 1/m 1/ν b(θ) θ 2 /2 exp(θ) log(1 + e θ ) log( θ) µ θ exp(θ) e θ /(1 + e θ ) 1/θ Canoncal lnk g dentty log logt recprocal Varance functon V 1 µ µ(1 µ) µ 2

Towards iteratively reweighted least squares

In linear regression models, E(y_i | x_i) = x_iᵀβ, so we minimize S(β) = Σ_i {y_i − x_iᵀβ}². Because S(β) is a quadratic function, it is easy to get the MLE: β̂ = (Σ_i x_i x_iᵀ)⁻¹ Σ_i x_i y_i.

In generalized linear models, consider constructing a similar quadratic function S(β).

Question: can we use S(β) = Σ_i {g(y_i) − x_iᵀβ}²?
Answer: no, because E{g(y_i) | x_i} ≠ x_iᵀβ.

Idea: approximate g(y_i) by a linear function with expectation x_iᵀβ^(k), iteratively at β^(k).

Iteratively reweighted least squares

Linearize g(y_i) around µ̂_i^(k) = g⁻¹(x_iᵀβ^(k)), and denote the linearized value by ỹ_i^(k):

  ỹ_i^(k) = g(µ̂_i^(k)) + (y_i − µ̂_i^(k)) g'(µ̂_i^(k)).

Check the variances of ỹ_i^(k) and use them as weights:

  W_i^(k) = {Var(ỹ_i^(k))}⁻¹ = [{g'(µ̂_i^(k))}² V(µ̂_i^(k))]⁻¹.

Given β^(k), we consider minimizing

  S(β) = Σ_i W_i^(k) {ỹ_i^(k) − x_iᵀβ}².

IRLS algorithm:
1. Start with initial estimates, generally µ̂_i^(0) = y_i.
2. Form ỹ_i^(k) and W_i^(k).
3. Estimate β^(k+1) by regressing ỹ_i^(k) on x_i with weights W_i^(k).
4. Form µ̂_i^(k+1) = g⁻¹(x_iᵀβ^(k+1)) and return to step 2.

Iteratively reweighted least squares (continued)

Model          Poisson   Binomial         Gamma
µ = g⁻¹(η)     e^η       e^η/(1 + e^η)    1/η
g'(µ)          1/µ       1/[µ(1 − µ)]     −1/µ²
V(µ)           µ         µ(1 − µ)         µ²

McCullagh and Nelder (1983) justified IRLS by showing that IRLS is equivalent to Fisher scoring. In the case of the canonical link, IRLS is also equivalent to Newton-Raphson. IRLS is attractive because no special optimization algorithm is required, just a subroutine that computes weighted least-squares estimates.
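Steps 1-4 can be sketched for Poisson regression with the canonical log link, where g'(µ) = 1/µ and V(µ) = µ give weights W_i = µ_i (the data and starting values below are illustrative choices):

```python
import numpy as np

# IRLS for Poisson regression with log link: z is the linearized response y~,
# and w = [g'(mu)^2 V(mu)]^{-1} = mu are the weights.
def irls_poisson(X, y, iters=25):
    mu = y + 0.5                       # initial mu_hat, kept away from zero
    for _ in range(iters):
        eta = np.log(mu)               # g(mu)
        z = eta + (y - mu) / mu        # y~ = g(mu) + (y - mu) g'(mu)
        w = mu                         # weights W
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))  # weighted LS fit
        mu = np.exp(X @ beta)          # mu^(k+1) = g^{-1}(x' beta)
    return beta

# y = 2^x exactly on a log-linear curve, so the MLE is beta = (0, log 2).
X = np.column_stack([np.ones(4), np.arange(4.0)])
y = np.array([1.0, 2.0, 4.0, 8.0])
beta_hat = irls_poisson(X, y)
```

The inner step is an ordinary weighted least-squares solve, illustrating the slide's point that IRLS needs no special optimizer.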

Miscellaneous things

- Dispersion parameter: when we do not take φ = 1, the usual estimate is via the method of moments:

  φ̂ = 1/(n − p) Σ_i (y_i − µ̂_i)² / V(µ̂_i).

- Standard errors: Var(β̂) = φ̂ (Xᵀ Ŵ X)⁻¹.
- Quasi-likelihood: pick a link and a variance function, and IRLS can proceed without worrying about the full model. In other words, IRLS is a good thing!

A quick review

- Optimization methods are important in statistics (e.g., to find the MLE) and in machine learning generally (to minimize some loss function).
- Maximizing/minimizing an objective function is achieved by solving the equation "first derivative = 0" (and checking the second derivative).
- Steepest ascent method: only needs the gradient; slow convergence. For large datasets with ill-behaved objective functions, the stochastic version (SGD) usually works better.

- Newton-Raphson (NR) method: quadratic convergence rate, but can get stuck in a local maximum. In higher dimensions, the problems are to find the direction and the step size at each iteration.
- Fisher scoring: uses the expected information matrix, whereas NR uses the observed information matrix. The expected information is more stable and often simpler. Fisher scoring and Newton-Raphson are equivalent under the canonical link.
- Gauss-Newton algorithm for nonlinear regression: the Hessian matrix is not needed.