Generalized Linear Methods

1 Introduction

In ensemble methods the general idea is that by using a combination of several weak learners one can build a better learner. More formally, assume that we have a set of weak learners, {g_t(x)}_{t=1}^T, i.e. we have trained these learners on the training data {(x_i, y_i)}_{i=1}^n, with associated weights {w_i}_{i=1}^n. We can create a strong model by taking a linear combination of all these weak learners. We define the weighted output as

G_T(x) = sgn( sum_{t=1}^T α_t g_t(x) ),

where sgn(.) is the sign function. One of the most famous methods for this kind of boosting is called AdaBoost. Based on AdaBoost one can develop methods for gradient boosting, especially for trees. Here we first explain AdaBoost.

2 AdaBoost

Mainly introduced in [1], AdaBoost is considered one of the most important steps in statistical learning. One can show that AdaBoost minimizes an upper bound on the empirical error. To be precise, the empirical error (for classification) is defined as

P̂(Y G_T(X) ≤ 0) = (1/n) |{i : G(x_i) ≠ y_i}|,

which can be upper-bounded by the product of the normalization constants:

P̂(Y G_T(X) ≤ 0) ≤ prod_{t=1}^T Z_t.
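As a minimal illustration of the combined classifier G_T defined above, the following Python sketch computes the weighted vote; the weak learners are assumed to be callables returning ±1 labels, and all names here are illustrative rather than taken from these notes.

import numpy as np

def combined_predict(weak_learners, alphas, X):
    """Weighted-vote prediction G_T(x) = sign(sum_t alpha_t * g_t(x))."""
    score = np.zeros(X.shape[0])
    for alpha, g in zip(alphas, weak_learners):
        score += alpha * g(X)
    return np.sign(score)

# Two toy "weak learners" (threshold rules on the first feature):
g1 = lambda X: np.where(X[:, 0] > 0.0, 1, -1)
g2 = lambda X: np.where(X[:, 0] > 1.0, 1, -1)
X = np.array([[-0.5], [0.5], [1.5]])
print(combined_predict([g1, g2], [0.7, 0.3], X))   # [-1.  1.  1.]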

Algorithm 1: AdaBoost.

Input: The set of weak learners, {g_t(x)}_{t=1}^T
Output: The weights of the generalized learner, {α_t}_{t=1}^T

Initialize the weights: w_i^(1) = 1/n, i = 1, ..., n.
for t = 1 to T do
    Fit the learner g_t(x) to the training data {(x_i, y_i)}_{i=1}^n, weighted by {w_i^(t)}_{i=1}^n.
    Choose α_t ∈ R.
    Update w_i^(t+1) = w_i^(t) exp[-α_t y_i g_t(x_i)] / Z_t, for i = 1, ..., n, where Z_t is a normalization factor.
Return the output G(x).

One question left unanswered so far is: how do we choose α_t? One option is to choose it in a greedy fashion to minimize Z_t(α) at each step. Since Z_t(α) is a convex function, it has a unique minimum. Given that g_t(x) ∈ {±1}, the greedy choice of α_t gives the following answer:

ε_t = sum_{i=1}^n w_i^(t) I(y_i ≠ g_t(x_i))   (1)

α_t = (1/2) log((1 - ε_t)/ε_t)   (2)

With this choice we can give a guarantee on the empirical error of this algorithm, as stated by the next theorem.

Theorem 1. The procedure given in Algorithm 1, with the choice of α_t in Equations (1)-(2), and with ε_t ≤ 1/2 - γ for all t and some γ > 0, results in the bound

P̂(Y G_T(X) ≤ 0) ≤ δ   (3)

for any arbitrary δ > 0, provided T is big enough; more precisely, provided T ≥ ln(1/δ) / (2γ²).

We now give the proof of Theorem 1. If you do not feel like reading a relatively boring proof, you can skip to the next section! We break the proof into smaller pieces. First we prove the following lemma.
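The following is a minimal Python sketch of Algorithm 1 with the greedy choice of α_t from Equations (1)-(2), using one-feature threshold stumps as weak learners. All function and variable names (fit_stump, adaboost_fit, and so on) are illustrative, not part of the original notes.

import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: choose the (feature, threshold, sign) triple
    minimizing the weighted error sum_i w_i * I(y_i != g(x_i))."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    _, j, thr, sign = best
    return lambda Z, j=j, thr=thr, sign=sign: np.where(Z[:, j] > thr, sign, -sign)

def adaboost_fit(X, y, T):
    """Algorithm 1: returns the weak learners g_t and their weights alpha_t."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # w_i^(1) = 1/n
    learners, alphas = [], []
    for t in range(T):
        g = fit_stump(X, y, w)                 # fit g_t to the weighted data
        pred = g(X)
        eps = np.sum(w * (pred != y))          # Equation (1)
        eps = np.clip(eps, 1e-12, 1 - 1e-12)   # guard against eps = 0 or 1
        alpha = 0.5 * np.log((1 - eps) / eps)  # Equation (2)
        w = w * np.exp(-alpha * y * pred)      # weight update ...
        w = w / w.sum()                        # ... normalized by Z_t
        learners.append(g)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    score = sum(a * g(X) for a, g in zip(alphas, learners))
    return np.sign(score)

A call such as adaboost_fit(X_train, y_train, T=50) followed by adaboost_predict(learners, alphas, X_test) then gives the sign-vote prediction G_T described above.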

Lemma 1. The procedure given in Algorithm 1, with the choice of α_t in Equations (1)-(2), satisfies the following bound:

P̂(Y G_T(X) ≤ 0) = (1/n) |{i : G(x_i) ≠ y_i}| ≤ prod_{t=1}^T Z_t.   (4)

Proof. The event Y G_T(X) ≤ 0 is equivalent to exp(-Y G_T(X)) ≥ 1. It is easy to see that:

P̂(Y G_T(X) ≤ 0) = (1/n) sum_{i=1}^n I{G(x_i) ≠ y_i}   (5)
    ≤ (1/n) sum_{i=1}^n exp(-y_i G(x_i))   (6)
    = (1/n) sum_{i=1}^n exp(-y_i sum_{t=1}^T α_t g_t(x_i))   (7)
    = (1/n) sum_{i=1}^n prod_{t=1}^T exp(-y_i α_t g_t(x_i)).   (8)

Using the weight updates w_i^(t+1) = w_i^(t) exp(-y_i α_t g_t(x_i)) / Z_t in Algorithm 1, we have exp(-y_i α_t g_t(x_i)) = Z_t w_i^(t+1) / w_i^(t), so that

prod_{t=1}^T exp(-y_i α_t g_t(x_i)) = (prod_{t=1}^T Z_t) w_i^(T+1) / w_i^(1),   (9)

and therefore

P̂(Y G_T(X) ≤ 0) ≤ (1/n) sum_{i=1}^n (prod_{t=1}^T Z_t) w_i^(T+1) / w_i^(1).   (10)

Since we chose w_i^(1) = 1/n, and w^(T+1) is a proper probability distribution,

P̂(Y G_T(X) ≤ 0) ≤ (prod_{t=1}^T Z_t) sum_{i=1}^n w_i^(T+1)   (11)
    = prod_{t=1}^T Z_t.   (12)
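As a quick numerical sanity check of the bound (4), one can track the product of the normalization factors during training and compare it with the final training error. A sketch along these lines, reusing the hypothetical fit_stump and adaboost_predict helpers from the sketch above (so it is not fully self-contained):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # a simple synthetic problem

n = len(y)
w = np.full(n, 1.0 / n)
Z_prod, learners, alphas = 1.0, [], []
for t in range(20):
    g = fit_stump(X, y, w)
    pred = g(X)
    eps = np.clip(np.sum(w * (pred != y)), 1e-12, 1 - 1e-12)
    alpha = 0.5 * np.log((1 - eps) / eps)
    Z_prod *= 2 * np.sqrt(eps * (1 - eps))     # Z_t at the greedy alpha_t (see Lemma 2)
    w = w * np.exp(-alpha * y * pred)
    w = w / w.sum()
    learners.append(g)
    alphas.append(alpha)

train_err = np.mean(adaboost_predict(learners, alphas, X) != y)
print(train_err, "<=", Z_prod)                 # empirical error <= prod_t Z_t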

Lemma 2. In the procedure given in Algorithm 1, the greedy choice of α_t to minimize Z_t results in

α_t = (1/2) log((1 - ε_t)/ε_t)   and   Z_t = 2 sqrt(ε_t (1 - ε_t)),

where ε_t = sum_{i=1}^n w_i^(t) I(y_i ≠ g_t(x_i)).

Proof. The proof is easy. Let us write the definition of the normalization constant explicitly:

Z_t = sum_{i=1}^n w_i^(t) exp(-α_t y_i g_t(x_i))
    = sum_{i: y_i = g_t(x_i)} w_i^(t) exp(-α_t) + sum_{i: y_i ≠ g_t(x_i)} w_i^(t) exp(α_t)
    = (1 - ε_t) exp(-α_t) + ε_t exp(α_t).

Z_t is a convex function of α_t and has a unique minimum. By taking the derivative and setting it to zero we find the minimizer, which is

α_t = (1/2) log((1 - ε_t)/ε_t).

Plugging this into the definition of Z_t, we find its minimum value:

Z_t = 2 sqrt(ε_t (1 - ε_t)).

Lemma 3. Combining the results of Lemma 1 and Lemma 2, and given that ε_t ≤ 1/2 - γ for all t, the bound on the empirical error is at most (1 - 4γ²)^{T/2}, where γ ∈ (0, 0.5).

Proof. By Lemma 1 and Lemma 2, we have proven that

P̂(Y G_T(X) ≤ 0) ≤ prod_{t=1}^T 2 sqrt(ε_t (1 - ε_t)).

Given that γ ∈ (0, 0.5) and ε_t ≤ 1/2 - γ, each factor satisfies 2 sqrt(ε_t (1 - ε_t)) ≤ sqrt(1 - 4γ²), so

P̂(Y G_T(X) ≤ 0) ≤ (1 - 4γ²)^{T/2}.
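Lemma 2's closed form is easy to check numerically: minimize Z_t(α) = (1 - ε_t) exp(-α) + ε_t exp(α) directly and compare with the formulas above. A minimal sketch for an example value ε_t = 0.3, assuming SciPy is available:

import numpy as np
from scipy.optimize import minimize_scalar

eps = 0.3                                      # an example weighted error eps_t
Z = lambda a: (1 - eps) * np.exp(-a) + eps * np.exp(a)

res = minimize_scalar(Z)                       # numerical minimizer of the convex Z_t(alpha)
alpha_closed = 0.5 * np.log((1 - eps) / eps)   # Lemma 2: alpha_t
Z_closed = 2 * np.sqrt(eps * (1 - eps))        # Lemma 2: minimum value of Z_t

print(res.x, alpha_closed)                     # both approximately 0.4236
print(res.fun, Z_closed)                       # both approximately 0.9165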

Now we have everything we need for the proof of Theorem 1.

Proof of Theorem 1. Given the result of Lemma 3, we have

P̂(Y G_T(X) ≤ 0) ≤ (1 - 4γ²)^{T/2}.

Requiring the right-hand side to be at most δ, and using 1 - 4γ² ≤ exp(-4γ²) so that (1 - 4γ²)^{T/2} ≤ exp(-2γ²T), it suffices that exp(-2γ²T) ≤ δ, which is equivalent to

T ≥ ln(1/δ) / (2γ²).

(For example, γ = 0.1 and δ = 0.01 require T ≥ ln(100)/0.02 ≈ 230.3, i.e. T = 231 rounds suffice.)

2.1 AdaBoost as minimizing a global objective function

An alternative view of the procedure above is as the minimization of a global objective function with coordinate-descent (greedy) updates. It can be shown that the global objective is

L = (1/n) sum_{i=1}^n exp(-y_i sum_t α_t g_t(x_i)),

which AdaBoost optimizes locally with respect to one α_t at a time.

2.2 More general AdaBoost

As shown above, the standard form of AdaBoost can be interpreted as minimizing an exponential loss function. One can derive a more general form of AdaBoost for an arbitrary loss function, but due to their computational cost these general forms are not very popular.

3 Gradient Boosting

Building on boosting, a family of gradient boosting methods has been proposed. Gradient boosting methods use both boosting and gradient methods, especially gradient descent (at the functional level). Using only gradient methods has cons, e.g. a negative effect on generalization error and convergence to local optima. The gbm package in R, however, suggests selecting a function that uses the covariate information to approximate the gradient, usually a regression tree.
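Tying this back to the global-objective view of Section 2.1, the exponential loss is straightforward to evaluate; a minimal sketch with the same illustrative names as before (weak learners as ±1-valued callables):

import numpy as np

def exponential_loss(learners, alphas, X, y):
    """Global AdaBoost objective (1/n) * sum_i exp(-y_i * sum_t alpha_t * g_t(x_i)).
    Coordinate descent on one alpha_t at a time recovers the AdaBoost update."""
    score = sum(a * g(X) for a, g in zip(alphas, learners))
    return np.mean(np.exp(-y * score))

One can check on a small dataset that after T rounds of Algorithm 1 this quantity equals prod_{t=1}^T Z_t, which is exactly the bound of Lemma 1.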

The algorithm is shown in Algorithm 3. At each iteration the algorithm determines the gradient, i.e. the direction in which it needs to improve the approximation to better fit the data, by selecting from a class of functions. In other words, it selects the function that has the most agreement with the current approximation error.

To recall the problem formally, we want to find a function F such that

G*(x) = arg min_{F ∈ G} E_{x,y}[L(y, F(x))] = arg min_{F ∈ G} ∫ L(y, F(x)) p(x, y) dx dy.

In practice, however, the true distribution P(x, y) is not known; instead we have samples from it, in the form D = {(x_i, y_i)}_{i=1}^n. The set of functions we can choose from is also limited, which we represent with G. Thus the problem is approximated in the following form:

G*(x) ≈ arg min_{F ∈ G} sum_{(x_i, y_i) ∈ D} L(y_i, F(x_i)).

Suppose a good approximation can be written as a linear combination of some coarse approximations:

sum_{t=1}^T α_t g_t(x).

Suppose the following is our initial approximation, a constant function:

G_0(x) = arg min_α sum_{(x_i, y_i) ∈ D} L(y_i, α),

followed by the incremental approximations

G_m(x) = G_{m-1}(x) + arg min_{g ∈ G} sum_{(x_i, y_i) ∈ D} L(y_i, G_{m-1}(x_i) + g(x_i)).

Since the minimization in the previous equation is done over functions (functional minimization), it is relatively hard to solve. Instead we can approximate it with greedy (functional-)gradient-based updates. The negative functional gradient of the loss function, -∇_g L(y, G(x) + g(x)), is the direction in which the loss function decreases the most.
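As a concrete example, for the squared loss L(y, F) = (1/2)(y - F)², the negative functional gradient evaluated at the training points is simply the residual,

-∂L(y_i, G(x_i)) / ∂G(x_i) = y_i - G(x_i),

so each step fits the next base learner to the current residuals; for the absolute loss L(y, F) = |y - F| it is the sign of the residual instead.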

Thus, following these updates with an appropriate choice of step size will result in a reduction:

G_m(x_i) = G_{m-1}(x_i) - γ_m ∇_g L(y_i, G_{m-1}(x_i)),   x_i ∈ D.

One possible way to find the step size is via line search:

γ_m = arg min_γ sum_{x_i ∈ D} L(y_i, G_{m-1}(x_i) - γ ∇_g L(y_i, G_{m-1}(x_i))).

Algorithm 2: Gradient boosting algorithm.

Input: The set of weak learners, {g_t(x)}_{t=1}^T
Output: The weights of the generalized learner, {α_t}_{t=1}^T

Initialize the approximation with a constant, G_0(x) = arg min_γ sum_i L(y_i, γ).
for m = 1 to M do
    Compute the negative gradient r_m = (r_1m, r_2m, ..., r_nm), where
        r_im = -[∂L(y_i, G(x_i)) / ∂G(x_i)]_{G(x) = G_{m-1}(x)}.
    Fit a function h_m(x) to the gradient residuals {(x_i, r_im)}_{i=1}^n.
    Find the scaling parameter γ_m that minimizes the objective:
        γ_m = arg min_γ sum_{x_i ∈ D} L(y_i, G_{m-1}(x_i) + γ h_m(x_i)).
    Update: G_m(x) = G_{m-1}(x) + γ_m h_m(x).
Return the output G_M(x).

Suppose we want to generalize Algorithm 2 to trees. The only difference is that the approximating function h_m(x) is a tree, which we can represent as sum_j b_j I(x ∈ R_j), where I(.) is an indicator function showing whether the input x belongs to the region R_j or not, and b_j is the prediction for the values in that region. In the following step, a value of γ_m is estimated via line search.
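The line search for γ_m is a one-dimensional problem and can be solved numerically. A minimal sketch using SciPy, where y, G_prev, and h_m are assumed to be arrays of targets, current-model predictions, and base-learner predictions at the training points (all names illustrative):

import numpy as np
from scipy.optimize import minimize_scalar

def line_search_gamma(loss, y, G_prev, h_m):
    """gamma_m = argmin_gamma sum_i L(y_i, G_{m-1}(x_i) + gamma * h_m(x_i))."""
    objective = lambda gamma: np.sum(loss(y, G_prev + gamma * h_m))
    return minimize_scalar(objective).x

# Example with squared loss, for which the optimum is also available in closed form:
sq_loss = lambda y, F: 0.5 * (y - F) ** 2
y = np.array([1.0, 2.0, 3.0])
G_prev = np.zeros(3)
h_m = np.array([0.5, 1.0, 1.5])
print(line_search_gamma(sq_loss, y, G_prev, h_m))   # approximately 2.0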

In [2] it is suggested to use a separate value for each region; in other words, to change

γ_m = arg min_γ sum_{x_i ∈ D} L(y_i, G_{m-1}(x_i) + γ h_m(x_i))

to

γ_jm = arg min_γ sum_{x_i ∈ R_jm} L(y_i, G_{m-1}(x_i) + γ b_j).

Note that in this line search the values of {b_j}_{j=1}^{J_m} do not have any effect on the final result; the only things that matter are the regions {R_jm}_{j=1}^{J_m}. Thus we can simplify it and write it as follows:

γ_jm = arg min_γ sum_{x_i ∈ R_jm} L(y_i, G_{m-1}(x_i) + γ).

The overall algorithm is shown in Algorithm 3.

Algorithm 3: Gradient boosting for trees.

Input: The set of weak learners, {g_t(x)}_{t=1}^T
Output: The weights of the generalized learner, {α_t}_{t=1}^T

Initialize a single-node tree, G_0(x) = arg min_γ sum_i L(y_i, γ).
for m = 1 to M do
    Compute the negative gradient r_m = (r_1m, r_2m, ..., r_nm), where
        r_im = -[∂L(y_i, G(x_i)) / ∂G(x_i)]_{G(x) = G_{m-1}(x)}.
    Fit a regression tree to the pseudo-responses (residuals) {(x_i, r_im)}_{i=1}^n, which yields terminal regions R_jm, j = 1, ..., J_m.
    for j = 1, 2, ..., J_m do
        γ_jm = arg min_γ sum_{x_i ∈ R_jm} L(y_i, G_{m-1}(x_i) + γ).
    Update: G_m(x) = G_{m-1}(x) + sum_j γ_jm I(x ∈ R_jm).
Return the output G_M(x).

In the gbm library of R, in the scenario shown in Algorithm 3, the shrinkage parameter is the λ (or learning-rate) parameter of the gradient updates, described in the next subsection. So the main tuning effort is in the choice of the n.trees and shrinkage parameters.
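For completeness, here is a compact Python sketch in the spirit of Algorithm 3 for the squared loss, using scikit-learn regression trees and a shrinkage (learning-rate) parameter lam; it is an illustrative sketch, not the gbm implementation. In scikit-learn itself, GradientBoostingRegressor plays a similar role, with n_estimators and learning_rate roughly corresponding to gbm's n.trees and shrinkage.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, lam=0.1, max_depth=2):
    """Gradient boosting for squared loss L(y, F) = (y - F)^2 / 2.

    The negative gradient is the residual y_i - G_{m-1}(x_i), and for squared
    loss the per-region line search is solved by the tree's own mean leaf
    prediction, so the update reduces to G_m = G_{m-1} + lam * h_m,
    with lam the shrinkage parameter of Section 3.0.1.
    """
    G0 = np.mean(y)                       # constant initial model argmin_c sum_i L(y_i, c)
    trees, pred = [], np.full(len(y), G0)
    for m in range(M):
        residual = y - pred               # negative gradient r_m
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += lam * h.predict(X)        # shrunken update
        trees.append(h)
    return G0, trees

def gradient_boost_predict(G0, trees, X, lam=0.1):
    pred = np.full(X.shape[0], G0)
    for h in trees:
        pred += lam * h.predict(X)
    return pred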

3.0.1 Regularization

Since gradient boosting for trees has a large number of degrees of freedom, it is highly prone to overfitting. One possible way to reduce the amount of overfitting is shrinkage of the coefficients. Suppose we have a parameter λ ∈ (0, 1), such that

G_m(x) = G_{m-1}(x) + λ γ_m h_m(x).

4 Final notes

Some of the intuition is from David Forsyth's class at UIUC. Peter Bartlett's class notes provided a very good summary of the main points.

References

[1] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory, pages 23-37. Springer, 1995.

[2] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, 2008.