Boosting with log-loss


Marco Cusumano-Towner

September 2, 2012

1 The problem

Suppose we have data examples $\{(x_i, y_i)\}$, $i = 1, \ldots, m$, for a two-class problem with $y_i \in \{-1, +1\}$. Let $F(x)$ be the predictor function, with the sign of $F(x_i)$ giving the prediction on example $i$. We assume that $F$ is a sum of basis functions $f_j \in \mathcal{F}$, where $\mathcal{F}$ is the set of possible basis functions:
$$F(x) = \sum_{j=1}^{T} f_j(x)$$
For now, we focus on weak learners of the following form, although similar algorithms apply to other forms of weak learner:
$$f_{A,c}(x) = \begin{cases} c & x \in A \\ 0 & x \notin A \end{cases}$$
where we have some set $\mathcal{A}$ of possible partitions $A$ and a score $c \in \mathbb{R}$, so that $\mathcal{F} = \mathcal{A} \times \mathbb{R}$.

An obvious objective to minimize is the average training error:
$$J_{\mathrm{err}} = \frac{1}{m} \sum_i \mathbb{1}\left[y_i \neq \mathrm{sign}(F(x_i))\right]$$
This is a very challenging optimization problem because the objective is not even continuous. Machine learning algorithms use various other loss functions that approximate this objective but are easier to optimize.

2 Adaboost as greedy coordinate descent

Adaboost attempts to minimize the exponential loss:
$$J_{\exp} = \sum_i \exp\left(-y_i \sum_{j=1}^{T} f_j(x_i)\right)$$
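To make the setup concrete, the following is a minimal sketch in Python/NumPy (the names `membership`, `weak_learners`, etc. are illustrative and not from the original note's code) of evaluating the additive predictor built from the indicator weak learners $f_{A,c}$, together with the exponential loss and the training error it upper-bounds.

```python
import numpy as np

def predict(weak_learners, membership):
    """Evaluate F(x_i) = sum_j f_j(x_i) for every training example.

    weak_learners: list of (a, c) pairs, where a indexes a column of `membership`.
    membership: dense (m, n_A) 0/1 array; membership[i, a] = 1 iff x_i is in partition A_a.
    """
    F = np.zeros(membership.shape[0])
    for a, c in weak_learners:
        F += c * membership[:, a]          # f_{A,c}(x_i) = c if x_i in A, else 0
    return F

def exponential_loss(F, y):
    """J_exp = sum_i exp(-y_i F(x_i)), an upper bound on the number of training errors."""
    return np.sum(np.exp(-y * F))

def training_error(F, y):
    """J_err: average 0-1 loss."""
    return np.mean(np.sign(F) != y)
```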

which is an upper bound on the average training error. As discussed in [3] and [2], Adaboost minimizes this loss by doing greedy coordinate descent on the functions $f_j$. That is, each $f_k$ is chosen such that
$$f_k = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_i \exp\left(-y_i \left(\sum_{j=1}^{k-1} f_j(x_i) + f(x_i)\right)\right)$$
where the previous $f_j$ for $j = 1, \ldots, k-1$ are fixed. Let $F_{k-1}(x_i) = \sum_{j=1}^{k-1} f_j(x_i)$ be the aggregated contribution from the existing terms for each data point $i$. Then
$$J_{\exp} = \sum_i \exp\left(-y_i (F_{k-1}(x_i) + f_k(x_i))\right) = \sum_i w_i \exp(-y_i f_k(x_i))$$
where the (un-normalized) weights $w_i$ are defined to be
$$w_i := \exp(-y_i F_{k-1}(x_i))$$
Each iteration is then interpreted as training a new weak learner $f_k$ on the weighted training set with respect to the exponential loss. This coordinate-wise procedure can also be seen as a procedure for feature selection.

We now derive the Adaboost updates for our choice of basis functions. At each iteration:
$$J_{\exp}(A, c) = \sum_i w_i \exp(-y_i f_{A,c}(x_i)) = \sum_{i:\, x_i \in A,\, y_i = +1} w_i \exp(-c) + \sum_{i:\, x_i \in A,\, y_i = -1} w_i \exp(c) + \sum_{i:\, x_i \notin A} w_i$$
$$= \exp(-c)\, W_+(A) + \exp(c)\, W_-(A) + W_0(A)$$
where $W_+(A)$, $W_-(A)$, and $W_0(A)$ are the sums of weights for positive examples with $x_i \in A$, negative examples with $x_i \in A$, and examples with $x_i \notin A$, respectively. We first find the optimum value of $c$ for each choice of $A$. Differentiating with respect to $c$ and setting to zero gives:
$$c^*(A) = \frac{1}{2} \log\left(\frac{W_+(A)}{W_-(A)}\right)$$
Plugging this expression into $J_{\exp}(A, c)$ gives the minimum achievable loss using a basis function with partition set $A$:
$$J^*_{\exp}(A) = W_+(A)\left(\frac{W_-(A)}{W_+(A)}\right)^{1/2} + W_-(A)\left(\frac{W_+(A)}{W_-(A)}\right)^{1/2} + W_0(A) = 2\sqrt{W_+(A)\, W_-(A)} + W_0(A)$$
This expression is very convenient because it is expressed in terms of sums of weights, which can be computed efficiently for all $A$ with sparse matrix multiplication when $\mathcal{A}$ is large but discrete.
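Under the same illustrative representation (a 0/1 partition-indicator matrix, which may be dense or scipy.sparse), a single greedy Adaboost step can compute $W_+(A)$, $W_-(A)$, and $W_0(A)$ for every $A$ at once with one set of matrix multiplications. This is only a sketch of the update described above, not the actual MEDUSA implementation.

```python
import numpy as np

def adaboost_step(membership, y, F_prev, eps=1e-12):
    """One greedy Adaboost step over indicator weak learners f_{A,c}.

    membership: (m, n_A) 0/1 matrix (dense ndarray or scipy.sparse); column a indicates A_a.
    y: (m,) labels in {-1, +1}.  F_prev: (m,) aggregated predictions F_{k-1}(x_i).
    Returns (index of best partition, its optimal score c*(A)).
    """
    w = np.exp(-y * F_prev)                      # Adaboost weights w_i = exp(-y_i F_{k-1}(x_i))
    W_plus = membership.T @ (w * (y == +1))      # W_+(A) for all A at once
    W_minus = membership.T @ (w * (y == -1))     # W_-(A) for all A at once
    W_out = w.sum() - (W_plus + W_minus)         # W_0(A): total weight of examples outside A
    J_star = 2.0 * np.sqrt(W_plus * W_minus) + W_out   # minimum achievable loss per A
    best = int(np.argmin(J_star))
    c_star = 0.5 * np.log((W_plus[best] + eps) / (W_minus[best] + eps))
    return best, c_star
```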

3 Greedy coordinate descent with the log-loss

We can derive an algorithm similar to Adaboost that will perform feature selection, but with the log-loss instead of the exponential loss. The log-loss does not increase exponentially as the margin gets more and more negative (see Figure 1), and is therefore more robust to outliers and noisy data. The log-loss is:
$$J_{\log} = \sum_i \log\left[1 + \exp\left(-y_i F_{k-1}(x_i) - y_i f_k(x_i)\right)\right]$$

Figure 1: The log-loss as a function of the new $c$ for a given set of weights $w$ and a given $A$, with the scores chosen by the various algorithms above.

Following what we did for Adaboost, we seek to greedily minimize the following function at each step with respect to the new function $f_k$, keeping the previous functions fixed:
$$f_k = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_i \log\left[1 + \exp\left(-y_i F_{k-1}(x_i) - y_i f(x_i)\right)\right]$$
Unfortunately, this cannot be expressed in terms of an optimization on a weighted training set, because each log-loss term does not factorize into the contribution of the previous functions $F_{k-1}$ and the contribution from the new function $f_k$, as was the case for the exponential loss.

Algorithm 1 (LogitBoost). For our choice of basis functions $\mathcal{F}$, to solve this problem exactly we would have to perform a numerical optimization for each $A$ to find the best $c^*(A)$, and then plug this $c^*(A)$ into the loss function to get $J^*_{\log}(A)$ and choose the $A$ with the minimum value. The objective for the optimization of $c$, $J_{\log}(c; A)$, is convex in $c$, as we now show. Let $F^i_{k-1} = F_{k-1}(x_i)$ for more compact notation.

The first and second derivatives are:
$$\frac{d}{dc} J_{\log}(c; A) = \sum_{i:\, x_i \in A} \frac{-y_i \exp\left(-y_i(F^i_{k-1} + c)\right)}{1 + \exp\left(-y_i(F^i_{k-1} + c)\right)}$$
$$\frac{d^2}{dc^2} J_{\log}(c; A) = \sum_{i:\, x_i \in A} \left[ \frac{y_i^2 \exp\left(-y_i(F^i_{k-1} + c)\right)}{1 + \exp\left(-y_i(F^i_{k-1} + c)\right)} - \frac{y_i^2 \exp\left(-2y_i(F^i_{k-1} + c)\right)}{\left(1 + \exp\left(-y_i(F^i_{k-1} + c)\right)\right)^2} \right]$$
Each single term is non-negative if:
$$\frac{\exp\left(-y_i(F^i_{k-1}+c)\right)}{1+\exp\left(-y_i(F^i_{k-1}+c)\right)} \ge \frac{\exp\left(-2y_i(F^i_{k-1}+c)\right)}{\left(1+\exp\left(-y_i(F^i_{k-1}+c)\right)\right)^2}$$
$$\iff \left(1+\exp\left(-y_i(F^i_{k-1}+c)\right)\right)\exp\left(-y_i(F^i_{k-1}+c)\right) \ge \exp\left(-2y_i(F^i_{k-1}+c)\right) \iff \exp\left(-y_i(F^i_{k-1}+c)\right) \ge 0$$
which always holds. Therefore the optimization over $c$ is convex, and can be solved with Newton's method. The LogitBoost algorithm of Friedman, Hastie and Tibshirani ([2]) executes one Newton step, and this appears to be sufficient to get very close to the true minimum, as seen in Figure 2. However, Friedman, Hastie and Tibshirani also suggest intentionally using a single Newton step instead of exact minimization for the Adaboost objective, which they call Gentle Adaboost.

The optimal $c$ for the log-loss single-Newton-step algorithm (LogitBoost) is just the Newton step itself, since the starting point is always $c = 0$:
$$c^*(A) = -\frac{\left.\frac{d}{dc} J_{\log}(c; A)\right|_{c=0}}{\left.\frac{d^2}{dc^2} J_{\log}(c; A)\right|_{c=0}} = \frac{\sum_{i:\, x_i \in A} \frac{y_i \exp(-y_i F^i_{k-1})}{1 + \exp(-y_i F^i_{k-1})}}{\sum_{i:\, x_i \in A} \left[ \frac{y_i^2 \exp(-y_i F^i_{k-1})}{1 + \exp(-y_i F^i_{k-1})} - \frac{y_i^2 \exp(-2 y_i F^i_{k-1})}{\left(1 + \exp(-y_i F^i_{k-1})\right)^2} \right]} = \frac{\sum_{i:\, x_i \in A,\, y_i = +1} w_i - \sum_{i:\, x_i \in A,\, y_i = -1} w_i}{\sum_{i:\, x_i \in A} w_i (1 - w_i)} = \frac{W_+(A) - W_-(A)}{\tilde{W}(A)}$$
where $w_i := \frac{1}{1 + \exp(y_i F_{k-1}(x_i))}$ are interpreted as (un-normalized) weights, $W_+(A)$ and $W_-(A)$ are sums of weights as before, and $\tilde{W}(A) := \sum_{i:\, x_i \in A} w_i (1 - w_i)$.
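A sketch of this variant under the same illustrative representation: the Newton-step scores $c^*(A)$ for all partitions use only weight sums (matrix multiplications), but evaluating the exact log-loss of a candidate $(A, c)$ still requires a pass over the training examples, which is the extra cost noted in the comparison in Section 3.1.

```python
import numpy as np

def logitboost_newton_scores(membership, y, F_prev, eps=1e-12):
    """Newton-step scores c*(A) = (W_+(A) - W_-(A)) / sum_{i in A} w_i (1 - w_i), for all A at once.

    membership: (m, n_A) 0/1 matrix (dense or scipy.sparse); y in {-1, +1}; F_prev = F_{k-1}(x_i).
    """
    w = 1.0 / (1.0 + np.exp(y * F_prev))           # logistic weights w_i
    W_plus = membership.T @ (w * (y == +1))
    W_minus = membership.T @ (w * (y == -1))
    W_curv = membership.T @ (w * (1.0 - w))        # sum_{i in A} w_i (1 - w_i)
    c_star = (W_plus - W_minus) / np.maximum(W_curv, eps)
    return c_star, (W_plus, W_minus, W_curv)

def exact_logloss(a_mask, c, y, F_prev):
    """Exact J_log(A, c): unlike the exponential loss, this needs a loop over the examples."""
    F_new = F_prev + c * a_mask                    # a_mask: dense 0/1 indicator of A
    return np.sum(np.logaddexp(0.0, -y * F_new))   # sum_i log(1 + exp(-y_i F_new_i)), computed stably
```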

Plugging in this value of $c$ for one Newton step gives an estimate of the best loss achievable for a choice of $A$:
$$J^*_{\log}(A) \approx \sum_{i:\, x_i \in A} \log\left(1 + \exp\left(-y_i F^i_{k-1} - y_i \frac{W_+(A) - W_-(A)}{\tilde{W}(A)}\right)\right) + \sum_{i:\, x_i \notin A} \log\left(1 + \exp\left(-y_i F^i_{k-1}\right)\right)$$
We refer to this process of choosing $c^*(A)$ by one Newton step, and then finding the $A$ with minimum resulting log-loss, as LogitBoost. Unfortunately, this does not decompose into a direct function of sums over weights, and so we cannot use sparse matrix multiplication to choose the best $A$ as we did with the exponential loss. Instead, for each $A$, we must loop over the training set to compute the true log-loss.

Algorithm 2 (QuadBoost). If we approximate the minimum loss for each $A$ at $c^*(A)$ by the minimum of the second-order expansion of the loss (which is what the Newton step actually minimizes), then we get a quantity that is expressed in terms of sums. This is given by:
$$J^*(A) \approx J_{\mathrm{prev}} - \frac{1}{2} \frac{\left( \left.\frac{d}{dc} J_{\log}(c; A)\right|_{c=0} \right)^2}{\left.\frac{d^2}{dc^2} J_{\log}(c; A)\right|_{c=0}} = J_{\mathrm{prev}} - \frac{\left( \sum_{i:\, x_i \in A} y_i w_i \right)^2}{2 \sum_{i:\, x_i \in A} w_i (1 - w_i)} = J_{\mathrm{prev}} - \frac{\left( \sum_{i:\, x_i \in A,\, y_i = +1} w_i - \sum_{i:\, x_i \in A,\, y_i = -1} w_i \right)^2}{2 \sum_{i:\, x_i \in A} w_i (1 - w_i)}$$
This expression only uses sums of weights, which we can compute efficiently for all $A$ at once using matrix multiplication. We refer to this process of choosing $c^*(A)$ by a single Newton step, and then using the second-order approximation criterion above to find the best $A$, as QuadBoost. Note that for MEDUSA this update only requires one set of matrix multiplications, instead of the two required by the existing code.

Algorithm 3 (GradientBoost). A different approximation also allows the search over our set $\mathcal{A}$ to be done using sparse matrix multiplication. Instead of using the $A$ that gives the optimal loss when $J_{\log}(c; A)$ is optimized, we use the $A$ that gives the highest value of $-\frac{d}{dc} J_{\log}(c; A)\big|_{c=0}$. The intuition is that this is the local direction of steepest descent in $A$ space. Then we perform a line search for the optimal $c$ for our choice of $A$. For the log-loss, the criterion for choosing $A$ is:
$$-\left.\frac{d}{dc} J_{\log}(c; A)\right|_{c=0} = \sum_{i:\, x_i \in A} \frac{y_i \exp(-y_i F_{k-1}(x_i))}{1 + \exp(-y_i F_{k-1}(x_i))} = W_+(A) - W_-(A)$$
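Given the weight sums from the LogitBoost sketch above, the QuadBoost and GradientBoost selection criteria are just elementwise expressions in those sums; a minimal sketch (helper names are illustrative):

```python
import numpy as np

def quadboost_gain(W_plus, W_minus, W_curv, eps=1e-12):
    """Second-order estimate of the log-loss decrease for each A:
    (W_+(A) - W_-(A))^2 / (2 * sum_{i in A} w_i (1 - w_i))."""
    return (W_plus - W_minus) ** 2 / (2.0 * np.maximum(W_curv, eps))

def gradientboost_criterion(W_plus, W_minus):
    """-dJ_log/dc at c = 0 for each A, i.e. W_+(A) - W_-(A)."""
    return W_plus - W_minus

# Example selection, with (W_plus, W_minus, W_curv) as returned by logitboost_newton_scores:
# best_quad = int(np.argmax(quadboost_gain(W_plus, W_minus, W_curv)))
# best_grad = int(np.argmax(gradientboost_criterion(W_plus, W_minus)))
```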

where the un-normalized weights $w_i$ are as in LogitBoost. The $A \in \mathcal{A}$ that maximizes this criterion can be found efficiently, since the $W$'s are found for all $A$ with sparse matrix multiplication. Then we can find the optimal choice of $c^*(A)$ by a single Newton step. A similar idea is presented in [?].

Algorithm 4 (CollinsBoost). Collins, Schapire, and Singer ([1]) present a sequential update algorithm for both the exponential and the log-loss in their Figure 2, using the language of Bregman divergences to make the connection. When applied to the exponential loss, the algorithm is basically identical to Adaboost. When applied to the log-loss, the algorithm is distinct from any of the above. We derive a version of the algorithm here in a simple way, without discussing Bregman distances.

Consider the following upper bound on the log-loss for a given training example $i$, as a function of the choice of new basis function $f_k$:
$$J^i_{\log}(f_k) = \log\left(1 + \exp\left(-y_i (F^i_{k-1} + f_k(x_i))\right)\right)$$
$$= \log\left(1 + \exp\left(-y_i F^i_{k-1} - y_i f_k(x_i)\right)\right) - \log\left(1 + \exp(-y_i F^i_{k-1})\right) + \log\left(1 + \exp(-y_i F^i_{k-1})\right)$$
$$= \log\left(\frac{1 + \exp\left(-y_i F^i_{k-1} - y_i f_k(x_i)\right)}{1 + \exp(-y_i F^i_{k-1})}\right) + \log\left(1 + \exp(-y_i F^i_{k-1})\right)$$
$$= \log\left(\frac{\exp(y_i F^i_{k-1}) + \exp(-y_i f_k(x_i))}{1 + \exp(y_i F^i_{k-1})}\right) + \log\left(1 + \exp(-y_i F^i_{k-1})\right)$$
$$= \log\left(1 - \frac{1}{1 + \exp(y_i F^i_{k-1})} + \frac{\exp(-y_i f_k(x_i))}{1 + \exp(y_i F^i_{k-1})}\right) + \log\left(1 + \exp(-y_i F^i_{k-1})\right)$$
$$\le \frac{1}{1 + \exp(y_i F^i_{k-1})}\left(\exp(-y_i f_k(x_i)) - 1\right) + \log\left(1 + \exp(-y_i F^i_{k-1})\right)$$
where the last line comes from $\log(1 + x) \le x$. Note that this bound is exact when $f_k(x_i) = 0$. An upper bound on the total log-loss is then:
$$J_{\log}(f_k) \le \sum_i \frac{1}{1 + \exp(y_i F^i_{k-1})}\left(\exp(-y_i f_k(x_i)) - 1\right) + \sum_i \log\left(1 + \exp(-y_i F^i_{k-1})\right) = \sum_i w_i \exp(-y_i f_k(x_i)) + \sum_i \left(\log\left(1 + \exp(-y_i F^i_{k-1})\right) - w_i\right)$$
where $w_i$ is again as in Algorithms 1 and 2. The terms on the right are constant, so minimizing this upper bound at each iteration is done by finding
$$f_k = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_i w_i \exp(-y_i f(x_i))$$
Note that this is identical to the update rule for Adaboost, except for the redefinition of the weights $w_i$; this was noted briefly in [3]. This means that the update rules for finding the optimal $A$ and $c$ are also identical, except for the choice of weighting function.
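Because CollinsBoost is the Adaboost update with the weights redefined as $w_i = 1/(1+\exp(y_i F_{k-1}(x_i)))$, a sketch only needs to change the weight line of the earlier Adaboost step (same illustrative representation as above):

```python
import numpy as np

def collinsboost_step(membership, y, F_prev, eps=1e-12):
    """Identical to adaboost_step above, except that the weights are logistic:
    w_i = 1 / (1 + exp(y_i F_{k-1}(x_i))) instead of w_i = exp(-y_i F_{k-1}(x_i))."""
    w = 1.0 / (1.0 + np.exp(y * F_prev))          # the only change relative to Adaboost
    W_plus = membership.T @ (w * (y == +1))
    W_minus = membership.T @ (w * (y == -1))
    W_out = w.sum() - (W_plus + W_minus)
    surrogate = 2.0 * np.sqrt(W_plus * W_minus) + W_out   # minimized upper-bound surrogate
    best = int(np.argmin(surrogate))
    c_star = 0.5 * np.log((W_plus[best] + eps) / (W_minus[best] + eps))
    return best, c_star
```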

Note that each step is not training a log-loss weak learner on the weighted training set, but an exponential-loss weak learner. The fact that the upper bound is exact for $c = 0$ suggests that this algorithm will be most accurate in the later stages of training. This algorithm appears to perform significantly worse than Algorithm 1.

Algorithm 5 (MedusaBoost). The current logit-boost MEDUSA code minimizes the following function (exactly) at each step with respect to $f_{A,c}$:
$$J_{\mathrm{med}}(A, c) = \sum_i \frac{1}{1 + \exp(y_i F^i_{k-1})} \log\left(1 + \exp(-y_i f_{A,c}(x_i))\right) = \sum_i w_i \log\left(1 + \exp(-y_i f_{A,c}(x_i))\right)$$
The intuition is that we are training a log-loss weak learner on a re-weighted version of the training set at each iteration. This makes sense conceptually, but it is not any of the algorithms in the literature discussed above. For our set of basis functions, this function is:
$$J_{\mathrm{med}}(A, c) = \sum_{i:\, x_i \in A,\, y_i = +1} w_i \log\left(1 + \exp(-c)\right) + \sum_{i:\, x_i \in A,\, y_i = -1} w_i \log\left(1 + \exp(c)\right) + \sum_{i:\, x_i \notin A} w_i \log 2$$
Finding the optimal $c$ for a given $A$ by differentiating:
$$c^*(A) = \log\left(\frac{W_+(A)}{W_-(A)}\right)$$
Note that this is twice the prediction of Algorithm 3. Plugging this into the objective gives the expression for the optimal loss for a given $A$:
$$J^*_{\mathrm{med}}(A) = W_{\mathrm{total}} \log 2 + W_+(A) \log\left(0.5 + 0.5\,\frac{W_-(A)}{W_+(A)}\right) + W_-(A) \log\left(0.5 + 0.5\,\frac{W_+(A)}{W_-(A)}\right)$$
which is the criterion used to choose a rule $A$ in the code. If interpreted as an algorithm for approximately greedily minimizing the total log-loss $J_{\log}$, this algorithm finds the optimal $c$ by using the following approximation of the derivative of the log-loss:
$$\frac{d}{dc} \log\left(1 + \exp\left(-y_i(F^i_{k-1} + c)\right)\right) = \frac{-y_i}{1 + \exp\left(y_i(F^i_{k-1} + c)\right)} \approx \frac{-y_i}{\left(1 + \exp(y_i F^i_{k-1})\right)\left(1 + \exp(y_i c)\right)}$$
This approximation is best when the margin $y_i F^i_{k-1}$ is near zero.
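A sketch of the MedusaBoost closed-form update, assuming the logistic-weight sums $W_+(A)$ and $W_-(A)$ are available as arrays (one entry per rule $A$) and $W_{\mathrm{total}}$ is the scalar sum of all weights:

```python
import numpy as np

def medusaboost_scores(W_plus, W_minus, W_total, eps=1e-12):
    """Per-rule MedusaBoost update: c*(A) = log(W_+(A) / W_-(A)) and the selection criterion
    J*_med(A) = W_total*log(2) + W_+*log(0.5 + 0.5*W_-/W_+) + W_-*log(0.5 + 0.5*W_+/W_-)."""
    ratio = (W_plus + eps) / (W_minus + eps)
    c_star = np.log(ratio)
    J_star = (W_total * np.log(2.0)
              + W_plus * np.log(0.5 + 0.5 / ratio)
              + W_minus * np.log(0.5 + 0.5 * ratio))
    return c_star, J_star
```

The rule $A$ with the lowest $J^*_{\mathrm{med}}(A)$ would then be selected, matching the criterion described above.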

3.1 Comparison of algorithms

I evaluated how well each of the above algorithms greedily minimizes the log-loss at each iteration, as well as the train and test accuracy, for a particular data setting: I used a set of 3D chip-features (exonarray_cons_vs_uw_fdr0.0_pks_proxk_v), with no paired regulator features, in addition to the NOT of all these chip features, and a bias feature (for a total of 30 + 30 + 1 base features). The training log-loss (the quantity being minimized) is plotted in Figure 2 for each of the algorithms. I also evaluated the train and test accuracy of each of these algorithms, shown in Figure 3.

Figure 2: LogitBoost, MedusaBoost, and QuadBoost all perform very similarly, with QuadBoost becoming less optimal in later iterations. LogitBoost is a much more computationally heavy algorithm, since it requires an additional loop over the training examples that the other two do not. AdaBoost is shown even though it is not attempting to minimize this loss function.

Figure 3: All algorithms have test accuracy within 1 or 2% on this problem. CollinsBoost appears to perform the worst of the log-loss algorithms. The training accuracies of LogitBoost, QuadBoost and MedusaBoost are basically identical.

3.2 Extension to real-valued features

Although it was not presented this way, the Alternating Decision Tree algorithm uses a predictor function of the following form:
$$F(x) = \sum_{t=1}^{T} c_t \prod_{j=1}^{n_t} f_{t_j}(x)$$
where the $f_{t_j}(x)$ are base features that are boolean conditions on $x$, and the product of these functions forms an AND. The loss function (the exponential loss was used in the ADT algorithm) is then minimized greedily with respect to adding a new term at each iteration. The tree part of the algorithm derives from the restriction that new terms can only involve a combination of base features that existed in a previous term, plus one additional base feature.

We can attempt to extend this model to incorporate general real-valued base features $f_{t_j}(x)$. Let $c$ be the score of the new feature, and let $f_k(x)$ be the new feature. The exponential loss used by Adaboost and the original ADT is then:
$$J_{\exp}(c, f_k) = \sum_i w_i \exp(-y_i c f_k(x_i))$$
whereas previously $f_k(x_i)$ was either zero or one. In our new case there is no longer a closed-form expression for the optimal $c$ given a choice of $f_k$; we would have to resort to a numerical optimization. The same holds if we want to use the log-loss.

Interestingly, if we use the QuadBoost or GradientBoost procedure as above, we can still use matrix multiplications to approximate which $f_k$ should be chosen. The same derivations as above apply to this case. For GradientBoost, the criterion for choosing a feature $f_k$ becomes:
$$-\left.\frac{d J_{\log}(c; f_k)}{dc}\right|_{c=0} = \sum_i \frac{y_i f_k(x_i) \exp(-y_i F_{k-1}(x_i))}{1 + \exp(-y_i F_{k-1}(x_i))} = \sum_i y_i f_k(x_i)\, w_i = \sum_{i:\, y_i = +1} w_i f_k(x_i) - \sum_{i:\, y_i = -1} w_i f_k(x_i)$$
where $w_i := \frac{1}{1 + \exp(y_i F_{k-1}(x_i))}$ as before. If the $f_k(x_i)$ are expressed in a matrix (or a pair of matrices, for the decomposable features used in MEDUSA), then these sums can be performed with a matrix multiplication.

For LogitBoost, having chosen an $f_k$ using the above GradientBoost method, we would choose the optimal $c$ using a single Newton step, which is now:
$$c^*(f_k) = \frac{\sum_{i:\, y_i = +1} w_i f_k(x_i) - \sum_{i:\, y_i = -1} w_i f_k(x_i)}{\sum_i f_k(x_i)^2\, w_i (1 - w_i)} = \frac{\sum_i w_i y_i f_k(x_i)}{\sum_i f_k(x_i)^2\, w_i (1 - w_i)}$$
The QuadBoost criterion for choosing an $f_k$ is then the estimated decrease in loss,
$$\frac{\left(\sum_i w_i y_i f_k(x_i)\right)^2}{2 \sum_i f_k(x_i)^2\, w_i (1 - w_i)}$$
and these sums can be computed by matrix multiplication. For MEDUSA with decomposable features, the sum over $i$ becomes a sum over $g$ and $e$, and the code does not have to change.

3.3 Binary response variable (logistic regression)

The loss as a function of the new feature choice $k$ and new coefficient $c$ is:
$$J(k, c) = \sum_i \log\left(1 + \exp\left(-y_i (F_i + c A_{ki})\right)\right)$$
The derivative with respect to $c$ is:
$$\frac{dJ}{dc} = \sum_i \frac{-y_i A_{ki} \exp\left(-y_i (F_i + c A_{ki})\right)}{1 + \exp\left(-y_i (F_i + c A_{ki})\right)} = \sum_i \frac{-y_i A_{ki}}{1 + \exp\left(y_i (F_i + c A_{ki})\right)}$$
Setting the derivative equal to zero does not give an analytic solution (even if we restrict ourselves to binary features). Instead, we again resort to a single Newton step.
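A minimal sketch of this real-valued extension, assuming the candidate features are stacked in a dense matrix `A` with `A[k, i]` $= f_k(x_i)$ (this layout and the helper name are illustrative, following the $A_{ki}$ notation above, and are not the MEDUSA code):

```python
import numpy as np

def select_real_valued_feature(A, y, F_prev, eps=1e-12):
    """GradientBoost feature selection followed by a single Newton step for the score c.

    A: (n_features, m) matrix with A[k, i] = f_k(x_i); y in {-1, +1}; F_prev = F_{k-1}(x_i).
    """
    w = 1.0 / (1.0 + np.exp(y * F_prev))
    grad_crit = A @ (w * y)                   # sum_i w_i y_i f_k(x_i), for every feature k at once
    k = int(np.argmax(grad_crit))             # steepest-descent choice of feature
    curv = (A[k] ** 2) @ (w * (1.0 - w))      # sum_i f_k(x_i)^2 w_i (1 - w_i)
    c = grad_crit[k] / max(curv, eps)         # single Newton step for c
    return k, c
```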

3.4 Poisson response variable

If the response variable is distributed according to a Poisson distribution where $\log(\lambda)$ is a weighted sum of features, then the loss as a function of the new choice of feature $k$ and the new score $c$ (derived from the negative log-likelihood) is:
$$J(k, c) = \sum_i \left[-y_i (F_i + c A_{ki}) + \exp(F_i + c A_{ki})\right] = -c \sum_i y_i A_{ki} - \sum_i y_i F_i + \sum_i \exp(c A_{ki}) \exp(F_i)$$
The derivative with respect to $c$ is
$$\frac{dJ}{dc} = -\sum_i y_i A_{ki} + \sum_i A_{ki} \exp(c A_{ki}) \exp(F_i)$$
Setting the derivative to zero does not give an analytic solution for the optimal $c$ in the general case, and we will resort to using one Newton step as before. However, in the case of boolean features ($A_{ki} \in \{0, 1\}$) we can find an analytic solution:
$$\exp(c) \sum_i A_{ki} \exp(F_i) = \sum_i y_i A_{ki} \quad\Rightarrow\quad c^*_k = \log\left(\frac{\sum_i A_{ki} y_i}{\sum_i A_{ki} \exp(F_i)}\right)$$
Substituting this expression into $J(k, c)$ gives the optimal loss achievable for each choice of feature $k$ to add:
$$J^*_k = J_{\mathrm{prev}} - \sum_i A_{ki} \exp(F_i) + \sum_i A_{ki} y_i \left(1 - \log\left(\frac{\sum_i A_{ki} y_i}{\sum_i A_{ki} \exp(F_i)}\right)\right)$$
The required sums are $\sum_i A_{ki} y_i$ and $\sum_i A_{ki} \exp(F_i)$, which can each be computed for all $k$ with matrix multiplications.

If we don't restrict ourselves to boolean features, then we resort to a Newton step. The first and second derivatives with respect to $c$, evaluated at $c = 0$, are:
$$\left.\frac{dJ}{dc}\right|_{c=0} = -\sum_i y_i A_{ki} + \sum_i A_{ki} \exp(F_i), \qquad \left.\frac{d^2J}{dc^2}\right|_{c=0} = \sum_i A_{ki}^2 \exp(F_i)$$
The Newton step is then:
$$c_k = -\frac{\left.\frac{dJ}{dc}\right|_{c=0}}{\left.\frac{d^2J}{dc^2}\right|_{c=0}} = \frac{\sum_i A_{ki}\left(y_i - \exp(F_i)\right)}{\sum_i A_{ki}^2 \exp(F_i)}$$
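For the boolean-feature Poisson case, the closed-form score and optimal loss for every candidate feature reduce to two matrix-vector products; a sketch under the same illustrative `A[k, i]` layout as above:

```python
import numpy as np

def poisson_boolean_scores(A, y, F_prev, eps=1e-12):
    """Closed-form update for boolean features A[k, i] in {0, 1} under the Poisson loss:
    c*_k = log(S_y / S_e) and J*_k = J_prev - S_e + S_y * (1 - c*_k),
    where S_y = sum_i A_ki y_i and S_e = sum_i A_ki exp(F_i)."""
    lam = np.exp(F_prev)                     # current rates lambda_i = exp(F_i)
    S_y = A @ y                              # sum_i A_ki y_i, for every k at once
    S_e = A @ lam                            # sum_i A_ki exp(F_i), for every k at once
    c_star = np.log((S_y + eps) / (S_e + eps))
    J_prev = np.sum(lam - y * F_prev)        # negative log-likelihood, up to the log(y_i!) term
    J_star = J_prev - S_e + S_y * (1.0 - c_star)
    return c_star, J_star
```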

3.5 Gaussian response variable

If the response variable is normally distributed, then the loss as a function of the new choice of feature $k$ and the new score $c$ is:
$$J(k, c) = \sum_i \left(y_i - F_i - c A_{ki}\right)^2$$
Minimizing with respect to $c$, we have the optimal $c$ for each $k$:
$$c^*_k = \frac{\sum_i A_{ki}(y_i - F_i)}{\sum_i A_{ki}^2}$$
Substituting this expression for $c^*_k$ into $J(k, c)$ gives the optimal loss achievable for each choice $k$ of feature to add:
$$J^*_k = \sum_i \left(y_i - F_i - A_{ki} \frac{\sum_j A_{kj}(y_j - F_j)}{\sum_j A_{kj}^2}\right)^2 = \sum_i (y_i - F_i)^2 - 2 \sum_i (y_i - F_i) A_{ki} \frac{\sum_j A_{kj}(y_j - F_j)}{\sum_j A_{kj}^2} + \sum_i A_{ki}^2 \left(\frac{\sum_j A_{kj}(y_j - F_j)}{\sum_j A_{kj}^2}\right)^2 = J_{\mathrm{prev}} - \frac{\left(\sum_i A_{ki}(y_i - F_i)\right)^2}{\sum_i A_{ki}^2}$$
The numerator and denominator can be calculated efficiently for all $k$ at once using matrix multiplication. If the data consist of two dimensions $g$ and $e$, and the features are the product of two base features $m$ and $r$ of the form $A^{mr}_{ge} = B^m_{ge} C^r_e$, then the optimal loss for a choice of $m$ and $r$ is:
$$J^*_{mr} = J_{\mathrm{prev}} - \frac{\left(\sum_{g=1}^{G} \sum_{e=1}^{E} B^m_{ge}\, C^r_e\, (y_{ge} - F_{ge})\right)^2}{\sum_{g=1}^{G} \sum_{e=1}^{E} (B^m_{ge})^2 (C^r_e)^2}$$
Now, the numerator involves a specialized matrix multiplication $B(y - F)C$ (of the same complexity as a normal matrix multiplication).

References

[1] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. In Machine Learning, pages 158-169, 2000.

[2] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2):337-407, 2000.

[3] Robert E. Schapire. The boosting approach to machine learning: An overview, 2002.