Transform Regression and the Kolmogorov Superposition Theorem


Edwin Pednault
IBM T. J. Watson Research Center, Kitchawan Road, P.O. Box 218, Yorktown Heights, NY 10598 USA
pednault@us.ibm.com

Abstract

This paper presents a new predictive modeling algorithm that draws inspiration from the Kolmogorov superposition theorem. An initial version of the algorithm is presented that combines gradient boosting, generalized additive models, and decision-tree methods to construct models that have the same overall mathematical structure as Kolmogorov's superposition equation. Improvements to the algorithm are then presented that significantly increase its rate of convergence. The resulting algorithm, dubbed transform regression, generates surprisingly good models compared to those produced by the underlying decision-tree method when the latter is applied directly. Transform regression is highly scalable, and a parallelized database-embedded version of the algorithm has been implemented as part of IBM DB2 Intelligent Miner Modeling.

Keywords: Gradient boosting, Generalized additive modeling, Decision trees.

1. Introduction

In many respects, decision trees and neural networks represent diametrically opposed classes of learning techniques. A strength of one is often a weakness of the other. Decision-tree methods approximate response surfaces by segmenting the input space into regions and using simple models within each region for local surface approximation. The strengths of decision-tree methods are that they are nonparametric, fully automated, and computationally efficient. Their weakness is that statistical estimation errors increase with the depth of trees, which ultimately limits the granularity of the surface approximation that can be achieved for fixed-size data. In contrast, neural network methods fit highly flexible families of nonlinear parametric functions to entire surfaces to construct global approximations. The strength of this approach is that it avoids the increase in estimation error that accompanies segmentation and local model fitting. The weakness is that fitting nonlinear parametric functions to data is computationally demanding, and these demands are exacerbated by the fact that several network architectures often need to be trained and evaluated in order to maximize predictive accuracy.

This paper presents a new modeling approach that attempts to combine the strengths of the methods described above: specifically, the global fitting aspect of neural networks with the automated, computationally efficient, and nonparametric aspects of decision trees. To achieve this union, this new modeling method draws inspiration from the Kolmogorov superposition theorem:

Theorem (Kolmogorov, 1957). For every integer dimension d >= 2, there exist continuous real functions h_ij(x) defined on the unit interval U = [0,1], such that for every continuous real function f(x_1,...,x_d) defined on the d-dimensional unit hypercube U^d, there exist real continuous functions g_i(x) such that

    f(x_1, ..., x_d) = Σ_{i=1}^{2d+1} g_i( Σ_{j=1}^{d} h_ij(x_j) ).

Stronger versions of this theorem have also been reported (Lorentz, 1962; Sprecher, 1965). The theorem is interesting because it states that even the most complex multivariate functions can be decomposed into combinations of univariate functions, thereby enabling cross-product interactions to be modeled without introducing cross-product terms in the model. Hecht-Nielsen (1987) has noted that the superposition equation can be interpreted as a three-layer neural network and has suggested using the theorem as a basis for understanding multilayer neural networks.
Girosi and Poggio (1989), on the other hand, have criticized this suggestion for several reasons, one being that applying Kolmogorov's theorem would require the inductive learning of nonparametric activation functions. Neural network methods, by contrast, usually assume that the activation functions are given and the problem is to learn values for the weights that appear in the networks. Although the usual paradigm for training weights can be extended to incorporate the learning of smooth parametric activation functions (i.e., by including their parameters in the partial derivatives that are calculated during training), the incorporation of nonparametric learning methods into the training paradigm was seen as problematic.

Nonparametric learning, on the other hand, is a key strength of decision-tree methods. The learning of nonparametric activation functions thus provides a starting point for combining the global-fitting aspects of neural network methods with the nonparametric learning aspects of decision-tree methods. In the sections that follow, an initial algorithm is presented and then subsequently refined that uses decision-tree methods to inductively learn instantiations of the g_i and h_ij functions that appear in Kolmogorov's superposition equation so as to make the equation a good predictor of underlying response surfaces. In this respect, the initial algorithm and its refinement are inspired by, but are not mathematically based upon, Kolmogorov's theorem.

The g_i and h_ij functions that are created by the algorithms presented here are quite different from those that are constructed in the various proofs of Kolmogorov's theorem and its variants. The latter are highly nonsmooth fractal functions that in some respects are comparable to hashing functions (Girosi and Poggio, 1989). Moreover, Kolmogorov's theorem requires that the h_ij functions be universal for a given dimension d; that is, the h_ij functions are fixed for each dimension d and only the g_i functions depend on the specific function f. The initial algorithm presented below, on the other hand, heuristically constructs both g_i and h_ij functions that depend on training data.

It is important to note that neither universality nor the specific functional forms of the g_i and h_ij functions that appear in the various proofs of Kolmogorov's theorem are necessary in order to satisfy the superposition equation. For example, consider the function f(x,y) = xy. This function can be rewritten in superpositional form as

    f(x,y) = 0.25(x + y)^2 - 0.25(x - y)^2.

In this case, h_11(x) = h_21(x) = x, h_12(y) = y, h_22(y) = -y, h_ij(z) = 0 for i > 2, g_1(z) = 0.25z^2, g_2(z) = -0.25z^2, and g_i(z) = 0 for i > 2. Although these particular g_i and h_ij functions satisfy the superposition equation, they do not satisfy the preamble of the theorem because the above h_ij functions are not universal for all continuous functions f(x,y). Nor do the above g_i and h_ij functions at all resemble the g_i and h_ij functions that are constructed in the proofs of Kolmogorov's theorem and its variants. In general, for any given function f(x_1,...,x_d), there can exist a wide range of g_i and h_ij functions that satisfy the superposition equation without satisfying the preamble of the theorem.

Taking the above observations to heart, the initial algorithm presented below likewise ignores the preamble of Kolmogorov's theorem and instead focuses on the mathematical structure of the superposition equation itself. Decision-tree methods and greedy search heuristics are used to construct g_i and h_ij functions based on training data in an attempt to make the superposition equation a good predictor. The approach contrasts with previous work on direct application of the superposition theorem (Neruda, Štědrý, & Drkošová, 2000; Sprecher, 1996, 1997, 2002). One difficulty with direct application is that the g_i and h_ij functions that need to be constructed are extremely complex and entail very large computational overheads to implement, even when the target function is known (Neruda, Štědrý, & Drkošová, 2000). Another problem is that noisy data is highly problematic. The approach presented here avoids both these issues, but it is heuristic in nature.
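As a quick numerical sanity check of the f(x,y) = xy example above, the following NumPy sketch (the variable names are ours) evaluates the superpositional form on random points and confirms that it reproduces xy up to round-off error:

```python
import numpy as np

# Superpositional form of f(x, y) = x*y given above:
#   inner sums: h_11(x) + h_12(y) = x + y,   h_21(x) + h_22(y) = x - y
#   outer functions: g_1(z) = 0.25*z**2,     g_2(z) = -0.25*z**2
g1 = lambda z: 0.25 * z ** 2
g2 = lambda z: -0.25 * z ** 2

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)
y = rng.uniform(0.0, 1.0, size=1000)

superposition = g1(x + y) + g2(x - y)      # sum of univariate functions of univariate sums
assert np.allclose(superposition, x * y)   # equals xy to within round-off error
```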
Although there are no mathematical guarantees of obtaining good predictive models using this approach, the initial algorithm and its refinements nevertheless produce very good results in practice. Thus, one of the contributions of this paper is to demonstrate that the mathematical form of the superposition theorem is interesting in and of itself, and can be heuristically exploited to obtain good predictive models.

The initial algorithm is based on heuristically interpreting Kolmogorov's superposition equation as a gradient boosting model (Friedman, 2001, 2002) in which the base learner constructs generalized additive models (Hastie & Tibshirani, 1990) whose outputs are then nonlinearly transformed to remove systematic errors in their residuals. To provide the necessary background to motivate this interpretation, gradient boosting and generalized additive modeling are first briefly overviewed in Sections 2 and 3. The initial algorithm is then presented in Section 4. The initial algorithm, however, has very poor convergence properties. Sections 5 and 6 therefore present improvements to the initial algorithm to obtain much faster rates of convergence, yielding a new algorithm called transform regression. Faster convergence is achieved by modifying Friedman's gradient boosting framework so as to introduce a nonlinear form of Gram-Schmidt orthogonalization. The modification requires generalizing the mathematical forms of the models to allow constrained multivariate g_i and h_ij functions to appear in the resulting models in order to implement the orthogonalization method. The resulting models thus depart from the pure univariate form of Kolmogorov's superposition equation, but the benefit is significantly improved convergence properties. The nonlinear Gram-Schmidt orthogonalization technique is another contribution of this paper, since it can be combined with other gradient boosting algorithms to obtain similar benefits, such as Friedman's (2001, 2002) gradient tree boosting algorithm.

Section 7 presents evaluation results that compare the performance of the transform regression algorithm to the underlying decision-tree method that is employed. In the evaluation study, transform regression often produced better predictive models than the underlying decision-tree method when the latter was applied directly. This result is interesting because transform regression uses decision trees in a highly constrained manner. Section 8 discusses some of the details of a parallelized database-embedded implementation of transform regression that was developed for IBM DB2 Intelligent Miner Modeling. Section 9 presents conclusions and discusses possible directions for future work.

2. Gradient boosting

Gradient boosting (Friedman, 2001, 2002) is a method for making iterative improvements to regression-based predictive models. The method is similar to gradient descent except that, instead of calculating gradient directions in parameter space, a learning algorithm (called the base learner) is used to estimate gradient directions in function space. Whereas with conventional gradient descent each iteration contributes an additive update to the current vector of parameter values, with gradient boosting each iteration contributes an additive update to the current regression model. When the learning objective is to minimize total square error, gradient boosting is equivalent to Jiang's LSBoost.Reg algorithm (Jiang, 2001, 2002) and to Mallat and Zhang's matching pursuit algorithm (Mallat and Zhang, 1993).

In the special case of square-error loss, the gradient directions in function space that are estimated by the base learner are models that predict the residuals of the current overall model. Model updating is then accomplished by summing the output of the current model with the output of the model for predicting the residuals, which has the effect of adding correction terms to the current model to improve its accuracy. The resulting gradient boosting algorithm is summarized in Table 1. If the base learner is able to perform a true least-squares fit on the residuals at each iteration, then the multiplying scalar α will always be equal to one. In the general case of an arbitrary loss function, the residual error (a.k.a. pseudo-residuals) at each iteration is equal to the negative partial derivative of the loss function with respect to the model output for each data record. In this more general setting, line search is usually needed to optimize the value of α.

Table 1. The gradient boosting method for minimizing square error.
    Let the current model M be zero everywhere;
    Repeat until the current model M does not appreciably change:
        Use the base learner to construct a model R that predicts the residual error of M, ensuring that R does not overfit the data;
        Find a value for scalar α that minimizes the loss (i.e., total square error) for the model M + αR;
        Update M ← M + αR;

When applying gradient boosting, overfitting can be an issue and some means for preventing overfitting must be employed (Friedman, 2001, 2002; Jiang, 2001, 2002). For the algorithms presented in this paper, a portion of the training data is reserved as a holdout set which the base learner employs at each iteration to perform model selection for the residual models. Because it is also possible to overfit the data by adding too many boosting stages (Jiang, 2001, 2002), this same holdout set is also used to prune the number of gradient boosting stages. The algorithms continue to add gradient boosting stages until a point is reached at which either the net improvement obtained on the training data as a result of an iteration falls below a preset threshold, or the prediction error on the holdout set has increased steadily for a preset number of iterations. The current model is then pruned back to the boosting iteration that maximizes predictive accuracy on the holdout set.
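To make Table 1 concrete, the following is a minimal NumPy sketch of the square-error gradient boosting loop; the fit_base_learner callback, the closed-form line search, and the simple stopping rule are simplifications of ours, and the holdout-based stage pruning described above is omitted:

```python
import numpy as np

def gradient_boost(X, y, fit_base_learner, max_stages=50, tol=1e-6):
    """Gradient boosting for square-error loss, along the lines of Table 1.

    fit_base_learner(X, residuals) must return a callable model: predict(X) -> array.
    """
    stages = []
    prediction = np.zeros(len(y))          # current model M, zero everywhere
    prev_sse = np.inf
    for _ in range(max_stages):
        residuals = y - prediction         # pseudo-residuals for square-error loss
        R = fit_base_learner(X, residuals) # base-learner model of the residuals
        r_pred = R(X)
        # closed-form line search for alpha under square-error loss
        alpha = np.dot(residuals, r_pred) / max(np.dot(r_pred, r_pred), 1e-12)
        prediction += alpha * r_pred       # update M <- M + alpha * R
        stages.append((alpha, R))
        sse = np.sum((y - prediction) ** 2)
        if prev_sse - sse < tol:           # stop when the improvement is negligible
            break
        prev_sse = sse
    return stages
```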
3. Generalized additive models

Generalized additive modeling (Hastie & Tibshirani, 1990) is a method for constructing regression models of the form

    ỹ = ȳ + Σ_{j=1}^{d} h_j(x_j).

Hastie and Tibshirani's backfitting algorithm is typically used to perform this modeling task. Backfitting assumes the availability of a learning algorithm called a smoother for estimating univariate functions. Traditional examples of smoothers include classical smoothing algorithms that use kernel functions to calculate weighted averages of training data, where the center of the kernel is the point at which the univariate function is to be estimated and the weights are given by the shapes of the kernel functions. However, in general, any learning algorithm can be used as a smoother, including decision-tree methods. An attractive aspect of decision-tree methods is that they can explicitly handle both numerical and categorical input features, whereas classical kernel smoothers require preprocessing to construct numerical encodings of categorical features.

With the backfitting algorithm, the value of ȳ would first be set to the mean of the target variable, and a smoother would then be applied to successive input variables x_j in round-robin fashion to iteratively (re)estimate the h_j functions until convergence is achieved. The resulting algorithm is summarized in Table 2. Overfitting and loop termination can be handled in a similar fashion as for gradient boosting. The initialization of the h_j functions can be accomplished by setting them to be zero everywhere.

Table 2. The backfitting algorithm.
    Set ȳ equal to the mean of the target variable y;
    For j = 1,...,d, initialize the functions h_j(x_j);
    Repeat until the functions h_j(x_j) do not appreciably change:
        For j = 1,...,d, do the following:
            Use the smoother to construct a model H_j(x_j) that predicts the following target value using only input feature x_j:
                new target = y - ȳ - Σ_{k≠j} h_k(x_k);
            Update h_j(x_j) ← H_j(x_j);

Table 3. A greedy one-pass additive modeling algorithm.
    Set ȳ equal to the mean of the target variable y;
    For j = 1,...,d, use the smoother to construct a model H_j(x_j) that predicts the target value (y - ȳ) using only input feature x_j;
    Calculate linear regression coefficients λ_j such that Σ_j λ_j H_j(x_j) is a best predictor of (y - ȳ);
    For j = 1,...,d, set h_j(x_j) = λ_j H_j(x_j);

Alternatively, one can obtain very good initial estimates by applying a greedy one-pass approximation to backfitting that independently applies the smoother to each input x_j and then combines the resulting models using linear regression. This one-pass additive modeling algorithm is summarized in Table 3. Remarkably, the greedy one-pass algorithm shown in Table 3 can often produce surprisingly good models in practice without additional iterative backfitting. The one-pass algorithm also has the advantage that overfitting can be controlled in the linear regression calculation, either via feature selection or by applying a regularization method. This overfitting control is available in addition to overfitting controls that may be provided by the smoother. Examples of the latter include kernel-width parameters in the case of classical kernel smoothers and tree pruning in the case of decision-tree methods.

4. An initial algorithm

As mentioned earlier, inspiration for the initial algorithm presented below is based on interpreting Kolmogorov's superposition equation as a gradient boosting model in which the base learner constructs generalized additive models whose outputs are then nonlinearly transformed to remove systematic errors in their residuals. To motivate this interpretation, suppose that we are trying to infer a predictive model for y as a function of inputs x_1,...,x_d given a set of noisy training data {(x_1,...,x_d, y)}. As a first attempt, we might try constructing a generalized additive model of the form

    ỹ = ȳ + Σ_{j=1}^{d} h_j(x_j) = Σ_{j=1}^{d} ĥ_j(x_j),    (1)

where

    ĥ_j(x_j) = h_j(x_j) + ȳ/d.    (2)

This modeling task can be performed by first applying either the backfitting algorithm shown in Table 2 or the greedy one-pass algorithm shown in Table 3, and then distributing the additive constant ȳ that is obtained equally among the transformed inputs as per Equation 2. Independent of whether backfitting or the greedy one-pass algorithm is applied, residual nonlinearities can still exist in the relationship between the additive model output ỹ and the target value y. To remove such nonlinearities, the same smoother used for additive modeling can again be applied, this time to linearize ỹ with respect to y. The resulting combined model would then have the form

    ŷ = g(ỹ) = g( Σ_{j=1}^{d} ĥ_j(x_j) ).    (3)

To further improve the model, gradient boosting can be applied by using the above two-stage modeling technique as the base learner. The resulting gradient boosting model would then have the form

    ỹ_i = Σ_{j=1}^{d} ĥ_ij(x_j) = Σ_{j=1}^{d} ( h_ij(x_j) + ȳ_i/d ),    (4a)

    ŷ_i = g_i(ỹ_i) = g_i( Σ_{j=1}^{d} ĥ_ij(x_j) ),    (4b)

    ŷ = Σ_i ŷ_i = Σ_i g_i( Σ_{j=1}^{d} ĥ_ij(x_j) ).    (4c)

Equations 4a and 4b define the stages in the resulting gradient boosting model. Equation 4a defines the generalized additive models ỹ_i that are constructed in each boosting stage, while Equation 4b defines the boosting stage outputs ŷ_i. Equation 4c defines the output ŷ of the overall model, which is the sum of the boosting stage outputs ŷ_i.
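The one-pass procedure of Table 3 is the building block used in every boosting stage below; here is a minimal NumPy sketch of it, in which the fit_smoother callback stands in for an arbitrary smoother and plain least squares replaces the stepwise regression used later in the paper:

```python
import numpy as np

def one_pass_additive_model(X, y, fit_smoother):
    """Greedy one-pass additive modeling along the lines of Table 3.

    fit_smoother(x_j, target) must return a callable univariate model H_j.
    """
    y_bar = y.mean()                                         # additive constant
    smoothers = [fit_smoother(X[:, j], y - y_bar) for j in range(X.shape[1])]
    H = np.column_stack([H_j(X[:, j]) for j, H_j in enumerate(smoothers)])
    # coefficients lambda_j so that sum_j lambda_j * H_j(x_j) best predicts (y - y_bar)
    lam, *_ = np.linalg.lstsq(H, y - y_bar, rcond=None)

    def predict(X_new):
        H_new = np.column_stack([H_j(X_new[:, j]) for j, H_j in enumerate(smoothers)])
        return y_bar + H_new @ lam    # equivalent to distributing y_bar as in Equation 2
    return predict
```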

Figure 1. An example of the modeling behavior of the initial algorithm. (a) The target function. (b) The model output after one gradient boosting stage. (c) After two boosting stages. (d) After ten boosting stages.

The resulting algorithm will thus generate predictive models that have the same mathematical form as Kolmogorov's superposition equation. The algorithm itself is summarized in Table 4.

Table 4. The initial algorithm.
    Let the current model M be zero everywhere;
    Repeat until the current model M does not appreciably change:
        At the i-th iteration, construct an additive model ỹ_i that predicts the residual error of M using the raw features x_1,...,x_d as input, ensuring that ỹ_i does not overfit the data;
        Construct an additive model ŷ_i that predicts the residual error of M using the output of model ỹ_i as the only input feature, ensuring that ŷ_i does not overfit the data;
        Update M ← M + ŷ_i;

To demonstrate the behavior of the initial algorithm, the ProbE linear regression tree (LRT) algorithm (Natarajan & Pednault, 2002) was used as the smoother in combination with the one-pass greedy additive modeling algorithm shown in Table 3. The LRT algorithm constructs decision trees with multivariate linear regression models in the leaves. To prevent overfitting, LRT employs a combination of tree pruning and stepwise linear regression techniques to select both the size of the resulting tree and the variables that appear in the leaf models. Predictive accuracy on a holdout data set is used as the basis for making this selection. In its use as a smoother, the LRT algorithm was configured to construct splits only on the feature being transformed. In addition, in the case of numerical features, the feature being smoothed was also allowed to appear as a regression variable in the leaf models. The resulting h_ij functions are thus piecewise linear functions for numerical features x_j, and piecewise constant functions for categorical features x_j. For the linear regression operation in Table 3, forward stepwise regression was used for feature selection, and the same holdout set used by LRT for tree pruning was used for feature pruning to prevent overfitting.
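As a rough stand-in for this use of LRT as a smoother, the sketch below fits a piecewise-linear transform of a single numerical feature by splitting on quantile bins and fitting a line within each bin; the binning rule and bin count are our simplifications, not ProbE/LRT behavior. A function produced this way can be passed as fit_smoother to the one-pass sketch above.

```python
import numpy as np

def piecewise_linear_smoother(x, target, n_bins=8):
    """Fit a piecewise-linear transform h(x): split only on the feature being
    transformed (here via quantile bins) and fit a small linear model of that
    same feature within each bin."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    models = []
    for b in range(n_bins):
        mask = (x >= edges[b]) & (x < edges[b + 1])
        if mask.sum() >= 2 and np.ptp(x[mask]) > 0:
            slope, intercept = np.polyfit(x[mask], target[mask], deg=1)
        elif mask.any():                       # degenerate bin: fall back to a constant
            slope, intercept = 0.0, float(target[mask].mean())
        else:                                  # empty bin (possible with heavily tied data)
            slope, intercept = 0.0, float(target.mean())
        models.append((slope, intercept))
    slopes = np.array([m[0] for m in models])
    intercepts = np.array([m[1] for m in models])

    def transform(x_new):
        b = np.clip(np.searchsorted(edges, x_new, side="right") - 1, 0, n_bins - 1)
        return slopes[b] * x_new + intercepts[b]   # piecewise-linear h(x)
    return transform
```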

For illustration purposes, synthetic training data was generated using the following target function:

    z = f(x, y) = x + y + sin(2πx) sin(2πy).    (5)

Data was generated by sampling the above function in the region x, y ∈ [0,1] at grid increments of 0.01. This data was then subsampled at increments of 0.1 to create a test set, with the remaining data randomly divided into a training set and a holdout set. Figure 1 illustrates the above target function and the predictions on the test set after one, two, and ten boosting stages. As can be seen in Figure 1, the algorithm is able to model the cross-product interaction expressed in Equation 5, but the convergence of the algorithm is very slow. Even after ten boosting stages, the root mean square error is 0.239, which is quite large given that no noise was added to the training data. Nevertheless, Figure 1 does illustrate the appeal of the Kolmogorov superposition theorem, which is its implication that cross-product interactions can be modeled without explicitly introducing cross-product terms into the model.

Figure 2 illustrates how the initial algorithm is able to accomplish the same solely by making use of the mathematical structure of the superposition equation. Figures 2a and 2b show scatter plots of the test data as viewed along the x and y input features, respectively. Also plotted in Figures 2a and 2b as solid curves are the feature transformations ĥ_x(x) and ĥ_y(y), respectively, that were constructed from the x and y inputs. Figure 2c shows a scatter plot of the test data as viewed along the additive model output ỹ_1, together with the output of the g_1(ỹ_1) function plotted as a solid curve.

As can be seen in Figures 2a and 2b, the first-stage feature transformations ĥ_x(x) and ĥ_y(y) extract only the linear terms in the target function. From the point of view of these transformations, the cross-product relationship appears as heteroskedastic noise. However, as shown in Figure 2c, from the point of view of the additive model output ỹ_1, the cross-product relationship appears as residual systematic error together with lower heteroskedastic noise. This residual systematic error is modeled by the g_1(ỹ_1) transformation of the first boosting stage. The resulting output produces the first approximation of the cross-product interaction shown in Figure 1b. As this example illustrates, the nonlinear transformations g_i in Equation 4b (and in Kolmogorov's theorem) have the effect of modeling cross-product interactions without requiring that explicit cross-product terms be introduced into the models.

Figure 2. The test data as seen from various points in the first boosting stage. (a) Test data and ĥ_x(x) plotted against the x axis. (b) Test data and ĥ_y(y) plotted against the y axis. (c) Test data and g_1(ỹ_1) plotted against the derived ỹ_1 axis.
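For readers who want to reproduce this setup, the following NumPy sketch regenerates the Equation 5 grid data; the exact train/holdout proportions are not stated above, so the 80/20 split here is an assumption of ours:

```python
import numpy as np

grid = np.round(np.arange(0.0, 1.0 + 1e-9, 0.01), 2)            # x, y in [0, 1] at increments of 0.01
xx, yy = np.meshgrid(grid, grid, indexing="ij")
zz = xx + yy + np.sin(2 * np.pi * xx) * np.sin(2 * np.pi * yy)   # Equation 5, no added noise

ii, jj = np.meshgrid(np.arange(grid.size), np.arange(grid.size), indexing="ij")
is_test = ((ii % 10 == 0) & (jj % 10 == 0)).ravel()              # subsample at increments of 0.1

points = np.column_stack([xx.ravel(), yy.ravel(), zz.ravel()])
test, rest = points[is_test], points[~is_test]

rng = np.random.default_rng(0)
rng.shuffle(rest)
split = int(0.8 * len(rest))                                     # 80/20 train/holdout (assumed)
train, holdout = rest[:split], rest[split:]
```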

5. Orthogonalized, feed-forward gradient boosting

A necessary condition that must be satisfied to maximize the rate of convergence of gradient boosting for square-error loss is that the boosting stage outputs must be mutually orthogonal. To see why, let R_j be the output of the j-th boosting stage. Then the current model M_i obtained after i boosting iterations is given by

    M_i = Σ_{j=1}^{i} α_j R_j.

If some of the boosting stage outputs are not mutually orthogonal, then under normal circumstances there will exist coefficients λ_j such that the model M′_i given by

    M′_i = Σ_{j=1}^{i} λ_j α_j R_j

will have a strictly smaller square error on the training data than model M_i. These coefficients can be estimated by performing a linear regression of the boosting stage outputs. Additional boosting iterations would therefore need to be performed on model M_i just to match the square error of M′_i. Hence, the rate of convergence will be suboptimal in this case.

To illustrate, suppose the base learner performs a univariate linear regression using one and only one input, always picking the input that yields the smallest square error. Gradient boosting applied to this base learner produces an iterative algorithm for performing multivariate linear regression that is essentially equivalent to coordinate descent. Suppose further that we have a training set comprising two inputs, x and y, and a target value f(x,y) = x + y with no noise added. If the inputs are orthogonal (i.e., if the dot product between x and y is zero), then only two boosting iterations will be needed to fit the target function to within round-off error. However, if the inputs are not orthogonal, more than two iterations will be needed, and the rate of convergence will decrease as the degree to which they are not orthogonal increases (i.e., as the projection of one input data vector onto the other increases).

In order to improve the rate of convergence, the gradient boosting method needs to be modified so as to increase the degree of orthogonality between boosting stages. One obvious approach would be to apply Gram-Schmidt orthogonalization to the boosting stage outputs. Gram-Schmidt orthogonalization is a technique used in QR decomposition and related linear regression algorithms. Its purpose is to convert simple coordinate descent into an optimal search method for linear least-squares optimization by modifying the coordinate directions of successive regressors. In the case of gradient boosting, the coordinate directions are defined by the boosting stage outputs. However, the usual Gram-Schmidt orthogonalization procedure assumes that all coordinates are specified up front at the start of the procedure. Gradient boosting, on the other hand, constructs coordinates dynamically in a stagewise fashion, so an incremental orthogonalization procedure is needed. Although not as computationally efficient as the traditional Gram-Schmidt procedure, incremental orthogonalization can be readily accomplished simply by replacing the αR term in the gradient boosting algorithm shown in Table 1 with a linear regression of the current boosting stage output R and all previous boosting stage outputs. Adding this step to the procedure produces the orthogonalized gradient boosting method shown in Table 5. In the case of square-error loss, the line search step in Table 5 to find an optimal scalar α is actually not needed because this scalar will equal one by construction. However, the line search is included in Table 5 to indicate how orthogonalized gradient boosting generalizes to arbitrary loss functions.
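A minimal NumPy sketch of the incremental orthogonalization step just described (the function and argument names are ours): regress the current residual on all boosting stage outputs produced so far and add the fitted combination to the model.

```python
import numpy as np

def orthogonalized_update(stage_outputs, y, current_prediction):
    """One orthogonalized boosting update for square-error loss.

    stage_outputs: (n_samples, n_stages) array whose columns are R_1, ..., R_i,
    the last column being the newest base-learner output.
    """
    residual = y - current_prediction
    # coefficients lambda_k such that sum_k lambda_k * R_k best fits the residual of M
    lam, *_ = np.linalg.lstsq(stage_outputs, residual, rcond=None)
    correction = stage_outputs @ lam
    # for square-error loss the line-search scalar alpha equals one by construction
    return current_prediction + correction, lam
```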
When orthogonalized gradient boosting is applied to the univariate linear regression base learner described above, an optimum rate of convergence is achieved and the resulting overall algorithm is closely related to the QR decomposition algorithm for linear regression.

Table 5. Orthogonalized gradient boosting for square-error loss.
    Let the current model M be zero everywhere;
    Repeat until the current model M does not appreciably change:
        At the i-th iteration, use the base learner to construct a model R_i that predicts the residual error of M, ensuring that R_i does not overfit the data;
        Use least-squares linear regression to find coefficients λ_k, k ≤ i, such that Σ_k λ_k R_k best fits the residual error of M;
        Find a value for scalar α that minimizes the loss function (i.e., total square error) for the model M + α Σ_k λ_k R_k;
        Update M ← M + α Σ_k λ_k R_k;

Although the orthogonalized gradient boosting method in Table 5 is expedient, it does not address the underlying source of non-orthogonality between boosting stage outputs, which is the base learner itself. A more refined solution would be to strengthen the base learner so that it produces models whose outputs are already orthogonal with respect to previous boosting stages. To accomplish that, the approach presented here involves first modifying the gradient boosting method shown in Table 1.

8 Table. Fee-forwar graient boosting for square-error loss. Let the current moel M be zero everywhere; Repeat until the current moel M oes not appreciably At the i th iteration, use the base learner to construct a moel R i that preicts the resiual error of M using all previous boosting stage outputs R,, R i- as aitional inputs, while ensuring that R i oes not overfit the ata; Fin a value for scalar α that minimizes the loss (i.e., total square error) for the moel M + αr i ; Upate M M + αr i ; shown in Table so that, at each iteration, all previous boosting stage outputs are mae available to the base learner as aitional inputs. This moification yiels the fee-forwar graient boosting metho shown in Table. The primary motivation for fee-forwar graient boosting is to enable the base learner to perform an implicit Gram-Schmit orthogonalization instea of the explicit orthogonalization performe in Table 5. In particular, by making previous boosting stage outputs available to the base learner, it might be possible to moify the base learner so that it achieves a nonlinear form of Gram-Schmit orthogonalization, in contrast to the linear orthogonalization step in Table 5. In the next section, it will be shown how such a moification can in fact be mae to the aitive-moeling base learner use in the initial algorithm. An aitional but no less important effect of feeforwar graient boosting is that it further strengthens weak base learners by expaning their hypothesis spaces through function composition. If in the first iteration the base learner consiers moels of the form M (x,,x ), then in the secon iteration it will consier moels of the form M 2 (x,,x,m (x,,x )). Unless the base learner s hypothesis space is close uner this type of function composition, fee-forwar graient boosting will have the sie effect of expaning the base learner s hypothesis space without moifying its moe of operation. The strengthening of the base learner prouce by this expansion effect can potentially improve the accuracy of the moels that are constructe. Although fee-forwar graient boosting is motivate by orthogonality consierations, the metho steps efine in Table are not sufficient to guarantee orthogonality for arbitrary base learners, since some base learners might not be able to make full use of the outputs of previous boosting stages in orer to achieve orthogonality. In such cases, the orthogonalize fee-forwar graient boosting metho shown in Table 7 can be employe. Table 7. Orthogonalize fee-forwar graient boosting for square-error loss. Let the current moel M be zero everywhere; Repeat until the current moel M oes not appreciably At the i th iteration, use the base learner to construct a moel R i that preicts the resiual error of M using the previous boosting stage outputs R,, R i- as aitional inputs, while ensuring that R i oes not overfit the ata; Use least-squares linear regression to fin coefficients λ k, k i, such that Σλ k R k best fits the resiual error of M. Fin a value for scalar α that minimizes the loss function (i.e., total square error) for the moel M + α Σλ k R k ; Upate M M + α Σλ k R k ;. The transform regression algorithm In the case of the initial algorithm presente in Section, the base learner itself can be moifie to take full avantage of the outputs of previous boosting stages. In particular, two moifications are mae to the initial algorithm in orer to arrive at the transform regression algorithm. 
6. The transform regression algorithm

In the case of the initial algorithm presented in Section 4, the base learner itself can be modified to take full advantage of the outputs of previous boosting stages. In particular, two modifications are made to the initial algorithm in order to arrive at the transform regression algorithm.

Because feed-forward gradient boosting permits boosting stage outputs to be used as input features to subsequent stages, the first modification is to convert the g_i functions in Equation 4 into h_ik functions by eliminating Equation 4b and by using the outputs of the additive models in Equation 4a as input features to all subsequent gradient boosting stages. This modification is intended mainly to simplify the mathematical form of the resulting models by exploiting the fact that feed-forward gradient boosting explicitly makes boosting stage outputs available to subsequent stages, which enables the h functions to perform dual roles: transform the input features and transform the additive model outputs. The second modification is to introduce multivariate h functions by further allowing the outputs of the additive models in Equation 4a to appear as additional inputs to the h functions in all subsequent stages. This modification is intended to push the orthogonalization of model outputs all the way down to the construction of the h feature transformation functions.

As discussed in Section 4, the ProbE linear regression tree (LRT) algorithm (Natarajan & Pednault, 2002) was used in the initial algorithm to construct the univariate g_i and h_ij functions that appear in Equation 4.

The LRT algorithm, however, is capable of constructing multivariate linear regression models in the leaves of trees, and not just the univariate linear regression models that were needed for the initial algorithm. By allowing previous boosting stage outputs to appear as regressors in the leaves of these trees, the output of each leaf model will then be orthogonal to those previous boosting stage outputs when the leaf conditions are satisfied. Hence, the overall nonlinear transformations defined by the trees will be orthogonal to the previous boosting stage outputs, and any linear combination of the trees will likewise be orthogonal.

With the above changes, the mathematical form of the resulting transform regression models is given by the following system of equations:

    ŷ_1 = Σ_{j=1}^{d} h_1j(x_j),    (6a)

    ŷ_i = Σ_{j=1}^{d} h_ij(x_j, ŷ_1,...,ŷ_{i-1}) + Σ_{k=1}^{i-1} h_ik(ŷ_k, ŷ_1,...,ŷ_{i-1}),  i > 1,    (6b)

    ŷ = Σ_i ŷ_i,    (6c)

where the notation h_ij(x_j, ŷ_1,...,ŷ_{i-1}) is used to indicate that function h_ij is meant to be a nonlinear transformation of x_j and that this transformation is allowed to vary as a function of ŷ_1,...,ŷ_{i-1}. Likewise for the h_ik(ŷ_k, ŷ_1,...,ŷ_{i-1}) functions that appear in Equation 6b. Note that the latter are the counterparts to the g_i functions in Equation 4b. In concrete terms, when applying the ProbE LRT algorithm to construct h_ij(x_j, ŷ_1,...,ŷ_{i-1}) and h_ik(ŷ_k, ŷ_1,...,ŷ_{i-1}), the LRT algorithm is constrained to split only on the features being transformed (i.e., x_j and ŷ_k, respectively), with all other inputs (i.e., ŷ_1,...,ŷ_{i-1}) allowed to appear as regressors in the linear regression models in the leaves of the resulting trees. Of course, the features being transformed can likewise appear as regressors in the leaf models if they are numeric. Equation 6a corresponds to the first boosting stage, while Equation 6b corresponds to all subsequent stages. Equation 6c defines the output of the overall model. The resulting algorithm is summarized in Table 8.

Table 8. The transform regression algorithm.
    Let the current model M be zero everywhere;
    Repeat until the current model M does not appreciably change:
        At the i-th iteration, construct an additive model ŷ_i that predicts the residual error of M using both the raw features x_1,...,x_d and the outputs ŷ_1,...,ŷ_{i-1} of all previous boosting stages as potential input features to ŷ_i, allowing the feature transformations to vary as functions of the previous boosting stage outputs ŷ_1,...,ŷ_{i-1}, and ensuring that ŷ_i does not overfit the data;
        Update M ← M + ŷ_i;
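The sketch below shows, in heavily simplified NumPy form, how one such boosting stage can be assembled: every input column (raw features and previous stage outputs alike) gets a binned transform whose per-bin linear models include the previous stage outputs as regressors, and the transforms are then combined by a final regression. Quantile binning, the fixed bin count, and plain (non-stepwise) least squares are our simplifications, not the ProbE/LRT implementation.

```python
import numpy as np

def transform_regression_stage(X, prev_outputs, residual, n_bins=8):
    """One transform regression boosting stage, along the lines of Table 8 (simplified).

    X: (n, d) raw features; prev_outputs: (n, i-1) previous stage outputs;
    residual: current residual error of the model M. Returns this stage's output.
    """
    columns = [X[:, j] for j in range(X.shape[1])] + \
              [prev_outputs[:, k] for k in range(prev_outputs.shape[1])]
    transforms = []
    for col in columns:
        edges = np.quantile(col, np.linspace(0.0, 1.0, n_bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf
        bins = np.clip(np.searchsorted(edges, col, side="right") - 1, 0, n_bins - 1)
        fitted = np.zeros_like(residual)
        for b in range(n_bins):
            m = bins == b
            if m.sum() > prev_outputs.shape[1] + 2:
                # leaf regression on the split feature plus all previous stage outputs
                A = np.column_stack([np.ones(m.sum()), col[m], prev_outputs[m]])
                beta, *_ = np.linalg.lstsq(A, residual[m], rcond=None)
                fitted[m] = A @ beta
            elif m.any():
                fitted[m] = residual[m].mean()
        transforms.append(fitted)
    H = np.column_stack(transforms)
    lam, *_ = np.linalg.lstsq(H, residual, rcond=None)   # combine the feature transforms
    return H @ lam                                        # this stage's output, y_hat_i
```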
Although Equation 6 departs from the mathematical form of Kolmogorov's superposition equation, the above modifications dramatically improve the rate of convergence of the resulting algorithm. Figure 3 illustrates the increased rate of convergence of the transform regression algorithm compared to the initial algorithm when transform regression is applied to the same data as in Figures 1 and 2. In this experiment, the ProbE linear regression tree (LRT) algorithm was again used, this time exploiting its ability to construct multivariate regression models in the leaves of trees. As with the initial algorithm, one-pass greedy additive modeling was used with stepwise linear regression, with the holdout set used to estimate generalization error in order to avoid overfitting. As shown in Figure 3a, because the g_i functions have been removed, the first stage of transform regression extracts the two linear terms of the target function, but not the cross-product term. As can be seen in Figure 3d, the first boosting stage therefore has a higher approximation error than the first boosting stage of the initial algorithm. However, for all subsequent boosting stages, transform regression outperforms the initial algorithm, as can be seen in Figures 3b-d. As this example demonstrates, by modifying the gradient boosting algorithm and the base learner to achieve orthogonality of gradient boosting stage outputs, a dramatic increase in the rate of convergence can be obtained.

7. Experimental evaluation

Table 9 shows evaluation results that were obtained on eight data sets that were used to compare the performance of the transform regression algorithm to the underlying LRT algorithm. Also shown are results for the first gradient boosting stage of transform regression, and for the stepwise linear regression algorithm that is used both in the leaves of linear regression trees and in the greedy one-pass additive modeling method. The first four data sets are available from the UCI Machine Learning Repository and the UCI KDD Archive. The last four are internal IBM data sets.

Figure 3. An example of the modeling behavior of the transform regression algorithm. (a) Model output after one gradient boosting stage. (b) After two stages. (c) After three stages. (d) RMS errors of successive gradient boosting stages for transform regression and the initial algorithm.

Because all data sets have nonnegative target values, and because all but one (i.e., KDDCup98 TargetD) have 0/1 target values, comparisons were made based on Gini coefficients of cumulative gains charts (Hand, 1997) estimated on holdout test sets. Cumulative gains charts (a.k.a. lift curves) are closely related to ROC curves, except that gains charts have the benefit of being applicable to continuous nonnegative numeric target values in addition to 0/1 categorical values. Gini coefficients are normalized areas under cumulative gains charts, where the normalization produces a value of zero for models that are no better than random guessing, and a value of one for perfect predictors. Gini coefficients are thus closely related to AUC measurements (i.e., the areas under ROC curves). Note, however, that models that are no better than random guessing have an AUC of 0.5 but a Gini coefficient of zero.

Table 9. Gini coefficients for different data sets and algorithms. For each data set, the best coefficient is highlighted in bold, the second best in italics. The columns are DATA SET, TRANS REG, FIRST BOOST STAGE, LIN REG TREES, and STEP LIN REG; the rows are ADULT, COIL, KDD-98 B, KDD-98 D, and the four internal IBM data sets. [The numeric coefficients are not preserved in this transcription.]

As can be seen in Table 9, transform regression produces better models than the underlying LRT algorithm on all data sets but one, and for the one exception the LRT model is only slightly better. Remarkably, the first gradient boosting stage also produces better models than the LRT algorithm on a majority of the data sets. In one case, the first-stage model is also better than the overall transform regression model, which indicates an overfitting problem with the prototype implementation used for these experiments.
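As a reference point for how such numbers are computed, here is a small NumPy sketch of the Gini coefficient of a cumulative gains chart, normalized as described above so that random guessing scores 0 and a perfect ranking scores 1 (the implementation details are ours):

```python
import numpy as np

def gini_coefficient(y_true, y_score):
    """Gini coefficient of a cumulative gains (lift) chart for nonnegative targets."""
    order = np.argsort(-y_score)                        # rank records by predicted score
    gains = np.cumsum(y_true[order]) / y_true.sum()     # cumulative gains of the model
    frac = np.arange(1, len(y_true) + 1) / len(y_true)  # fraction of records examined
    model_area = np.trapz(gains, frac) - 0.5            # area above the random-guessing diagonal
    perfect = np.cumsum(np.sort(y_true)[::-1]) / y_true.sum()
    perfect_area = np.trapz(perfect, frac) - 0.5        # same area for a perfect ranking
    return model_area / perfect_area
```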

8. Computational considerations

In addition to its other properties, transform regression can also be implemented as a computationally efficient, parallelizable algorithm. Such an implementation is achieved by combining the greedy one-pass additive modeling algorithm shown in Table 3 with the ProbE linear regression tree (LRT) algorithm (Natarajan & Pednault, 2002). As per Table 3, each boosting stage is calculated in two phases. The first phase constructs initial estimates of the feature transformation functions h_ij and h_ik that appear in Equations 6a and 6b. The second phase performs a stepwise linear regression on these initial feature transformations in order to select the most predictive transformations and to estimate their scaling coefficients, as per Table 3.

The ProbE LRT technology enables the computations for the first phase to be performed using only a single pass over the data. It also enables the computations to be data-partition parallelized for scalability. The LRT algorithm incorporates a generalized version of the bottom-up merging technique used in the CHAID algorithm (Biggs, de Ville, and Suen, 1991; Kass, 1980). Accordingly, multiway splits are first constructed for each input feature. Next, data is scanned to estimate sufficient statistics for the leaf models in each multiway split. Finally, leaf nodes and their sufficient statistics are merged in a bottom-up pairwise fashion to produce trees for each feature without further accessing the data. For categorical features, the category values define the multiway splits. For numerical features, the feature values are discretized into intervals and these intervals define the multiway splits. Although the CHAID method considers only constant leaf models, the approach can be generalized to include stepwise linear regression models in the leaves (Natarajan & Pednault, 2002). In the case of linear regression, the sufficient statistics are mean vectors and covariance matrices. By calculating sufficient statistics simultaneously for both training data and holdout data, the tree building and tree pruning steps can be performed using only these sufficient statistics without any further data access. Linear regression tree estimates of the h_ij and h_ik feature transformation functions can therefore be calculated using only a single pass over both the training and holdout data at each iteration. In addition, because the technique of merging sufficient statistics can be applied to any disjoint data partitions, the same merging method used during tree building can be used to merge sufficient statistics that are calculated in parallel on disjoint data partitions (Dorneich et al., 2006). This merging capability enables data scans to be readily parallelized.

In the first phase, the stepwise linear regression models that appear in the leaves of the feature transformation trees are relatively small. At each iteration, the maximum number of regressors is equal to the iteration number. In the second phase, no trees are constructed but a large stepwise linear regression is performed instead. In this case, the number of regressors is equal to the number of transformation trees (i.e., the number of input features plus the iteration index minus one). As with the first phase, the mean vectors and covariance matrices that define the sufficient statistics for the linear regression can be calculated using only a single pass over the training and holdout data. The sufficient statistics can likewise be calculated in parallel on disjoint data partitions, with the results then merged using the same merging technique used for tree building.
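The single-pass, data-parallel computation rests on the fact that the sufficient statistics for linear regression can be accumulated per partition and then merged by simple addition. The sketch below uses the cross-product form of those statistics rather than the mean-vector/covariance form mentioned above (the two are interchangeable), and all names are ours:

```python
import numpy as np

def partition_stats(X, y):
    """Sufficient statistics for linear regression computed on one data partition."""
    Xa = np.column_stack([np.ones(len(y)), X])    # add an intercept column
    return {"n": len(y), "XtX": Xa.T @ Xa, "Xty": Xa.T @ y}

def merge_stats(a, b):
    """Merging statistics from disjoint partitions is elementwise addition."""
    return {"n": a["n"] + b["n"], "XtX": a["XtX"] + b["XtX"], "Xty": a["Xty"] + b["Xty"]}

def solve_regression(stats):
    """Least-squares coefficients recovered from merged sufficient statistics alone."""
    beta, *_ = np.linalg.lstsq(stats["XtX"], stats["Xty"], rcond=None)
    return beta

# Two partitions of the same data yield the same coefficients as a single fit on all of it.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
merged = merge_stats(partition_stats(X[:60], y[:60]), partition_stats(X[60:], y[60:]))
print(solve_regression(merged))
```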
The above implementation techniques produce a scalable and efficient algorithm. These techniques have been incorporated into a parallelized version of transform regression that is now available in IBM DB2 Intelligent Miner Modeling, which is IBM's database-embedded data mining product (Dorneich et al., 2006).

9. Conclusions

Although the experimental results presented above are by no means an exhaustive evaluation, the consistency of the results clearly demonstrates the benefits of the global function-fitting approach of transform regression compared to the local fitting approach of the underlying linear regression tree (LRT) algorithm that is employed. Transform regression uses the LRT algorithm to construct a series of global functions that are then combined using linear regression. Although this use of LRT is highly constrained, in many cases it enables better models to be constructed than with the pure local fitting of LRT. In this respect, transform regression successfully combines the global-fitting aspects of learning methods such as neural networks with the nonparametric local-fitting aspects of decision trees.

Transform regression is also computationally efficient. Only two passes over the data are required to construct each boosting stage: one pass to build linear regression trees for all input features to a boosting stage, and another pass to perform the stepwise linear regression that combines the outputs of the resulting trees to form an additive model. The amount of computation that is required per boosting stage is therefore between one and two times the amount of computation needed by the LRT algorithm to build a single level of a conventional linear regression tree when the LRT algorithm is applied outside the transform regression framework.

Another aspect of transform regression is that it demonstrates how Friedman's gradient boosting framework can be enhanced to strengthen the base learner and improve the rate of convergence. One enhancement is to use the outputs of boosting stages as first-class input features to subsequent stages. This modification has the effect of expanding the hypothesis space through function composition of boosting stage models.

Another enhancement is to modify the base learner so that it produces models whose outputs are linearly orthogonal to all previous boosting stage outputs. This orthogonality property improves the efficiency of the gradient descent search performed by the boosting algorithm, thereby increasing the rate of convergence. In the case of transform regression, this second modification involved using boosting stage outputs as additional multivariate inputs to the feature transformation functions h_ij and h_ik. This very same approach can likewise be used in combination with Friedman's tree boosting algorithm by replacing his conventional tree algorithm with the LRT algorithm. It should likewise be possible to extend the approach presented here to other tree-based boosting techniques.

Transform regression, however, is still a greedy hill-climbing algorithm. As such, it can get caught in local minima and at saddle points. In order to avoid local minima and saddle points entirely, additional research is needed to further improve the transform regression algorithm. Several authors (Kůrková, 1991, 1992; Neruda, Štědrý, & Drkošová, 2000; Sprecher, 1996, 1997, 2002) have been investigating ways of overcoming the computational problems of directly applying Kolmogorov's theorem. Given the strength of the results obtained here using the form of the superposition equation alone, research aimed at creating a combined approach could potentially be quite fruitful.

References

Biggs, D., de Ville, B., and Suen, E. (1991). A method of choosing multiway partitions for classification and decision trees. Journal of Applied Statistics, 18(1):49-62.

Dorneich, A., Natarajan, R., Pednault, E., & Tipu, F. (2006). Embedded predictive modeling in a parallel relational database. To appear in Proceedings of the 21st ACM Symposium on Applied Computing, April 2006, Dijon, France.

Friedman, J.H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29(5):1189-1232.

Friedman, J.H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367-378.

Girosi, F. & Poggio, T. (1989). Representation properties of networks: Kolmogorov's theorem is irrelevant. Neural Computation, 1(4):465-469.

Hand, D.J. (1997). Construction and Assessment of Classification Rules. New York: John Wiley and Sons.

Hastie, T. & Tibshirani, R. (1990). Generalized Additive Models. New York: Chapman and Hall.

Hecht-Nielsen, R. (1987). Kolmogorov's mapping neural network existence theorem. Proc. IEEE International Conference on Neural Networks, Vol. 3, pp. 11-14.

Jiang, W. (2001). Is regularization unnecessary for boosting? Proc. 8th International Workshop on AI and Statistics, 57-64. San Mateo, California: Morgan Kaufmann.

Jiang, W. (2002). On weak base hypotheses and their implications for boosting regression and classification. Annals of Statistics, 30(1):51-73.

Kass, G.V. (1980). An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2):119-127.

Kolmogorov, A.N. (1957). On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR, 114(5):953-956. Translated in American Mathematical Society Translations, Series 2, 28:55-59 (1963).

Kůrková, V. (1991). Kolmogorov's theorem is relevant. Neural Computation, 3(4):617-622.

Kůrková, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5(3):501-506.

Lorentz, G.G. (1962). Metric entropy, widths, and superposition of functions. American Mathematical Monthly, 69:469-485.

Mallat, S.G. and Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing, 41(12):3397-3415.

Natarajan, R. & Pednault, E.P.D. (2002).
Segmented regression estimators for massive data sets. Proc. Second SIAM International Conference on Data Mining. Available online.

Neruda, R., Štědrý, A., & Drkošová, J. (2000). Towards feasible learning algorithm based on Kolmogorov theorem. Proc. International Conference on Artificial Intelligence, Vol. II. CSREA Press.

Sprecher, D.A. (1965). On the structure of continuous functions of several variables. Transactions of the American Mathematical Society, 115(3):340-355.

Sprecher, D.A. (1996). A numerical implementation of Kolmogorov's superpositions. Neural Networks, 9(5):765-772.

Sprecher, D.A. (1997). A numerical implementation of Kolmogorov's superpositions II. Neural Networks, 10(3):447-457.

Sprecher, D.A. (2002). Space-filling curves and Kolmogorov superposition-based neural networks. Neural Networks, 15(1):57-67.


More information

Lower Bounds for the Smoothed Number of Pareto optimal Solutions

Lower Bounds for the Smoothed Number of Pareto optimal Solutions Lower Bouns for the Smoothe Number of Pareto optimal Solutions Tobias Brunsch an Heiko Röglin Department of Computer Science, University of Bonn, Germany brunsch@cs.uni-bonn.e, heiko@roeglin.org Abstract.

More information

Lagrangian and Hamiltonian Mechanics

Lagrangian and Hamiltonian Mechanics Lagrangian an Hamiltonian Mechanics.G. Simpson, Ph.. epartment of Physical Sciences an Engineering Prince George s Community College ecember 5, 007 Introuction In this course we have been stuying classical

More information

u!i = a T u = 0. Then S satisfies

u!i = a T u = 0. Then S satisfies Deterministic Conitions for Subspace Ientifiability from Incomplete Sampling Daniel L Pimentel-Alarcón, Nigel Boston, Robert D Nowak University of Wisconsin-Maison Abstract Consier an r-imensional subspace

More information

Linear Regression with Limited Observation

Linear Regression with Limited Observation Ela Hazan Tomer Koren Technion Israel Institute of Technology, Technion City 32000, Haifa, Israel ehazan@ie.technion.ac.il tomerk@cs.technion.ac.il Abstract We consier the most common variants of linear

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

Fast Resampling Weighted v-statistics

Fast Resampling Weighted v-statistics Fast Resampling Weighte v-statistics Chunxiao Zhou Mar O. Hatfiel Clinical Research Center National Institutes of Health Bethesa, MD 20892 chunxiao.zhou@nih.gov Jiseong Par Dept of Math George Mason Univ

More information

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences. S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial

More information

Modeling the effects of polydispersity on the viscosity of noncolloidal hard sphere suspensions. Paul M. Mwasame, Norman J. Wagner, Antony N.

Modeling the effects of polydispersity on the viscosity of noncolloidal hard sphere suspensions. Paul M. Mwasame, Norman J. Wagner, Antony N. Submitte to the Journal of Rheology Moeling the effects of polyispersity on the viscosity of noncolloial har sphere suspensions Paul M. Mwasame, Norman J. Wagner, Antony N. Beris a) epartment of Chemical

More information

IPA Derivatives for Make-to-Stock Production-Inventory Systems With Backorders Under the (R,r) Policy

IPA Derivatives for Make-to-Stock Production-Inventory Systems With Backorders Under the (R,r) Policy IPA Derivatives for Make-to-Stock Prouction-Inventory Systems With Backorers Uner the (Rr) Policy Yihong Fan a Benamin Melame b Yao Zhao c Yorai Wari Abstract This paper aresses Infinitesimal Perturbation

More information

Introduction to Machine Learning

Introduction to Machine Learning How o you estimate p(y x)? Outline Contents Introuction to Machine Learning Logistic Regression Varun Chanola April 9, 207 Generative vs. Discriminative Classifiers 2 Logistic Regression 2 3 Logistic Regression

More information

We G Model Reduction Approaches for Solution of Wave Equations for Multiple Frequencies

We G Model Reduction Approaches for Solution of Wave Equations for Multiple Frequencies We G15 5 Moel Reuction Approaches for Solution of Wave Equations for Multiple Frequencies M.Y. Zaslavsky (Schlumberger-Doll Research Center), R.F. Remis* (Delft University) & V.L. Druskin (Schlumberger-Doll

More information

On the Surprising Behavior of Distance Metrics in High Dimensional Space

On the Surprising Behavior of Distance Metrics in High Dimensional Space On the Surprising Behavior of Distance Metrics in High Dimensional Space Charu C. Aggarwal, Alexaner Hinneburg 2, an Daniel A. Keim 2 IBM T. J. Watson Research Center Yortown Heights, NY 0598, USA. charu@watson.ibm.com

More information

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012

Lecture Introduction. 2 Examples of Measure Concentration. 3 The Johnson-Lindenstrauss Lemma. CS-621 Theory Gems November 28, 2012 CS-6 Theory Gems November 8, 0 Lecture Lecturer: Alesaner Mąry Scribes: Alhussein Fawzi, Dorina Thanou Introuction Toay, we will briefly iscuss an important technique in probability theory measure concentration

More information

Analytic Scaling Formulas for Crossed Laser Acceleration in Vacuum

Analytic Scaling Formulas for Crossed Laser Acceleration in Vacuum October 6, 4 ARDB Note Analytic Scaling Formulas for Crosse Laser Acceleration in Vacuum Robert J. Noble Stanfor Linear Accelerator Center, Stanfor University 575 San Hill Roa, Menlo Park, California 945

More information

Local Linear ICA for Mutual Information Estimation in Feature Selection

Local Linear ICA for Mutual Information Estimation in Feature Selection Local Linear ICA for Mutual Information Estimation in Feature Selection Tian Lan, Deniz Erogmus Department of Biomeical Engineering, OGI, Oregon Health & Science University, Portlan, Oregon, USA E-mail:

More information

A simplified macroscopic urban traffic network model for model-based predictive control

A simplified macroscopic urban traffic network model for model-based predictive control Delft University of Technology Delft Center for Systems an Control Technical report 9-28 A simplifie macroscopic urban traffic network moel for moel-base preictive control S. Lin, B. De Schutter, Y. Xi,

More information

Slide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13)

Slide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13) Slie10 Haykin Chapter 14: Neuroynamics (3r E. Chapter 13) CPSC 636-600 Instructor: Yoonsuck Choe Spring 2012 Neural Networks with Temporal Behavior Inclusion of feeback gives temporal characteristics to

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Some vector algebra and the generalized chain rule Ross Bannister Data Assimilation Research Centre, University of Reading, UK Last updated 10/06/10

Some vector algebra and the generalized chain rule Ross Bannister Data Assimilation Research Centre, University of Reading, UK Last updated 10/06/10 Some vector algebra an the generalize chain rule Ross Bannister Data Assimilation Research Centre University of Reaing UK Last upate 10/06/10 1. Introuction an notation As we shall see in these notes the

More information

Switching Time Optimization in Discretized Hybrid Dynamical Systems

Switching Time Optimization in Discretized Hybrid Dynamical Systems Switching Time Optimization in Discretize Hybri Dynamical Systems Kathrin Flaßkamp, To Murphey, an Sina Ober-Blöbaum Abstract Switching time optimization (STO) arises in systems that have a finite set

More information

Function Spaces. 1 Hilbert Spaces

Function Spaces. 1 Hilbert Spaces Function Spaces A function space is a set of functions F that has some structure. Often a nonparametric regression function or classifier is chosen to lie in some function space, where the assume structure

More information

Final Exam Study Guide and Practice Problems Solutions

Final Exam Study Guide and Practice Problems Solutions Final Exam Stuy Guie an Practice Problems Solutions Note: These problems are just some of the types of problems that might appear on the exam. However, to fully prepare for the exam, in aition to making

More information

Nonparametric Additive Models

Nonparametric Additive Models Nonparametric Aitive Moels Joel L. Horowitz The Institute for Fiscal Stuies Department of Economics, UCL cemmap working paper CWP20/2 Nonparametric Aitive Moels Joel L. Horowitz. INTRODUCTION Much applie

More information

Approximate Constraint Satisfaction Requires Large LP Relaxations

Approximate Constraint Satisfaction Requires Large LP Relaxations Approximate Constraint Satisfaction Requires Large LP Relaxations oah Fleming April 19, 2018 Linear programming is a very powerful tool for attacking optimization problems. Techniques such as the ellipsoi

More information

6 General properties of an autonomous system of two first order ODE

6 General properties of an autonomous system of two first order ODE 6 General properties of an autonomous system of two first orer ODE Here we embark on stuying the autonomous system of two first orer ifferential equations of the form ẋ 1 = f 1 (, x 2 ), ẋ 2 = f 2 (, x

More information

Optimal Variable-Structure Control Tracking of Spacecraft Maneuvers

Optimal Variable-Structure Control Tracking of Spacecraft Maneuvers Optimal Variable-Structure Control racking of Spacecraft Maneuvers John L. Crassiis 1 Srinivas R. Vaali F. Lanis Markley 3 Introuction In recent years, much effort has been evote to the close-loop esign

More information

Proof of SPNs as Mixture of Trees

Proof of SPNs as Mixture of Trees A Proof of SPNs as Mixture of Trees Theorem 1. If T is an inuce SPN from a complete an ecomposable SPN S, then T is a tree that is complete an ecomposable. Proof. Argue by contraiction that T is not a

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

Generalizing Kronecker Graphs in order to Model Searchable Networks

Generalizing Kronecker Graphs in order to Model Searchable Networks Generalizing Kronecker Graphs in orer to Moel Searchable Networks Elizabeth Boine, Babak Hassibi, Aam Wierman California Institute of Technology Pasaena, CA 925 Email: {eaboine, hassibi, aamw}@caltecheu

More information

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties Journal of Machine Learning Research 16 (2015) 1547-1572 Submitte 1/14; Revise 9/14; Publishe 8/15 Flexible High-Dimensional Classification Machines an Their Asymptotic Properties Xingye Qiao Department

More information

Optimized Schwarz Methods with the Yin-Yang Grid for Shallow Water Equations

Optimized Schwarz Methods with the Yin-Yang Grid for Shallow Water Equations Optimize Schwarz Methos with the Yin-Yang Gri for Shallow Water Equations Abessama Qaouri Recherche en prévision numérique, Atmospheric Science an Technology Directorate, Environment Canaa, Dorval, Québec,

More information

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE

THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE Journal of Soun an Vibration (1996) 191(3), 397 414 THE VAN KAMPEN EXPANSION FOR LINKED DUFFING LINEAR OSCILLATORS EXCITED BY COLORED NOISE E. M. WEINSTEIN Galaxy Scientific Corporation, 2500 English Creek

More information

A study on ant colony systems with fuzzy pheromone dispersion

A study on ant colony systems with fuzzy pheromone dispersion A stuy on ant colony systems with fuzzy pheromone ispersion Louis Gacogne LIP6 104, Av. Kenney, 75016 Paris, France gacogne@lip6.fr Sanra Sanri IIIA/CSIC Campus UAB, 08193 Bellaterra, Spain sanri@iiia.csic.es

More information

Harmonic Modelling of Thyristor Bridges using a Simplified Time Domain Method

Harmonic Modelling of Thyristor Bridges using a Simplified Time Domain Method 1 Harmonic Moelling of Thyristor Briges using a Simplifie Time Domain Metho P. W. Lehn, Senior Member IEEE, an G. Ebner Abstract The paper presents time omain methos for harmonic analysis of a 6-pulse

More information

One-dimensional I test and direction vector I test with array references by induction variable

One-dimensional I test and direction vector I test with array references by induction variable Int. J. High Performance Computing an Networking, Vol. 3, No. 4, 2005 219 One-imensional I test an irection vector I test with array references by inuction variable Minyi Guo School of Computer Science

More information

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y

ensembles When working with density operators, we can use this connection to define a generalized Bloch vector: v x Tr x, v y Tr y Ph195a lecture notes, 1/3/01 Density operators for spin- 1 ensembles So far in our iscussion of spin- 1 systems, we have restricte our attention to the case of pure states an Hamiltonian evolution. Toay

More information

Quantum Mechanics in Three Dimensions

Quantum Mechanics in Three Dimensions Physics 342 Lecture 20 Quantum Mechanics in Three Dimensions Lecture 20 Physics 342 Quantum Mechanics I Monay, March 24th, 2008 We begin our spherical solutions with the simplest possible case zero potential.

More information

Perturbation Analysis and Optimization of Stochastic Flow Networks

Perturbation Analysis and Optimization of Stochastic Flow Networks IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. XX, NO. Y, MMM 2004 1 Perturbation Analysis an Optimization of Stochastic Flow Networks Gang Sun, Christos G. Cassanras, Yorai Wari, Christos G. Panayiotou,

More information

Situation awareness of power system based on static voltage security region

Situation awareness of power system based on static voltage security region The 6th International Conference on Renewable Power Generation (RPG) 19 20 October 2017 Situation awareness of power system base on static voltage security region Fei Xiao, Zi-Qing Jiang, Qian Ai, Ran

More information

Introduction to Markov Processes

Introduction to Markov Processes Introuction to Markov Processes Connexions moule m44014 Zzis law Gustav) Meglicki, Jr Office of the VP for Information Technology Iniana University RCS: Section-2.tex,v 1.24 2012/12/21 18:03:08 gustav

More information

3.2 Shot peening - modeling 3 PROCEEDINGS

3.2 Shot peening - modeling 3 PROCEEDINGS 3.2 Shot peening - moeling 3 PROCEEDINGS Computer assiste coverage simulation François-Xavier Abaie a, b a FROHN, Germany, fx.abaie@frohn.com. b PEENING ACCESSORIES, Switzerlan, info@peening.ch Keywors:

More information

arxiv: v1 [cs.ds] 31 May 2017

arxiv: v1 [cs.ds] 31 May 2017 Succinct Partial Sums an Fenwick Trees Philip Bille, Aners Roy Christiansen, Nicola Prezza, an Freerik Rye Skjoljensen arxiv:1705.10987v1 [cs.ds] 31 May 2017 Technical University of Denmark, DTU Compute,

More information

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control

19 Eigenvalues, Eigenvectors, Ordinary Differential Equations, and Control 19 Eigenvalues, Eigenvectors, Orinary Differential Equations, an Control This section introuces eigenvalues an eigenvectors of a matrix, an iscusses the role of the eigenvalues in etermining the behavior

More information

Examining Geometric Integration for Propagating Orbit Trajectories with Non-Conservative Forcing

Examining Geometric Integration for Propagating Orbit Trajectories with Non-Conservative Forcing Examining Geometric Integration for Propagating Orbit Trajectories with Non-Conservative Forcing Course Project for CDS 05 - Geometric Mechanics John M. Carson III California Institute of Technology June

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

Constraint Reformulation and a Lagrangian Relaxation based Solution Algorithm for a Least Expected Time Path Problem Abstract 1.

Constraint Reformulation and a Lagrangian Relaxation based Solution Algorithm for a Least Expected Time Path Problem Abstract 1. Constraint Reformulation an a Lagrangian Relaation base Solution Algorithm for a Least Epecte Time Path Problem Liing Yang State Key Laboratory of Rail Traffic Control an Safety, Being Jiaotong University,

More information

Error Floors in LDPC Codes: Fast Simulation, Bounds and Hardware Emulation

Error Floors in LDPC Codes: Fast Simulation, Bounds and Hardware Emulation Error Floors in LDPC Coes: Fast Simulation, Bouns an Harware Emulation Pamela Lee, Lara Dolecek, Zhengya Zhang, Venkat Anantharam, Borivoje Nikolic, an Martin J. Wainwright EECS Department University of

More information

Non-Linear Bayesian CBRN Source Term Estimation

Non-Linear Bayesian CBRN Source Term Estimation Non-Linear Bayesian CBRN Source Term Estimation Peter Robins Hazar Assessment, Simulation an Preiction Group Dstl Porton Down, UK. probins@stl.gov.uk Paul Thomas Hazar Assessment, Simulation an Preiction

More information

KNN Particle Filters for Dynamic Hybrid Bayesian Networks

KNN Particle Filters for Dynamic Hybrid Bayesian Networks KNN Particle Filters for Dynamic Hybri Bayesian Networs H. D. Chen an K. C. Chang Dept. of Systems Engineering an Operations Research George Mason University MS 4A6, 4400 University Dr. Fairfax, VA 22030

More information

The Press-Schechter mass function

The Press-Schechter mass function The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for

More information

Optimal Signal Detection for False Track Discrimination

Optimal Signal Detection for False Track Discrimination Optimal Signal Detection for False Track Discrimination Thomas Hanselmann Darko Mušicki Dept. of Electrical an Electronic Eng. Dept. of Electrical an Electronic Eng. The University of Melbourne The University

More information

Conservation Laws. Chapter Conservation of Energy

Conservation Laws. Chapter Conservation of Energy 20 Chapter 3 Conservation Laws In orer to check the physical consistency of the above set of equations governing Maxwell-Lorentz electroynamics [(2.10) an (2.12) or (1.65) an (1.68)], we examine the action

More information

arxiv:hep-th/ v1 3 Feb 1993

arxiv:hep-th/ v1 3 Feb 1993 NBI-HE-9-89 PAR LPTHE 9-49 FTUAM 9-44 November 99 Matrix moel calculations beyon the spherical limit arxiv:hep-th/93004v 3 Feb 993 J. Ambjørn The Niels Bohr Institute Blegamsvej 7, DK-00 Copenhagen Ø,

More information

Cost-based Heuristics and Node Re-Expansions Across the Phase Transition

Cost-based Heuristics and Node Re-Expansions Across the Phase Transition Cost-base Heuristics an Noe Re-Expansions Across the Phase Transition Elan Cohen an J. Christopher Beck Department of Mechanical & Inustrial Engineering University of Toronto Toronto, Canaa {ecohen, jcb}@mie.utoronto.ca

More information

Problems Governed by PDE. Shlomo Ta'asan. Carnegie Mellon University. and. Abstract

Problems Governed by PDE. Shlomo Ta'asan. Carnegie Mellon University. and. Abstract Pseuo-Time Methos for Constraine Optimization Problems Governe by PDE Shlomo Ta'asan Carnegie Mellon University an Institute for Computer Applications in Science an Engineering Abstract In this paper we

More information

State observers and recursive filters in classical feedback control theory

State observers and recursive filters in classical feedback control theory State observers an recursive filters in classical feeback control theory State-feeback control example: secon-orer system Consier the riven secon-orer system q q q u x q x q x x x x Here u coul represent

More information

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions Working Paper 2013:5 Department of Statistics Computing Exact Confience Coefficients of Simultaneous Confience Intervals for Multinomial Proportions an their Functions Shaobo Jin Working Paper 2013:5

More information

THE ACCURATE ELEMENT METHOD: A NEW PARADIGM FOR NUMERICAL SOLUTION OF ORDINARY DIFFERENTIAL EQUATIONS

THE ACCURATE ELEMENT METHOD: A NEW PARADIGM FOR NUMERICAL SOLUTION OF ORDINARY DIFFERENTIAL EQUATIONS THE PUBISHING HOUSE PROCEEDINGS O THE ROMANIAN ACADEMY, Series A, O THE ROMANIAN ACADEMY Volume, Number /, pp. 6 THE ACCURATE EEMENT METHOD: A NEW PARADIGM OR NUMERICA SOUTION O ORDINARY DIERENTIA EQUATIONS

More information

Transmission Line Matrix (TLM) network analogues of reversible trapping processes Part B: scaling and consistency

Transmission Line Matrix (TLM) network analogues of reversible trapping processes Part B: scaling and consistency Transmission Line Matrix (TLM network analogues of reversible trapping processes Part B: scaling an consistency Donar e Cogan * ANC Eucation, 308-310.A. De Mel Mawatha, Colombo 3, Sri Lanka * onarecogan@gmail.com

More information

Adjoint Transient Sensitivity Analysis in Circuit Simulation

Adjoint Transient Sensitivity Analysis in Circuit Simulation Ajoint Transient Sensitivity Analysis in Circuit Simulation Z. Ilievski 1, H. Xu 1, A. Verhoeven 1, E.J.W. ter Maten 1,2, W.H.A. Schilers 1,2 an R.M.M. Mattheij 1 1 Technische Universiteit Einhoven; e-mail:

More information

Expected Value of Partial Perfect Information

Expected Value of Partial Perfect Information Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo

More information

23 Implicit differentiation

23 Implicit differentiation 23 Implicit ifferentiation 23.1 Statement The equation y = x 2 + 3x + 1 expresses a relationship between the quantities x an y. If a value of x is given, then a corresponing value of y is etermine. For

More information

Lecture 5. Symmetric Shearer s Lemma

Lecture 5. Symmetric Shearer s Lemma Stanfor University Spring 208 Math 233: Non-constructive methos in combinatorics Instructor: Jan Vonrák Lecture ate: January 23, 208 Original scribe: Erik Bates Lecture 5 Symmetric Shearer s Lemma Here

More information

The Role of Models in Model-Assisted and Model- Dependent Estimation for Domains and Small Areas

The Role of Models in Model-Assisted and Model- Dependent Estimation for Domains and Small Areas The Role of Moels in Moel-Assiste an Moel- Depenent Estimation for Domains an Small Areas Risto Lehtonen University of Helsini Mio Myrsylä University of Pennsylvania Carl-Eri Särnal University of Montreal

More information

Optimal CDMA Signatures: A Finite-Step Approach

Optimal CDMA Signatures: A Finite-Step Approach Optimal CDMA Signatures: A Finite-Step Approach Joel A. Tropp Inst. for Comp. Engr. an Sci. (ICES) 1 University Station C000 Austin, TX 7871 jtropp@ices.utexas.eu Inerjit. S. Dhillon Dept. of Comp. Sci.

More information

Hyperbolic Moment Equations Using Quadrature-Based Projection Methods

Hyperbolic Moment Equations Using Quadrature-Based Projection Methods Hyperbolic Moment Equations Using Quarature-Base Projection Methos J. Koellermeier an M. Torrilhon Department of Mathematics, RWTH Aachen University, Aachen, Germany Abstract. Kinetic equations like the

More information