A Self-Organizing Model for Logical Regression

Jerry Farlow
University of Maine
(1900 words)

Contact: Jerry Farlow, Dept. of Mathematics, University of Maine, Orono, ME 04469
Tel: (207) 866-3540
Email: farlow@math.umaine.edu
Abstract: Logic regression, as described by Ruczinski, Kooperberg, and LeBlanc (2003), is a multivariable regression methodology for predicting a Boolean variable Y from logical relationships among a collection of Boolean predictor variables X_1, X_2, ..., X_m. More specifically, one seeks a regression model of the form

    g(E[Y]) = b_0 + b_1 L_1 + b_2 L_2 + ... + b_k L_k    (1)

where the coefficients b_0, b_1, ..., b_k and the logical expressions L_j, j = 1, ..., k, are to be determined. The expressions L_j are logical relationships (Boolean functions having values 0 or 1) among the predictor variables, such as "X_1 and X_2 are true but not X_5", or "X_3, X_5, and X_7 are true but not X_1 or X_2", or "if X_1 and X_2 are true then X_5 is true", where true is taken as 1 and false is taken as 0. A major problem in finding the best model is the movement from one logical expression to another in an effort to find a path to optimality. The authors investigate the use of a greedy algorithm as well as the simulated annealing algorithm in their search for optimality. In this paper we develop a strategy, based on the self-organizing method of Ivakhnenko, to first find a collection of logical relations L_k for predicting the dependent variable Y, and then use ordinary linear regression to find a regression equation of the form (1), and possibly a more general model by adding continuous predictor variables to the mix. An example is presented to demonstrate the potential use of this method.

Key Words: Logical regression, GMDH algorithm
1. Introduction

Our goal is to use a variation of the self-organizing GMDH algorithm to find a collection of logical relations

    Y = L_k(X_1, X_2, ..., X_m),  k = 1, 2, ...    (2)

for predicting the Boolean variable Y from the Boolean variables X_1, X_2, ..., X_m. Although the procedure as described finds the best m logical relations, one most likely is interested only in the best one or two. Inasmuch as the GMDH algorithm is not known to most people, we outline the basic method so that the adaptation to logical regression is better appreciated. For more information on the GMDH algorithm the reader can consult the book Self-Organizing Methods in Modeling by Farlow (1984), or the more complete and recent book Self-Organizing Data Mining by Johann Mueller and Frank Lemke, which can be purchased and downloaded online at http://www.gmdh.net/.

2. The Group Method of Data Handling (GMDH) Algorithm

One might say that the GMDH algorithm builds a mathematical model similar to the way biological organisms are created through evolution. That is, starting with a few basic primeval forms (i.e., equations), one grows a new generation of more complex offspring (equations), then allows a survival-of-the-fittest principle to determine which offspring survive and which do not. The idea is that each new generation of offspring (equations) is better suited to model the real world than the previous ones. Continuing this process for more generations, one finds a collection of models that hopefully describes the problem at hand. The process is stopped once the model begins to overfit the real world, thus stopping when the model reaches optimal complexity. In 1966 the Ukrainian cyberneticist A. G. Ivakhnenko, discouraged by the fact that many mathematical models require knowledge of the real world that is difficult or impossible to obtain, devised a heuristic self-organizing method called the Group Method of Data Handling algorithm. The GMDH algorithm can be broken into a few distinct steps.
Step 1 (constructing new variables z_1, z_2, ..., z_C(m,2))

The algorithm begins with regression-type data

    y_i, x_i1, x_i2, ..., x_im,  i = 1, 2, ..., n

where y is the dependent variable to be predicted and x_1, x_2, ..., x_m are predictor variables. The n observations required for the algorithm are subdivided into two groups: one group of n_t observations, called the training observations (from which the model is built), and the remaining n − n_t observations (which determine when the model is optimal), called the checking observations. This is a common cross-validation strategy for determining when models are optimal: building the model from one set of observations (the training set) and checking the model against independent observations (the checking set). See Figure 1.

Input Data for the GMDH Algorithm
Figure 1

The algorithm begins by finding the least-squares regression polynomial of the form

    y = A + B x_i + C x_j + D x_i^2 + E x_j^2 + F x_i x_j    (3)

for each of the C(m,2) = m(m − 1)/2 pairs of distinct independent variables x_i, x_j, using the n_t observations in the training set. These m(m − 1)/2 regression surfaces are illustrated in Figure 2.
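As a concrete illustration of Step 1, the pairwise least-squares fit of Eq. (3) can be sketched as follows. This is a minimal sketch in Python/NumPy; the function names and the use of np.linalg.lstsq are my choices for illustration, not part of the original algorithm description:

```python
import numpy as np
from itertools import combinations

def fit_pair_polynomials(X, y):
    """Fit the quadratic surface of Eq. (3),
    y = A + B*xi + C*xj + D*xi^2 + E*xj^2 + F*xi*xj,
    by least squares for every pair (i, j) of predictor columns.
    Returns a dict mapping (i, j) -> the six fitted coefficients."""
    coeffs = {}
    for i, j in combinations(range(X.shape[1]), 2):
        xi, xj = X[:, i], X[:, j]
        # Design matrix whose columns are the six terms of Eq. (3).
        M = np.column_stack([np.ones_like(xi), xi, xj, xi**2, xj**2, xi * xj])
        c, *_ = np.linalg.lstsq(M, y, rcond=None)
        coeffs[(i, j)] = c
    return coeffs

def evaluate_pair(c, xi, xj):
    """Evaluate one fitted surface; these values fill a column of Z (Figure 3)."""
    return c[0] + c[1]*xi + c[2]*xj + c[3]*xi**2 + c[4]*xj**2 + c[5]*xi*xj
```

In practice one would call fit_pair_polynomials on the training rows only, then use evaluate_pair on all n rows to build the columns of Z described in Figure 3.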
Computed Quadratic Regression Surfaces
Figure 2

One now evaluates each of the C(m,2) regression polynomials at all n data points and stores these values (the new generation of variables) in the columns of a new n × C(m,2) array, say Z. The evaluation of the first regression polynomial and the storage of its n values in the first column of Z is illustrated in Figure 3.

Evaluating the C(m,2) Quadratic Regression Polynomials
Figure 3
The object is to keep only the best of these new columns, and this is where the checking set comes into play.

Step 2 (screening out the least effective variables)

This step replaces the original variables (the columns of X) by those columns of Z that best predict y, based on the training set observations. This is done by computing for each column of Z some measure of association, say the root mean square r_j given by

    r_j = sqrt[ Σ_{i=1}^{n_t} (y_i − z_ij)^2 / Σ_{i=1}^{n_t} y_i^2 ],  j = 1, 2, ..., C(m,2)    (4)

then selecting those columns of Z that satisfy r_j < R, where R is some prescribed number. The number of columns of Z that replace the columns of X may be larger or smaller than the number of columns of X, although often one simply chooses m columns of Z to replace the m columns of X, thus keeping the number of predictor variables constant.

Step 3 (test for optimality)

We now cross-validate the model by computing the goodness of fit of the new variables summed over the checking set. That is, we compute

    R_j = sqrt[ Σ_{i=n_t+1}^{n} (y_i − z_ij)^2 / Σ_{i=n_t+1}^{n} y_i^2 ],  j = 1, 2, ..., C(m,2)

as in Step 2. We find the smallest of the root mean squares R_j and call it RMIN, and at each generation (iteration) this value is plotted as shown in Figure 4.
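Steps 2 and 3 can be sketched together: rank the columns of Z by the training-set root mean square of Eq. (4), then compute the analogous checking-set quantity to obtain RMIN. A sketch under the assumption that the training rows come first in the arrays; the function name and the "keep the best k" selection rule (rather than a threshold R) are illustrative choices:

```python
import numpy as np

def screen_columns(Z, y, n_t, keep):
    """Step 2: rank the columns of Z by the training-set root mean square
    r_j of Eq. (4) and keep the `keep` best.  Step 3: compute the analogous
    checking-set quantity R_j and return its smallest value, RMIN.
    Rows 0..n_t-1 of Z and y are assumed to be the training set."""
    # Training-set criterion r_j of Eq. (4).
    r = np.sqrt(((y[:n_t, None] - Z[:n_t]) ** 2).sum(axis=0) / (y[:n_t] ** 2).sum())
    best = np.argsort(r)[:keep]          # columns with the smallest r_j
    # Checking-set criterion R_j; its minimum (RMIN) is plotted per generation.
    R = np.sqrt(((y[n_t:, None] - Z[n_t:]) ** 2).sum(axis=0) / (y[n_t:] ** 2).sum())
    return best, R[best].min()
```

Iterating this screen and watching RMIN fall and then rise gives the stopping rule pictured in Figure 4.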
Determining the Optimal Polynomial
Figure 4

The values of RMIN will decrease for a few iterations (maybe 3–5 iterations) but then start to increase when the model begins to overfit the observations on which it was built. Hence, one stops the procedure when the RMIN curve reaches its minimum and selects the column of Z having the smallest value of R_j as the best predictor. When the algorithm stops, the quadratic regression polynomials found at each generation have been stored, and hence by composition one can form a high-order regression polynomial of the form

    y = a + Σ_{i=1}^{m} b_i x_i + Σ_{i=1}^{m} Σ_{j=1}^{m} c_ij x_i x_j + Σ_{i=1}^{m} Σ_{j=1}^{m} Σ_{k=1}^{m} d_ijk x_i x_j x_k + ...    (5)

known as the Ivakhnenko polynomial, that best predicts Y from X. At each iteration the degree of the Ivakhnenko polynomial doubles, and for a p-th order regression polynomial in m variables the number of terms in the polynomial will be (m + 1)(m + 2) ··· (m + p)/p!. If one started with m = 10 input variables and the algorithm went through 4 generations, the Ivakhnenko polynomial would be of degree 2^4 = 16 and would contain terms such as x_1 x_3^2 x_7.

Step 4 (applying the results of the GMDH algorithm)
One doesn't actually compute the coefficients in the Ivakhnenko polynomial, but rather saves the regression coefficients A, B, C, D, E, F at each generation. Hence, to evaluate the Ivakhnenko polynomial and use the model as a predictor of Y from new observations, one simply carries out repeated compositions of these quadratic expressions. Figure 5 illustrates this process.

Evaluation of the Ivakhnenko Polynomial
Figure 5

3. Applying GMDH to Logical Regression

We now use the ideas of the GMDH algorithm to find logical expressions among the Boolean variables X_1, X_2, ..., X_m that best predict Y. Starting with n observations of Y and X_1, X_2, ..., X_m, we subdivide the observations into n_t training observations and n_c = n − n_t checking observations. We then use the observations in the training set to determine, for each of the C(m,2) pairs of predictor variables, how well each of the eight binary functions

    X_i ∧ X_j,  X_i ∧ ¬X_j,  ¬X_i ∧ X_j,  ¬X_i ∧ ¬X_j,  X_i ∨ X_j,  X_i ∨ ¬X_j,  ¬X_i ∨ X_j,  ¬X_i ∨ ¬X_j    (6)
predicts Y. We do this by assigning a 1 to an observation if Y = f(X_i, X_j), where f(X_i, X_j) is one of the eight binary functions. Carrying out this operation for each of the n observations, for each of the eight binary functions and each pair of predictor variables, yields an n × 8C(m,2) matrix of 0's and 1's, which we call Z. Since the 1's in the columns of Z represent correct predictions of Y for a given logical function and pair of predictor variables, we sum the columns of Z over the training set, rank them in descending order, and select the m columns with the largest sums. These columns of Z represent those logical relations between a given two variables that best predict Y. Typical examples might be X_3 ∧ X_7 or X_5 ∨ X_7.

We now replace the original data X by the m best columns of Z. This gives us a new data set consisting of evaluations of the best logical relationships of the original variables, which hence should act as better predictors of Y than the original observations. We then repeat this process again and again, each time finding new logical relations of the previous variables, which in turn are logical relations of earlier variables. Before starting each new iteration, however, we check the best predicted values of Y (the first column of the matrix Z in the last n_c rows) against the n_c observations of Y in the checking set to determine the goodness of fit. When the percentage of correct predictions reaches a maximum, the process is stopped. At this time we have logical expressions L_1, L_2, ..., L_k for estimating Y, ordered from best to worst. We then continue the process by finding the linear regression equation

    g(E[Y]) = b_0 + b_1 L_1 + b_2 L_2 + ... + b_k L_k + a_1 W_1 + a_2 W_2 + ... + a_p W_p    (7)

where the variables W_1, W_2, ..., W_p are continuous variables we can (possibly) add to the Boolean variables L_1, L_2, ..., L_k. We then find the coefficients b_j, a_j by the usual linear regression, using the original data Y, X along with the data for the continuous variables W_1, W_2, ..., W_p.
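One generation of this logical screening might look as follows. A sketch under two labeled assumptions: the eight binary functions are taken to be the conjunctions and disjunctions of X_i, X_j and their negations, as reconstructed in (6), and the function returns the evaluations of the winning relations (the data for the next generation) rather than the 0/1 agreement columns themselves. All names are hypothetical:

```python
import numpy as np
from itertools import combinations

# Eight candidate binary functions of a pair: conjunctions and
# disjunctions of X_i, X_j and their negations (assumed form of (6)).
FUNCS = [
    lambda a, b: a & b,   lambda a, b: a & ~b,
    lambda a, b: ~a & b,  lambda a, b: ~a & ~b,
    lambda a, b: a | b,   lambda a, b: a | ~b,
    lambda a, b: ~a | b,  lambda a, b: ~a | ~b,
]

def logic_generation(X, y, n_t, keep):
    """One generation of the logical-GMDH step.  X, y are boolean arrays
    whose first n_t rows form the training set.  For every pair of columns
    and every function in FUNCS, count correct predictions of y on the
    training rows (the column sums of Z), then return the evaluations of
    the `keep` best relations along with their (i, j, function) labels."""
    preds, scores, labels = [], [], []
    for i, j in combinations(range(X.shape[1]), 2):
        for k, f in enumerate(FUNCS):
            p = f(X[:, i], X[:, j])
            preds.append(p)
            scores.append(int((p[:n_t] == y[:n_t]).sum()))  # column sum of Z
            labels.append((i, j, k))
    best = np.argsort(scores)[::-1][:keep]   # largest training-set sums first
    newX = np.column_stack([preds[b] for b in best])
    return newX, [labels[b] for b in best]
```

Feeding newX back into logic_generation produces relations of relations, exactly the composition idea of the quadratic GMDH, while the checking rows are held out to decide when to stop.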
4. Simulation of the Process

We generated ten data sets, each with 250 observations and ten predictor variables. The predictor variables X_1, X_2, ..., X_10 were independent Boolean variables (p = 0.5), and the dependent variable Y was determined by

    Y = (A ∧ B) ∨ (C ∧ D)    with probability q
    Y = Bernoulli(0.5)        with probability 1 − q    (8)

where 0 ≤ q ≤ 1 was chosen (generally about 0.7) and A, B, C, D were any four of the variables X_1, X_2, ..., X_10, possibly repeated, or their negations.

Example: We generated 250 observations of 10 independent variables, each with a Bernoulli distribution (p = 0.5), and the dependent variable Y generated by

    Y = (X_1 ∧ X_2) ∨ (X_3 ∧ X_4)    with probability q
    Y = Bernoulli(0.5)                with probability 1 − q

Input:
input number of independent variables: 10
input number of observations: 250
input number of observations in the training set: 150
input p = Bernoulli parameter of the independent variables (maybe p = .5): .5
input q = probability of picking Y as a logical relation of the independent variables: .9
input the first of the 4 variables to be used: 1
input the second of the 4 variables to be used: 2
input the third of the 4 variables to be used: 3
input the fourth of the 4 variables to be used: 4
input number of iterations: 5

Output:
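The data-generating mechanism of Eq. (8), specialized to the example's relation (X_1 ∧ X_2) ∨ (X_3 ∧ X_4), could be simulated as below. The function name and parameter defaults are illustrative (they follow the example's inputs); zero-based column indices stand in for the paper's X_1, ..., X_4:

```python
import numpy as np

def simulate(n=250, m=10, q=0.9, seed=0):
    """Generate one data set per Eq. (8): Y equals the logical relation
    (X_1 AND X_2) OR (X_3 AND X_4) with probability q, and an independent
    Bernoulli(0.5) coin flip with probability 1 - q."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, m)) < 0.5                       # Bernoulli(0.5) predictors
    signal = (X[:, 0] & X[:, 1]) | (X[:, 2] & X[:, 3]) # the planted relation
    coin = rng.random(n) < 0.5                         # pure-noise alternative
    use_signal = rng.random(n) < q
    y = np.where(use_signal, signal, coin)
    return X, y
```

With q = 0.9 the planted relation agrees with Y about 95% of the time (0.9 + 0.1 × 0.5), which is the order of magnitude of the percentages reported in the example output.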
Logical Regression
number of variables = 10
number of observations = 250
number of observations in the training set = 150
number of observations in the checking set = 100

Iteration 1 (best 3 predictors)
X_3 ∧ X_4 predicts the dependent variable in the training set 76% of the time
… predicts the dependent variable in the training set 75% of the time
X_1 ∧ X_2 predicts the dependent variable in the training set 68% of the time
best predictor: X_3 ∧ X_4 predicts the checking-set dependent variable 75% of the time

Iteration 2 (best 3 predictors)
(X_1 ∧ X_2) ∨ (X_3 ∧ X_4) predicts the dependent variable in the training set 95% of the time
… predicts the dependent variable in the training set 95% of the time
… predicts the dependent variable in the training set 93% of the time
best predictor: (X_1 ∧ X_2) ∨ (X_3 ∧ X_4) predicts the checking-set Y 93% of the time

coefficients in the linear combination of logical functions (the first one is a constant term):
0.034356 0.7950 0.10 0.019969 0.0080103 -0.077889 -0.078816 0.0030564 0.058093 0.0577 -0.03567

stopped after 2 iterations

References:
Farlow, S. J. (1984), Self-Organizing Methods in Modeling: GMDH Type Algorithms, Marcel Dekker.
Ruczinski, I., Kooperberg, C., and LeBlanc, M. (2003), "Logic Regression," Comp. Graph. Statist., 12.