17. Introduction to Tree and Neural Network Regression

As we have repeated often throughout this book, the classical multiple regression model is clearly wrong in many ways. One way that the model is wrong is in the assumption of a normal distribution. Instead, the distribution of Y | X = x is correctly given as p(y | x), where p(y | x) can be any distribution (recall that x = (x1, x2, ..., xk)). Another way that the classical model is wrong is in the assumption that the conditional mean function is given by

E(Y | X1 = x1, X2 = x2, ..., Xk = xk) = β0 + β1x1 + β2x2 + ... + βkxk.

Instead, the conditional mean function (assuming the mean is finite) is correctly given as

E(Y | X1 = x1, X2 = x2, ..., Xk = xk) = f(x1, x2, ..., xk),

where f(x1, x2, ..., xk) can have any shape whatsoever in (k + 1)-dimensional space, from curved hyperplanes, to functions with bumps, to discontinuous functions, depending upon the nature of the process that you are studying. As we noted in Chapter 1, Section 1.7, even the simple linear model f(x1) = β0 + β1x1 is usually wrong: we gave a logical argument that there must be some curvature as long as the X1 variable has at least three levels, and as long as there is dependence between Y and X1. This argument becomes even more relevant in multidimensional space, where the curvature can take many forms, including twists, turns, and bumps.

The goal of neural network and tree regression is to estimate f(x1, x2, ..., xk) as a general, nonlinear (or non-planar) function. In particular, these methods allow, and can easily find, strong interactions between various combinations of the X variables, which can be very tricky to tease out using the classical regression model with interaction terms. By allowing greater generality and flexibility than the restrictive classical (planar) model f(x1, x2, ..., xk) = β0 + β1x1 + β2x2 + ... + βkxk, these models can, in some cases, also give you predicted values Ŷ that tend to be closer to future Y values than those from classical linear models.

Since these methods attempt to find unusual features (twists, turns, and bumps) of the function f(x1, x2, ..., xk) in high-dimensional space, they typically require large amounts of data to estimate such features. With the data revolution, such large data sets are routinely available. Thus, it has become nearly a default position in data science to analyze large data sets using tree and neural network regressions rather than the classical regression model.

Tree Regression

Not only does tree regression allow complex functional relationships, it also provides output that is very easy to interpret, even easier than that of the classical linear model. Further, the output of tree regression models can provide a clear recipe for action, e.g., in the determination of profiles (particular combinations of X values) that are associated with fraudulent activity.

To start, let us show an example of what tree regression can do for you. We will use a data set from a survey of n = 1,020 individuals who were asked how they pronounce "data": either as "day-tuh" or "daa-tuh." The raw data show 653/1020 = 64.0% "day-tuh" and 367/1020 = 36.0%

"daa-tuh." As in all regression, we would like to know how this distribution (64.0% "day-tuh," 36.0% "daa-tuh") changes for particular values of the X variables. In addition to the "day-tuh"/"daa-tuh" pronunciation, the survey data also contain demographic information such as gender, age, region of the U.S., and other variables.

You can use the rpart function to construct tree regression estimates; rpart is short for "recursive partitioning," which describes the algorithm that is used. Code to fit a tree model is as follows:

pron = read.csv("...")   # the data URL is truncated in the source
attach(pron)
Y = ifelse(Q4 == 1, 1, 0)
library(rpart)
ft = rpart(Y ~ Q1 + Q2 + Q3 + Q5 + Q6 + Q7 + Q8 + Q9 + Q11 + Q12 + Q13 + Age + Gender)
plot(ft)
text(ft)

The results are shown in Figure 17.1.

Figure 17.1. Results of tree regression to predict pronunciation of "data." Numbers below the nodes are proportions of "day-tuh" pronunciation.
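If the plotted tree is hard to read, you can also print the fitted object; rpart's standard print method lists the same splits in text form. A minimal sketch, assuming the code above has been run:

print(ft)   # each row shows a split rule, n, the deviance, and the node mean of Y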

To interpret Figure 17.1, note the following:

- There is an automatic variable-selection algorithm inside the software, and it picked only the variables Q7, Q2, and Age as predictors of pronunciation. The variable Q7 is colleague pronunciation, with Q7 = 1 indicating that the colleague pronounces "day-tuh," and Q7 = 2 indicating that the colleague pronounces "daa-tuh."

- The first split is on Q7. The left side of the split corresponds to truth of Q7 ≥ 1.5, which means that Q7 = 2, or colleague pronounces "daa-tuh." The right side corresponds to falseness of Q7 ≥ 1.5, which means that Q7 = 1, or colleague pronounces "day-tuh."

- Below the right split there is just one entry, 0.8686, which is the mean of Y for this condition. You can verify this as mean(Y[Q7 < 1.5]), which gives 0.8686. Since the means of 0/1 data are proportions, this result means that 86.86% of the respondents who said their colleagues say "day-tuh" also say "day-tuh."

- Down the left-hand split, which initially considers only people whose colleagues say "daa-tuh," the next split involves Q2, which is region; it takes the values 1, 2, 3, 4 for the West, South, Midwest, and Northeast regions, respectively. The left split again indicates truth of Q2 < 3.5, meaning that the left split contains data from regions 1, 2, and 3. The entry 0.2 means that 20% of the subset determined by (i) colleague says "daa-tuh" and (ii) person lives in region 1, 2, or 3 pronounce "day-tuh." In other words, in this subset, 80% pronounce "data" as their colleagues do. To verify this result manually, enter the command mean(Y[(Q7 >= 1.5) & (Q2 < 3.5)]).

- The right-hand split on Q2 < 3.5 indicates that the condition is false, so that Q2 ≥ 3.5. With these data, the only possibility is Q2 = 4, i.e., people who live in the Northeast region. Among these people (whose colleagues say "daa-tuh," because of the higher-level split), younger people are likely to ignore their colleagues (57.78% say "day-tuh"), but older people are likely to mimic their colleagues (18.42% say "day-tuh").

This is an example of the kind of high-level interaction that tree regression can tease out: the effect of a colleague's pronunciation of "data" depends on both region and age, a three-way interaction. (Self-study problem: find the percentages 57.78% and 18.42% by hand. A sketch after the next paragraph shows one way to check your answers in R.)

As you can see, tree regression is very useful for identifying "pockets," defined by the X variables, where the Y variable differs greatly from pocket to pocket. If the Y variable were an indicator of fraudulent activity, the pockets of observations where Y is high, as shown by the tree output, would identify specific observations to investigate further.
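Here is one way to check the self-study answers in R. The Age cutoff is not given in the text, so the value below is only a placeholder, which you must replace with the split value printed at the Age node of Figure 17.1:

age.cut = 40   # placeholder only; read the actual Age split value off the tree
mean(Y[(Q7 >= 1.5) & (Q2 >= 3.5) & (Age < age.cut)])    # should give about 0.5778
mean(Y[(Q7 >= 1.5) & (Q2 >= 3.5) & (Age >= age.cut)])   # should give about 0.1842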

To understand what the tree regression model is doing in more detail, consider the charity data set again, and consider the prediction of Y = charitable contributions (in log scale) as a function of a single X variable, income (again in log scale). The following R code shows the default analysis from rpart, as well as a simplified analysis that will facilitate explanation.

## Initial analysis.
char = read.csv("...")   # the data URL is truncated in the source
attach(char)
library(rpart)
fit1 = rpart(CHARITY ~ INCOME)
plot(fit1); text(fit1, cex=.7)

This code produces the tree shown in Figure 17.2.

Figure 17.2. Tree regression using rpart to predict log charitable contributions in terms of log income (denoted INCOME in the figure).

Figure 17.2 tells you that, as you might expect, charitable contributions tend to be higher among people with higher incomes. But rather than express that relationship as a linear function, as in the classical regression model, the tree regression expresses it in terms of ranges of values of INCOME where contributions are either lower or higher. To understand the method further, consider the following code, which provides a simplified analysis; the maxdepth=1 option restricts the tree to a single split.

## Simplified analysis.
char = read.csv("...")
attach(char)
library(rpart)
fit2 = rpart(CHARITY ~ INCOME, maxdepth=1)
plot(fit2); text(fit2, cex=.7)

Figure 17.3. Tree regression using rpart with maxdepth=1 to predict log charitable contributions in terms of log income (denoted INCOME in the figure).

In Figure 17.3, notice that we are just looking at the first split from Figure 17.2. The split is the same, at INCOME < 10.91. Notes on these two analyses are as follows:

- The split value, 10.91, is chosen to maximize the separation in the Y variable (here, log charitable contributions) between the two groups. Different measures of separation are possible, such as the F statistic or the log-likelihood statistic. For categorical responses, the default measure used by rpart is somewhat unusual (the Gini index); for a continuous response such as this one, rpart's default is the reduction in the sum of squared errors. (A sketch of split selection appears after this list.)

- The method is called recursive partitioning because the further splits shown in Figure 17.2 use the same splitting method as the first split, except that the method is applied to subsets of the data. For example, after the split on INCOME < 10.91, the subset of data where INCOME < 10.91 is subjected to the same splitting algorithm, which finds the next optimal split at INCOME < 10.32, as shown in Figure 17.2. The right-hand side of that split indicates INCOME ≥ 10.32, but since this split is down the path where INCOME < 10.91, the actual range of INCOME in this group is 10.32 ≤ INCOME < 10.91.

- The predicted values (Ŷ) from the model are constant over the resulting intervals, as shown in the following code and in Figure 17.4.
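To make the split-selection idea concrete before turning to the plotting code, here is a minimal sketch, not rpart's actual implementation, that scores every candidate cut point by the total within-group sum of squared errors and returns the best single split:

## Brute-force search for the best single split of y on x: try each midpoint
## between adjacent x values and keep the cut that minimizes
## SSE(left group) + SSE(right group), i.e., maximizes separation.
best.split = function(x, y) {
  xs   = sort(unique(x))
  cuts = (head(xs, -1) + tail(xs, -1)) / 2
  sse  = sapply(cuts, function(ct)
    sum((y[x < ct] - mean(y[x < ct]))^2) +
    sum((y[x >= ct] - mean(y[x >= ct]))^2))
  cuts[which.min(sse)]
}
best.split(INCOME, CHARITY)   # should land near the 10.91 split found by rpart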

R Code for Figure 17.4:

Income.plot = seq(8.5, 12, .001)
Income.pred = data.frame(Income.plot)
names(Income.pred) = c("INCOME")
Yhat1 = predict(fit1, Income.pred)
Yhat2 = predict(fit2, Income.pred)
par(mfrow=c(1,2))
plot(INCOME, CHARITY, pch=".", cex=1.5)
points(Income.plot, Yhat1, type="l", lwd=2)
abline(lsfit(INCOME, CHARITY), lwd=1.5, lty=2)
plot(INCOME, CHARITY, pch=".", cex=1.5)
points(Income.plot, Yhat2, type="l", lwd=2)
abline(lsfit(INCOME, CHARITY), lwd=1.5, lty=2)

Figure 17.4. Scatterplots of (ln(Income), ln(Charitable Contributions)), with predictions of mean charitable contributions using tree regression (solid lines) and ordinary least squares (dashed lines). Left panel: default tree from rpart. Right panel: tree pruned to depth = 1.

As Figure 17.4 shows, the tree regression estimates of the conditional mean function are flat line segments over interval ranges. Clearly, these functions are not very good as estimates of the true mean function, because Nature favors continuity over discontinuity. On the other hand, the results are very simple to use and interpret.

In multiple regression, the tree functions are not flat lines but flat planes or hyperplanes, defined over various rectangular regions determined by combinations of the X variables. Picture a city building that has been built in pieces, with different parts added on over time, all with different heights. That's what the tree regression function looks like in multiple regression.
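You can also confirm the flat-segment structure numerically: a fitted tree produces exactly one predicted value per terminal node. A quick sketch, assuming fit1 and fit2 are still in the workspace:

sort(unique(predict(fit1)))   # a handful of distinct values, one per terminal node of Figure 17.2
sort(unique(predict(fit2)))   # just two values for the single-split tree of Figure 17.3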

The most interesting and useful applications of tree regression occur when there are many X variables, and where the nature of the relationship between Y and the X variables is highly nonlinear, involving high-level interactions. With simpler, well-behaved examples, such as the regression of CHARITY on INCOME shown above, the simple linear regression is obviously much better, as Figure 17.4 shows.

Neural Network Regression

Neural network regression has the same goal as tree regression; namely, to estimate conditional mean functions that are highly nonlinear, perhaps also involving high-level interactions. An advantage of neural networks over trees is that the functions are estimated as continuous functions rather than as discontinuous flat-line (or flat-plane) segments. Thus, neural network estimates are generally more realistic than tree estimates, because, again, Nature generally favors continuity over discontinuity.

A disadvantage of neural network estimates is that the results are not simple to interpret. Trees are the easiest to interpret, classical regression is next easiest, and neural networks are the hardest. Thus, instead of trying to interpret the results, users of neural networks usually just treat them as a black box that produces a predicted mean, Ŷ = f̂(x1, x2, ..., xk). Like LOESS, the neural network regression function f̂(x1, x2, ..., xk) has a more complicated form than other regression functions, and researchers therefore do not ordinarily examine its functional form.

There are many practical applications where you really do not care about the functional form of Ŷ; all you care about is getting a Ŷ that is close to Y. Examples include:

- The predicted range of your car (in miles or kilometers), which predicts how far your car can travel, given the current fuel in the tank and current (and recent) fuel economy.

- The predicted EKG reading of a heart patient, given current (more simply obtained) readings of pulse, sweat, and blood pressure.

- The creditworthiness of a customer who is applying for a loan (think of your credit score).

These examples are all cases where you want a good prediction of Y, but you don't necessarily care how that prediction was obtained. In such cases, neural network regression can be useful.

Universal Approximators

First, a word about function approximation. A main goal in regression is to approximate the conditional mean function E(Y | X1 = x1, X2 = x2, ..., Xk = xk) = f(x1, x2, ..., xk). As noted repeatedly throughout this book, linear, quadratic, interaction, exponential, and other functions are nearly always different from the true f(x1, x2, ..., xk). But some types of functions can be made arbitrarily close to f(x1, x2, ..., xk); such functions are called universal approximators.
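Stated formally (a standard definition, paraphrased rather than quoted from this book): a class of functions G is a universal approximator over a bounded set X if, for every target function f and every tolerance ε > 0, there is a g in G with

sup_{x in X} |f(x) − g(x)| < ε,

that is, g stays uniformly within ε of f over the whole set. (For the Stone-Weierstrass Theorem discussed next, the target f is assumed continuous and the set X closed and bounded.)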

Polynomial functions are universal approximators. That is, given any continuous function f(x1, x2, ..., xk) defined over a bounded X set, you can approximate f(x1, x2, ..., xk) arbitrarily well by a polynomial function g(x1, x2, ..., xk). You might need an extremely high-order polynomial (recall that such functions involve all possible interaction terms up to the given order), but still, there exists a polynomial g(.) such that the maximum difference between f(.) and g(.) is as small as you would like. This result is mathematically proven as the Stone-Weierstrass Theorem.

While the Stone-Weierstrass Theorem seems to suggest that you should use high-order polynomial models to estimate regression functions, you know from the variance-bias trade-off that, while such higher-order models will be less biased (because of the Stone-Weierstrass Theorem), they also require a huge number of estimated parameters. Hence, high-order polynomial models will suffer from extremely high variance when estimated using data. In addition, polynomial models can have extremely poor behavior at the extremes of the data, or where data are sparse. At the extremes, the predictions shoot off quickly to positive or negative infinity, because that is how high-order polynomial terms like x^3, x^4, x^5, etc., behave. Where data are sparse, the behavior can also be erratic, as the polynomial examples shown earlier in this book illustrate.

There are universal approximators other than polynomials that do not have such wild behavior. Fourier analysis gives you functions g(.) involving sines and cosines that can approximate general functions f(.); such Fourier functions g(.) are also universal approximators, but they do not have the wild extrapolation properties of polynomials. Being based on sines and cosines, Fourier functions g(.) are best¹ for approximating functions f(.) that are cyclical; but, like polynomials, they can approximate any function arbitrarily well.

There are infinitely many universal approximators, depending on the particular set of basis functions that you use. In polynomial approximators, the basis functions are polynomials; in Fourier analysis, the basis functions involve sines and cosines. Pick an appropriate set of basis functions, and voilà! You can construct a universal approximator. "Neural network" is just a fancy-sounding name (meant to suggest that it somehow operates like the brain!) for a universal approximator that involves a particular type of basis function, called an activation function in neural network jargon. The standard activation function is the same one that you use in logistic regression, namely 1/(1 + e^(-x)). Just as polynomial functions are constructed from polynomial terms of various orders, their interactions, and constant multipliers, neural network functions are constructed from logistic functions, their interactions, and constant multipliers.

¹ "Best" means that they require fewer terms to arrive at a good approximation. Statistically, such parsimony is preferred because there will be fewer constant multipliers to estimate.
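To make the idea of a logistic basis function concrete, here is a minimal sketch, with invented weights rather than values estimated from data, of what a one-hidden-unit neural network regression function looks like: a constant plus a scaled logistic function of a linear term in x.

## A one-hidden-unit network is just b0 + b1 * logistic(a0 + a1*x).
## The weights below are made up for illustration only; in practice they
## are estimated from data, as the neuralnet fit in the next example does.
logistic = function(z) 1 / (1 + exp(-z))
nn1 = function(x, a0, a1, b0, b1) b0 + b1 * logistic(a0 + a1 * x)
curve(nn1(x, a0 = -31, a1 = 3, b0 = 4, b1 = 2.5), from = 8.5, to = 12,
      xlab = "x", ylab = "nn1(x)")   # a smooth, S-shaped mean function

Adding more hidden units adds more such logistic terms, and combining them lets the network bend the fitted mean function in flexible, continuous ways.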

Example: Predicting Charitable Contributions Using a Neural Network

Let's see how the neural network works in a real example first, and then we'll explain what it is doing.

R Code for Figure 17.5:

char = read.csv("...")   # the data URL is truncated in the source
attach(char)
f = as.formula(paste("CHARITY ~ INCOME"))
dat = data.frame(CHARITY, INCOME)
library(neuralnet)
fit.nn = neuralnet(f, data=dat, hidden=c(1,1))
Income.plot = seq(8.5, 12, .001)
Income.pred = data.frame(Income.plot)
names(Income.pred) = c("INCOME")
Yhat.nn = compute(fit.nn, Income.pred)$net.result
plot(INCOME, CHARITY, pch=".", cex=1.5)
points(Income.plot, Yhat.nn, type="l", lwd=2)
abline(lsfit(INCOME, CHARITY), lwd=1.5, lty=2)

Figure 17.5. Scatterplot of (ln(Income), ln(Charitable Contributions)) data, with neural network (solid) and ordinary least squares (dashed) fits.

Figure 17.5 shows that the neural network fit is akin to LOESS in that it allows curved relationships. The flattened appearance on the left is due to the use of the logistic basis functions. (We still need to explain what the network is doing; the sketch below gives a first look inside.)
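One way to take a first look inside the black box, assuming the fit.nn object above was estimated successfully: the neuralnet package stores the estimated weights in the fitted object, and its plot method draws the network with the weights on the arrows.

fit.nn$weights   # estimated weight matrices, one set per layer
plot(fit.nn)     # network diagram with the estimated weights on the arrows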
