Linear Regression

1 Linear Regression

While linear regression has limited value in the classification problem, it is often very useful in predicting a numerical response on an interval or ratio scale. Furthermore, it is simple. Further-furthermore, performing linear regression analysis involves many important principles of data analysis. So it's a good place to start.

2 The Model

In the basic problem, we have some predictor variables and a response. Let p be the number of predictors. We have n observations on all of these. So we have an n × p matrix of predictors; call it X. And we have an n-vector of numerical responses; call it Y. A model for this situation is

Y ≈ f(X),

where f is some function. In linear regression, we have

Y ≈ Xβ,

where β is an unknown vector.

3 The Model

Instead of Y ≈ Xβ, let's write

Y = Xβ + ɛ,

where ɛ is an n-vector of latent variables we'll call errors. We treat ɛ as a random vector. Our problem is to find β.

4 Assumptions about the Error Term in Linear Regression

We make some simple assumptions about the distribution of the random vector ɛ. At the least, we assume

E(ɛ) = 0,  V(ɛ) = σ²I,

where I is the identity matrix. We often also add an additional assumption: we may assume that ɛ has a multivariate normal (Gaussian) distribution. As we discuss linear regression further, it is important to recognize whether or not we make the additional assumption of normality (for example, for t tests).

5 Simple Linear Regression

Just as regression in general can be used to illustrate many of the principles of modeling in data science, the simple linear model can be used to illustrate many of the principles and techniques of multiple linear regression. It's easy to visualize. The model, one observation at a time, is

yᵢ = β₀ + β₁xᵢ + ɛᵢ,

where we assume, at the least,

E(ɛᵢ) = 0 for all i,
V(ɛᵢ) = σ² for all i (finite and constant),
Cov(ɛᵢ, ɛⱼ) = 0 for all i, j with j ≠ i.

(These are the same as on the previous slide, except these are written in terms of the individual elements of the vector.) We may also make the further assumption that the errors have a normal distribution.

6 Notation

What is X if we write the model as y = Xβ + ɛ? This is always a problem. Should we write y = β₀ + xᵀβ + ɛ? How do we write the constant (the intercept)? I don't have an opinion; I just want you to be aware of the possibilities. Different authors do it differently, and I even do it differently at different times. There is an interesting property of a least-squares fit that is relevant to this issue, as we will see.

7 Simulated Data

Simulation of data is very useful, both for teaching and for research. In research, it allows us to study various scenarios (a "Monte Carlo" study). In teaching, it just gives us some numbers and pictures to look at. Let's simulate some data for simple linear regression.

8 Simulated Data

We'll generate some data from this model:

yᵢ = β₀ + β₁xᵢ + ɛᵢ.

First, we need to decide on what's random. Only ɛᵢ. Now we need to decide on a probability distribution for ɛᵢ.

9 Simulated Data

The distribution must be consistent with the three assumptions mentioned earlier. Independence ⟹ Cov = 0, so that's easy to satisfy. Let's use ɛᵢ iid N(0, σ²). The mean is 0 and the variance is finite and constant. Now, what about the xᵢ? Let's just let them be randomly distributed over (0, 10). Let β₀ = 0.8 and β₁ = 1.2.

10 Simulate the Data in R and Plot It

set.seed(555)
beta0 <- 0.8
beta1 <- 1.2
xlo <- 0
xhi <- 10
n <- 20
eps <- rnorm(n)
x <- runif(n, xlo, xhi)
y <- beta0 + beta1*x + eps
plot(x, y, main="Simulated Data")

Save the plot. (savePlot is specific to MS Windows.)

setwd("c:/isye6740_course/l03")
savePlot("Fig01L030505", type="ps")

11 Simulated Data

[Figure: scatter plot of the simulated data, y versus x]

12 How the Data Compare to the (Known) Model

Plot the line representing the true model, and draw lines representing the errors.

abline(beta0, beta1, col="green")
title("True Model")
for (i in 1:n) lines(c(x[i], x[i]), c(y[i], beta0 + beta1*x[i]), col="green")

Save plot.

13 True Model

[Figure: the data with the true regression line and the errors drawn as vertical segments]

14 Fitting the Data

Now, suppose we have the data, but don't know β₀ and β₁. This is the usual case, of course. Let's just try a line with β₀ = 1 and β₁ = 1. The plot function starts a new graph.

plot(x, y)
b0 <- 1
b1 <- 1
abline(b0, b1, col="red")
title(expression(paste("Model with ", beta[0], "=1, ", beta[1], "=1")))
for (i in 1:n) lines(c(x[i], x[i]), c(y[i], b0 + b1*x[i]), col="red")

Save plot.

15 Model with β₀ = 1, β₁ = 1

[Figure: the data with the trial line β₀ = 1, β₁ = 1 and its residuals]

16 Fitting the Data

Doesn't look good. The residuals are bad: they are not balanced. Look at the sum of the squares of the residuals:

sum((y - (b0 + b1*x))^2)

This is called the residual sum of squares, RSS, for this model, that is, for these two values of β₀ and β₁.

17 RSS for the (Unknown) True Model

What about the sum of the squared residuals for the true model (which we don't know)?

sum((y - (beta0 + beta1*x))^2)

The sum of squares provides a comparative indication of how well a model fits the data.

18 Least Squares Fit

OK. Let's fit the model by determining β₀ and β₁ in such a way as to make the sum of squared residuals small. Least squares. An R function that will do this is lm. It generates an R object, which I will name fit, that has all kinds of information in it.

19 Least Squares Fit

Here's what my R console looks like:

> fit <- lm(y~x)
> summary(fit)

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                    *
x                                       e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 18 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: 249 on 1 and 18 DF, p-value: 5.502e-12

20 Least Squares Fit

We can see what kinds of information are in the fit object using the function names.

> names(fit)
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"

We can access individual pieces of the fit object using the extractor $. For example,

> fit$coefficients
(Intercept)           x

We can abbreviate the names; coef, for example.

21 Least Squares Fit

What is the RSS for this fit?

> b0hat <- fit$coef[1]
> b1hat <- fit$coef[2]
> sum((y - (b0hat + b1hat*x))^2)

Even better than the true model (which, of course, we don't know). That's because it's a least squares fit.

22 Least Squares Fit

Plot it, as before. Also plot the mean point (x̄, ȳ).

plot(x, y)
abline(b0hat, b1hat, col="blue")
title("Least Squares Fit")
for (i in 1:n) lines(c(x[i], x[i]), c(y[i], b0hat + b1hat*x[i]), col="blue")
points(mean(x), mean(y), pch="+", cex=1.5, col="red")
legend("bottomright", c("mean"), pch="+", col="red")

23 Least Squares Fit

[Figure: the least squares fit with its residuals; the mean point (x̄, ȳ) is marked with +]

24 Properties of the Least Squares Estimators

First, what are the estimators? Differentiate

Σ(yᵢ − (b₀ + b₁xᵢ))²

with respect to b₀ and b₁ and set the derivatives equal to 0. Get two equations. (Exercise.) Solve them and call the solution β̂₀ and β̂₁. How do you know the solution of the equations is a minimum?
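As a sketch of that exercise, the two derivatives, set equal to 0, are

∂/∂b₀ Σ(yᵢ − b₀ − b₁xᵢ)² = −2 Σ(yᵢ − b₀ − b₁xᵢ) = 0,
∂/∂b₁ Σ(yᵢ − b₀ − b₁xᵢ)² = −2 Σ xᵢ(yᵢ − b₀ − b₁xᵢ) = 0.

The matrix of second derivatives is 2 times the matrix with rows (n, Σxᵢ) and (Σxᵢ, Σxᵢ²), which is positive definite whenever the xᵢ are not all equal, so the solution is indeed a minimum.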

25 Properties of the Least Squares Estimators

The solutions, which are our estimators, written in terms of xᵢ − x̄ and yᵢ − ȳ, are

β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,
β̂₀ = ȳ − β̂₁x̄.

We see the first important property: the least-squares line goes through (x̄, ȳ). Therefore, you often see the derivation in terms of xᵢ − x̄ and yᵢ − ȳ. This is a result, not an a priori requirement.
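We can check these formulas numerically against lm on the simulated data (a quick sketch; it assumes the x and y vectors from the earlier simulation are still in the workspace):

b1hat.manual <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0hat.manual <- mean(y) - b1hat.manual * mean(x)
c(b0hat.manual, b1hat.manual)   # the hand-computed estimates
coef(lm(y ~ x))                 # should agree up to rounding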

26 Properties of the Least Squares Estimators

From the expressions for β̂₀ and β̂₁ we can figure out all kinds of things. If we assume that the distribution of ɛᵢ is normal, from our original model we have

yᵢ ~ N(β₀ + β₁xᵢ, σ²).

Likewise, we can work out the distribution of ȳ.

27 Properties of the Least Squares Estimators

From the distribution of yᵢ and ȳ, we can get the distribution of β̂₀ and β̂₁. It is also easy to work out any property, such as E(β̂₁), for example:

E(β̂₁) = E[ Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² ]
      = [1/Σ(xᵢ − x̄)²] E[ Σ(xᵢ − x̄)(yᵢ − ȳ) ]
      = [1/Σ(xᵢ − x̄)²] Σ(xᵢ − x̄)(E(yᵢ) − E(ȳ))
      = [1/Σ(xᵢ − x̄)²] Σ(xᵢ − x̄)(β₀ + β₁xᵢ − β₀ − β₁x̄)
      = [1/Σ(xᵢ − x̄)²] β₁ Σ(xᵢ − x̄)(xᵢ − x̄)
      = β₁.

28 Properties of the Least Squares Estimators

So we see that β̂₁ is an unbiased estimator of β₁. Likewise, we can see that β̂₀ is an unbiased estimator of β₀. (Exercise: work that out.) In similar fashion, we can work out the variances of β̂₀ and β̂₁. We can even work out their distributions. They are normal under our assumption that the ɛᵢ are normal.

29 Summary of Properties Using Matrix Notation

We will use the full model with intercept (that is, where X has a column of 1's):

y = Xβ + ɛ.

Sum of squared residuals: (y − Xβ)ᵀ(y − Xβ)
Derivative set equal to 0: XᵀXβ − Xᵀy = 0
Solution: β̂ = (XᵀX)⁻¹Xᵀy
Expected value: E(β̂) = (XᵀX)⁻¹XᵀE(y) = (XᵀX)⁻¹XᵀXβ = β
Predicted values: ŷ = X(XᵀX)⁻¹Xᵀy
The matrix X(XᵀX)⁻¹Xᵀ is called the hat matrix.
Residuals: r = (I − X(XᵀX)⁻¹Xᵀ)y
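These matrix formulas are easy to verify numerically on the simulated data (a sketch; solve() is used here for clarity, not because it is the right numerical algorithm):

X <- cbind(1, x)                   # design matrix with a column of 1's
XtX <- t(X) %*% X
betahat <- solve(XtX, t(X) %*% y)  # solves the normal equations
H <- X %*% solve(XtX) %*% t(X)     # the hat matrix
yhat <- H %*% y                    # predicted values
r <- y - yhat                      # residuals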

30 R Functions

After the function lm has been used, and an object of class lm has been created, various things can be extracted from that object. Suppose we have used a statement like

myfit <- lm(...

Confidence intervals for the coefficients are given by confint(myfit). The predicted values (a vector) are given by predict(myfit). The residuals are given by residuals(myfit). The studentized residuals (what are they?) are given by rstudent(myfit). The values along the diagonal of the hat matrix are given by hatvalues(myfit).

31 Interesting Fact about Residuals

Xᵀr = Xᵀ(I − X(XᵀX)⁻¹Xᵀ)y
    = (Xᵀ − XᵀX(XᵀX)⁻¹Xᵀ)y
    = (Xᵀ − Xᵀ)y
    = 0.

So the residuals are orthogonal to every column of X. That means they have zero correlation with each X variable. And since X contains a column of 1's, it also means the sum of the residuals is 0.
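Using the X and r computed in the sketch above, we can see the orthogonality numerically:

t(X) %*% r   # essentially 0 for every column of X (up to rounding)
sum(r)       # essentially 0, since X contains a column of 1's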

32 Testing Hypotheses Concerning the Coefficient Estimates

Knowing the distributions of β̂₀ and β̂₁ allows us to develop a test of the hypothesis

H₀: β₁ = 0

versus the alternative

Hₐ: β₁ ≠ 0,

for example; it is a t test. It uses the standard error. (Review: What's a t test? What's a standard error?) If we do not reject this hypothesis, that is, if β₁ = 0 in the model, there's really no linear relation between the predictor and the response.

33 t Tests

The results of the t tests are shown in the summary for fit on a previous slide:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)                                    *
x                                       e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Can you interpret these? Would we reject the null hypotheses?

34 The Accuracy of the Model

Even if we reject the null hypothesis, and conclude that there is a linear relation between the predictor and the response, the question still remains of how well the model fits the data. As we have seen, the RSS tells us which of two models fits better. To make it more meaningful, however, we need to standardize it in some way. First of all, it is clear that RSS gets bigger the more observations we have, so it would be a good idea to scale it back by n, the sample size.

35 The MSE and the RSE

It can be proven (we won't do that here) that the mean squared error,

MSE = RSS / (n − p),

is an unbiased estimator of σ² (assuming all assumptions for the model are satisfied). We call the square root of the MSE the residual standard error, or RSE:

RSE = √(RSS / (n − p)).

While MSE is unbiased for σ², RSE is not unbiased for σ.
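A quick numerical check (a sketch; here p = 2, counting the intercept and the slope, and fit is the lm object from earlier):

rss <- sum(residuals(fit)^2)
sqrt(rss / (n - 2))     # the RSE by hand
summary(fit)$sigma      # the "Residual standard error" reported by R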

36 R²

Another way of assessing the accuracy of the model is by comparing the RSS with the total variation, which we measure by the total sum of squares, TSS, that is, the sum of the squared deviations of the response from its mean, without the linear model:

TSS = Σ(yᵢ − ȳ)².

Obviously, if RSS is almost as large as TSS, the model did not explain very much of the variation in the response. A measure of how much variation it does explain is called R². It is defined as

R² = 1 − RSS/TSS.
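Continuing the sketch above, R² by hand:

tss <- sum((y - mean(y))^2)
1 - rss / tss              # matches Multiple R-squared in summary(fit)
summary(fit)$r.squared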

37 R²

It ranges from 0 to 1. (Remember ȳ = β̂₀ + β̂₁x̄, and RSS is the minimum of anything of the form Σ(yᵢ − b₀ − b₁xᵢ)², so RSS ≤ TSS.) It is often expressed as a percentage. For good fits the value of R² varies widely depending on the application. In the social sciences, an R² of 60% may indicate a very good-fitting model. In engineering applications, an R² less than 90% may indicate a poor-fitting model.

38 Other Criteria for Fitting the Simple Linear Regression Model

We chose to fit the model

yᵢ = β₀ + β₁xᵢ + ɛᵢ

by finding values of β₀ and β₁ that minimize

Σ(yᵢ − β₀ − β₁xᵢ)².

Least squares. The properties of the least-squares estimators that we have discussed, which we can use to test hypotheses and make other types of inference, depend on the way we obtained these estimators. It might make sense to use another criterion for fitting the regression equation. For example, we may choose to find values of β₀ and β₁ that minimize

Σ|yᵢ − β₀ − β₁xᵢ|^p,

for some p ≥ 1. These are called Lₚ estimators.

39 L₁ Estimators for the Simple Linear Regression Model

We want to minimize the sum of the absolute values of the residuals:

Σ|yᵢ − β₀ − β₁xᵢ|.

The R function l1fit in the L1pack package does this.

library(L1pack)
plot(x, y)
fitL1 <- l1fit(x, y)
b0L1 <- fitL1$coef[1]
b1L1 <- fitL1$coef[2]
abline(b0L1, b1L1, col="blue")
title("Least Absolute Values Fit")
for (i in 1:n) lines(c(x[i], x[i]), c(y[i], b0L1 + b1L1*x[i]), col="blue")
points(mean(x), mean(y), pch="+", col="red")

40 Least Absolute Values Fit

[Figure: the least absolute values fit with its residuals]

41 [Figure: the LS and LAV fits plotted together for comparison]

42 The Principles and Procedures of Simple Linear Regression

The methods we have used and illustrated in the simple regression model apply to other regression models. A basic procedure in any analysis is to partition the total variation into two parts, explained and residual. We can measure variation in different ways, such as squared deviations or absolute deviations, for example.

43 The Principles and Procedures of Simple Linear Regression and Least Squares

The most common measure of variation is the squared deviation from a mean. We partition the variation as

TSS = (TSS − RSS) + RSS = explained + residual.

We observed several things about least squares in simple linear regression:
- The coefficient estimators are unbiased.
- The MSE is unbiased for the error variance.
- The least squares fit goes through the mean point (x̄, ȳ).

44 The Roles of Training and Test Data

Given a model and some data, we use the data to train the model (to fit it). The best fit is obtained by using all of the data. After fitting the model, we can look at things like R², but those really don't tell us if the model is good. Suppose we are unsure what kind of model to use. Use some of the data to fit it (the "training data") and hold back some to see how well the fitted model fits this "test data". In most applications of machine learning, the use of two subgroups of the data, the training set and the test set, is important. It is not done so often in regression analysis, but the idea is valid and it is straightforward, as the sketch below shows.
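A minimal sketch on our simulated data (the 70/30 split and the seed are arbitrary choices for illustration):

set.seed(666)                          # arbitrary seed, just for reproducibility
train <- sample(n, floor(0.7 * n))     # indices of the training observations
fit.tr <- lm(y ~ x, subset = train)    # fit on the training data only
pred <- predict(fit.tr, newdata = data.frame(x = x[-train]))
mean((y[-train] - pred)^2)             # mean squared error on the test data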

45 Multiple Linear Regression

The model is

y = Xβ + ɛ,

where y is an n-vector, X is an n × p matrix, ɛ is an n-vector, and V(ɛ) = σ²I. Recall the possible ambiguities regarding the intercept, that is, a column of 1's in X. Remember the difference between multiple linear regression and multivariate linear regression.

46 Multiple Linear Regression

All of the things we did with the simple linear regression model apply to this. For least squares, we minimize

(y − Xβ)ᵀ(y − Xβ).

We expand this, differentiate with respect to β, and set the result equal to 0. We get a system of p equations in p unknowns:

XᵀXβ − Xᵀy = 0.

These are called the normal equations.

47 Least-Squares Estimators and Their Properties

These are very similar to those of the simple linear regression model. Remember, there are other kinds of estimators! But the LS estimators are easiest and also have some of the nicest properties. The least-squares estimators are solutions to the normal equations:

β̂ = (XᵀX)⁻¹Xᵀy,

if XᵀX is nonsingular, that is, if the rank of X is p; otherwise,

β̂ = (XᵀX)⁺Xᵀy.

(Pseudoinverse; in simple linear regression, this would be like the two normal equations being the same, which happens iff x is constant.)

48 Least-Squares Estimators and Their Properties

Let's assume the rank of X is p ("full rank"), so we will write β̂ = (XᵀX)⁻¹Xᵀy. Comment: this is not the way to compute the estimators; the computer algorithms are different. Because y ~ N(Xβ, σ²I),

β̂ ~ N( (XᵀX)⁻¹XᵀXβ, σ²(XᵀX)⁻¹Xᵀ((XᵀX)⁻¹Xᵀ)ᵀ ).

Simplify:

β̂ ~ N(β, σ²(XᵀX)⁻¹).

This tells us everything we need to know to make inferences about the data-generating process: test hypotheses, etc.
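We can check the covariance formula numerically against R's vcov function (a sketch, using the simple-regression fit from earlier):

X <- cbind(1, x)
s2 <- summary(fit)$sigma^2        # estimate of sigma^2
s2 * solve(t(X) %*% X)            # estimated covariance of betahat, by hand
vcov(fit)                         # should agree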

49 Statistical Inference

We can do the same kinds of things we did with the simple linear regression model:

yᵢ = β₀ + β₁x₁ᵢ + ⋯ + βₚxₚᵢ + ɛᵢ.

We can test individual hypotheses, such as H₁₀: β₁ = 0 or H₂₀: β₂ = 0. Each is a t test. The problem is that the tests are not independent of each other.

50 Statistical Inference

We might consider an hypothesis about all the βs at once:

H₀: β₁ = ⋯ = βₚ = 0; that is, β = 0,

versus

Hₐ: βⱼ ≠ 0 for at least one βⱼ; that is, β ≠ 0.

This requires a different kind of test (because the distribution of β̂ is different; it is multivariate normal). An F test.

51 Statistical Inference

F = [ (TSS − RSS)/p ] / [ RSS/(n − p − 1) ].

(What's the 1 for?) F with p and n − p − 1 degrees of freedom. To test just q of the coefficients, fit the model without those variables and get the RSS; call it RSS₀. Then form

F = [ (RSS₀ − RSS)/q ] / [ RSS/(n − p − 1) ],

F with q and n − p − 1 degrees of freedom.

52 Prediction

For a specific set of X values, say x₀ (this is a p-vector),

ŷ = x₀ᵀβ̂.

What's the expected value? It is unbiased:

E(ŷ) = E(x₀ᵀβ̂) = x₀ᵀE(β̂) = x₀ᵀβ.

What's the variance?

V(ŷ) = V(x₀ᵀβ̂) = x₀ᵀV(β̂)x₀ = x₀ᵀ(XᵀX)⁻¹x₀ σ².
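A sketch checking the variance formula at a hypothetical point, x = 5, for the simple-regression fit:

x0 <- c(1, 5)                          # intercept term and x = 5
X <- cbind(1, x)
sqrt(t(x0) %*% solve(t(X) %*% X) %*% x0) * summary(fit)$sigma
predict(fit, newdata = data.frame(x = 5), se.fit = TRUE)$se.fit   # should agree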

53 Simulate Data in R and Fit It

set.seed(555)
beta0 <- 0.8
beta1 <- 1.2
beta2 <- 2.8
beta3 <- 4.2
xlo <- 0
xhi <- 10
n <- 20
eps <- rnorm(n)
x1 <- runif(n, xlo, xhi)
x2 <- runif(n, xlo, xhi)
x3 <- runif(n, xlo, xhi)
y <- beta0 + beta1*x1 + beta2*x2 + beta3*x3 + eps
fit3 <- lm(y ~ x1 + x2 + x3)

54 Fit of Model with Simulated Data

> summary(fit3)

Call:
lm(formula = y ~ x1 + x2 + x3)

Residuals:
    Min      1Q  Median      3Q     Max

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)
x1                                      e-10 ***
x2                                      e-15 ***
x3                                     < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: on 16 degrees of freedom
Multiple R-squared: , Adjusted R-squared:
F-statistic: 1679 on 3 and 16 DF, p-value: < 2.2e-16
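The overall and partial F tests described on slides 50 and 51 can be reproduced with anova (a sketch using fit3 above):

fit0 <- lm(y ~ 1)            # the null model: intercept only
anova(fit0, fit3)            # overall F test, as reported by summary(fit3)
fitred <- lm(y ~ x1)         # drop x2 and x3, so q = 2
anova(fitred, fit3)          # partial F test for the two dropped coefficients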

55 Are All of the Predictor Variables Important?

Forward selection: Find the best single variable; then find the next best to add; and so on. How?
Backward selection: Put all variables in the model; remove the least important; then remove the next least important; and so on. How?
Stepwise selection: Find the best few; find the least important of those; decide whether or not to remove it; then find the next best one to add; and so on.
All best: Find the best single variable; find the best two; find the best three; and so on.

We'll talk more about the criteria later.

56 Qualitative Predictors

Some variables are not numeric, e.g., sex: male or female. Create a numeric variable Z that takes values in (0, 1), (1, 2), or (−1, 1), for example. Then we can do the regression in the usual way. How about 3 categories: red, green, blue? Use two dummy variables, Z₁ and Z₂:

Z₁ = 1 if red, 0 if not red;
Z₂ = 1 if green, 0 if not green.

In R, a factor handles this automatically, as sketched below.
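A sketch with made-up values; lm constructs the dummy variables from the factor:

color <- factor(c("red", "green", "blue", "green", "red"))  # hypothetical data
model.matrix(~ color)   # shows the dummy coding R builds (blue is the baseline)
# lm(y ~ color) would then fit one coefficient per non-baseline level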

57 Polynomial Regression

yᵢ = β₀ + β₁xᵢ + β₂xᵢ² + ɛᵢ.

This is essentially a linear model, y = Xβ + ɛ, where the i-th row of X is (1, xᵢ, xᵢ²).

58 Interactions among Predictors

yᵢ = β₀ + β₁x₁ᵢx₂ᵢ + β₂x₃ᵢ + ɛᵢ.

This is different from the additive linear model, but it is still linear in the coefficients: the product x₁x₂ is just another predictor column.

59 The Regression Model Statement in R

Suppose we have a response y and predictors x1 and x2. To tell R we want the model yᵢ = β₀ + β₁x₁ᵢ + ɛᵢ, after appropriately assigning the variables in R, we use

lm(y ~ x1)

For the model yᵢ = β₁x₁ᵢ + ɛᵢ (no intercept), use

lm(y ~ x1 - 1)   or   lm(y ~ x1 + 0)

For the model yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + ɛᵢ, use

lm(y ~ x1 + x2)

For the model yᵢ = β₀ + β₁x₁ᵢx₂ᵢ + ɛᵢ (interaction), use

lm(y ~ x1:x2)

For the model yᵢ = β₀ + β₁x₁ᵢ + β₂x₂ᵢ + β₃x₁ᵢx₂ᵢ + ɛᵢ ("full factorial"), use

lm(y ~ x1*x2)   (same as lm(y ~ x1 + x2 + x1:x2))

For the model yᵢ = β₀ + β₁x₁ᵢ + β₂x₁ᵢ² + ɛᵢ (polynomial regression), use

lm(y ~ x1 + I(x1^2))

60 Nonlinear Data

yᵢ = β₀ + β₁e^(β₂xᵢ) + ɛᵢ.

Some of the same least squares methods apply, but there are differences. We don't (usually) have unbiasedness; we don't have t and F distributions.

61 Correlated Errors

Time series: Cor(ɛᵢ, ɛᵢ₊₂₀) may be 0, but Cor(ɛᵢ, ɛᵢ₊₁) is not 0. Serial correlation. There are ways of dealing with this (fit an ARMA model, e.g.).

62 Nonconstant Variance of Errors

The variance is often larger for larger values of the response; this is a common problem. Scale the data when fitting. Suppose we have

y = Xβ + ɛ,  V(ɛ) = Σ,

that is, nonconstant variance and correlated errors. Then

β̂ = (XᵀΣ⁻¹X)⁻¹XᵀΣ⁻¹y.

Generalized least squares.
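A minimal sketch of generalized least squares with an assumed, known Σ (here a made-up diagonal Σ, so this reduces to weighted least squares):

Sig <- diag(1 + x)                 # hypothetical: error variance grows with x
Sinv <- solve(Sig)
X <- cbind(1, x)
solve(t(X) %*% Sinv %*% X, t(X) %*% Sinv %*% y)   # the GLS estimate of beta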

63 Outliers (in Response)

Recall the simple linear model we began with:

yᵢ = β₀ + β₁xᵢ + ɛᵢ.

Consider the same dataset as before, except that one of the observations doesn't fit the model well. It is an outlier.

64 Outlier in y

[Figure: the data with one outlying response value]

65 We fit it with least squares.

66 Outlier in y

[Figure: the least squares fit to the data including the outlier]

67 Fit the good data (without the outlier).

68 Outlier in y

[Figure: least squares fits with and without the outlier, for comparison]

69 Fit with L₁.

70 [Figure: two panels comparing the LS fit and the LAV fit on the data with the outlier]

71 High Leverage Points (Outliers in Predictors)

The estimators are affected by the outlier, but there's really not a whole lot of difference. Now consider a slightly different problem.

72 Outlier in x

[Figure: the data with one outlying predictor value]

73 The Effect of High Leverage Points

The response at an outlier in the predictors wields more influence. The hat value (the diagonal element of the hat matrix that corresponds to this point in the predictor space) is a measure of this relative influence.

74 Outlier in x

[Figure: the fit showing the influence of the high leverage point]

75 Look at the effect of an outlier at a high leverage point.

76 Outlier in x

[Figure: the fit with an outlying response at the high leverage point]

77 The Least Squares Residuals Do Not Have the Same Variance

One of our fundamental assumptions is that the errors all have the same variance: V(ɛᵢ) = σ² for all i, where

ɛᵢ = yᵢ − (β₀ + β₁xᵢ).

How about the least squares residuals,

rᵢ = yᵢ − (β̂₀ + β̂₁xᵢ)?

V(rᵢ) depends on xᵢ − x̄.

78 Multicollinearity

Multicollinearity is the situation where one vector of predictors is almost a linear combination of the others. Take three vectors, x₁, x₂, x₃. They are independent iff there do not exist scalars c₁, c₂, c₃, with some cᵢ ≠ 0, such that c₁x₁ + c₂x₂ + c₃x₃ = 0. Suppose they are independent, but x₃ = a₁x₁ + a₂x₂ + d, where d is a vector not equal to 0. What if d is very small? Multicollinearity. Multicollinearity is not a binary quality; it exists in various degrees.

79 Multicollinearity

Predictor variables with strong multicollinearity result in needlessly large variances of the coefficient estimators. The increased variance associated with each coefficient estimator depends on how strongly the corresponding predictor variable is linearly related to the other predictors. We measure this by an R².

80 Multicollinearity

For the j-th predictor, consider the regression

xⱼᵢ = α₀ + α₁x₁ᵢ + ⋯ + αₚxₚᵢ + ɛᵢ,

where xⱼ itself is not included on the right-hand side. The R² for this regression is a measure of the strength of the linear relationship between xⱼ and the other predictors. Let R²ⱼ be that R². We define the variance inflation factor, VIF, for that coefficient estimator as

VIF(β̂ⱼ) = 1 / (1 − R²ⱼ).
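A sketch computing the VIF for x1 in the three-predictor simulation by hand (the vif function in the car package computes the same thing for every coefficient):

r2.1 <- summary(lm(x1 ~ x2 + x3))$r.squared   # R^2 of x1 on the other predictors
1 / (1 - r2.1)                                # VIF for the coefficient of x1
# library(car); vif(fit3)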

81 An Application: Assessing the Effect of Advertising in a Marketing Plan

The Advertising dataset from the book's web site. Read Section

82 Linear Regression and KNN

In the model Y ≈ f(X), for linear regression we let f be a linear function, and we can write Y ≈ Xβ, where β is an unknown vector. To fit the linear regression model, we estimate β by some method, maybe least squares, for example. To fit the model by K-nearest neighbors, KNN, at any point we use the average response of the K nearest points in the predictor space. Let's consider an example in simple regression (one predictor).

83 Consider our Simulated Data

[Figure: the simulated data scatter plot, y versus x]

84 KNN Prediction

Let K = 3. Now, let's find ŷ(x) when x = 4 and when x = 8. The R code below will do that and then plot it. I've written a simple function, just to illustrate function writing in R.

85

KNNprediction <- function(x, y, K, x0) {
  dist <- abs(x - x0)          # distances from x0 to each observed x
  Kn <- order(dist)[1:K]       # indices of the K nearest neighbors
  yhatx0 <- mean(y[Kn])        # average their responses
  return(yhatx0)
}

K <- 3
x0 <- 4
arrows(x0, 1, x1 = x0, y1 = 0, length = 0.1, col="red")
yhatx0 <- KNNprediction(x, y, K, x0)
lines(c(x0 - .2, x0 + .2), c(yhatx0, yhatx0), col="red")
x0 <- 8
arrows(x0, 1, x1 = x0, y1 = 0, length = 0.1, col="red")
yhatx0 <- KNNprediction(x, y, K, x0)
lines(c(x0 - .2, x0 + .2), c(yhatx0, yhatx0), col="red")

86 KNN Predictions at x = 4 and at x = 8

[Figure: the simulated data with the two KNN predictions marked by short horizontal segments]

87 KNN Regression

The length of the horizontal lines depends on the next nearest neighbors. Continuing in this fashion, the regression fit would be a step function; the sketch below draws it over a grid. The idea of course extends to higher dimensions. We would get a set of hyperplane pieces making a step function. It works whether the data are linear or not.
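Evaluating the KNN prediction over a fine grid draws the whole step function (a sketch using the KNNprediction function above, assuming the simple-regression x and y):

xs <- seq(xlo, xhi, by = 0.05)
yhats <- sapply(xs, function(x0) KNNprediction(x, y, 3, x0))
plot(x, y, main = "KNN Regression, K = 3")
lines(xs, yhats, col = "red")    # the fitted step function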
