Notation: [pencil icon] means a pencil-and-paper QUIZ; [keyboard icon] means a coding QUIZ.

QUIZ:
- Agree or disagree: Regression can always be reduced to classification. Explain, either way!
- A certain classifier scores 98% on the training set, but only 55% on the testing set. What do you think happened?
- A certain classifier scores 59% on the training set, and 60% on the testing set. What do you think happened?
- A certain classifier scores 61% on the training set, and 85% on the testing set. What do you think happened?
- A certain classifier scores 93% on the training set, and 90% on the testing set. What do you think happened?
- Explain the difference between synthetic and empirical datasets. Why do we need both?
- Write code to create a wave dataset with 10 points. Display them in a plot.
- When extending the Boston dataset with the engineered features, the nr. of features exploded from 13 to 104. Explain! (Hint: Pairs of features are multiplied.)
- Why do we call KNN a lazy classifier?

II. Linear Models (pp.47-70)

Prediction is achieved by means of a linear function of the features, i.e. a function involving only additions and multiplications by constants.

Linear models for regression

When the target feature y is predicted based on only one other feature x[0], we have the simple formula

    y = w[0] * x[0] + b

This is the equation of a line, with the well-known slope (w[0]) and y-intercept (b).

1. Linear regression (ordinary least-squares = OLS)
A.k.a. ordinary least squares, because the parameters w and b are chosen so as to minimize the sum of the squares of the errors between the training data points and the values predicted by the model. Minimizing this sum is equivalent to minimizing what is known in statistics as the R.M.S. (Root Mean Square) Error, or RMSE:

    RMSE = sqrt( (1/n) * sum_i (y_i - y_hat_i)^2 )

QUIZ: Explain on the plot above! Calculate RMSE for the 3-point dataset and the line shown.

intercept_ is always a scalar: it is the intersection point between the line/plane/hyperplane and the target axis.

coef_ is conceptually a scalar (the slope) for a 1D model, but it is in general a 1D numpy array with as many elements as there are non-target features in the dataset.

The relatively low train and test scores indicate underfitting, but here we have no choice, because in OLS we cannot control the model complexity.

p.49: Linear regression has no parameters [set by the user]
A better way to put it is: OLS has no hyper-parameters.
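For concreteness, here is a minimal sketch of fitting OLS on the wave dataset and inspecting coef_, intercept_, and the train/test scores. It assumes the book's mglearn helper package is installed; the exact numbers depend on the random split.

```python
# Minimal sketch: fit ordinary least squares on the wave dataset and inspect the model.
import mglearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LinearRegression().fit(X_train, y_train)

print("coef_ (slope):", lr.coef_)        # 1D array, one entry per non-target feature
print("intercept_:", lr.intercept_)      # always a scalar
print("train R^2:", lr.score(X_train, y_train))
print("test  R^2:", lr.score(X_test, y_test))
```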
Boston housing dataset: What is your diagnosis? Overfitting! Why? Because the number of samples (506) is of the same order of magnitude as the number of features (104, including the derived ones). To avoid overfitting in linear regression, we need nr. samples >> nr. features.
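To see this diagnosis in numbers, a sketch along the lines of the book's example follows. It assumes mglearn's load_extended_boston helper, which in turn needs a scikit-learn version that still ships the Boston data.

```python
# Sketch: OLS on the extended Boston dataset (506 samples, 104 features).
# A large gap between train and test R^2 is the signature of overfitting.
import mglearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
print("train R^2:", lr.score(X_train, y_train))   # close to 1
print("test  R^2:", lr.score(X_test, y_test))     # noticeably lower
```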
Solutions:

Write code to create a wave dataset with 10 points. Display them in a plot. Note the behavior of the legend! (A possible code solution is sketched below, after this list.)

When extending the Boston dataset with the engineered features, the nr. of features exploded from 13 to 104. Explain! (Hint: Pairs of features are multiplied.) Each feature combines with itself, as well as with all the others, but each pair of features is only combined once. That gives 13 + 12 + ... + 2 + 1 = 13*14/2 = 91 new features, and 13 + 91 = 104 features in total.

Calculate RMSE for the 3-point dataset and the line shown.
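A possible solution to the wave-dataset coding quiz, assuming mglearn and matplotlib are available:

```python
# Sketch: create a wave dataset with 10 points and display it in a plot.
import matplotlib.pyplot as plt
import mglearn

X, y = mglearn.datasets.make_wave(n_samples=10)

plt.plot(X, y, "o", label="wave data (10 points)")
plt.xlabel("Feature")
plt.ylabel("Target")
plt.legend()   # observe where matplotlib places the legend with so few points
plt.show()
```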
2. Ridge regression (RR)

Same formula, but the coefficients are calculated by minimizing not only the RMSE, but also the sum of the squared coefficients, which is a measure of the complexity of the model:

    New error = RMSE + alpha * sum_j (w_j)^2

alpha is a new parameter (hyperparameter), called the regularization parameter, that the user can set in order to control the model complexity:
- small alpha -> little regularization -> high complexity -> possible overfitting
- large alpha -> ...
- default alpha: 1

The proof is in the pudding: let us use RR on the same extended_boston dataset. Remember the previous scores from ordinary regression. Now let us play with alpha (a sketch follows below). Conclusion?
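A sketch of the experiment, under the same assumptions as above (mglearn's load_extended_boston); the three alpha values are the ones typically compared in the book:

```python
# Sketch: Ridge regression on the extended Boston dataset for several values of alpha.
import mglearn
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [0.1, 1, 10]:
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha}: train R^2 = {ridge.score(X_train, y_train):.2f}, "
          f"test R^2 = {ridge.score(X_test, y_test):.2f}")
```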
Plotting coef_ (slopes) for the three alphas above: note that a large alpha leads to small coefficients (the triangles), which means a less complex model.
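A sketch of how such a coefficient plot can be produced; the alpha values and markers are my assumption, chosen to match the three-alpha experiment above (triangles for the largest alpha):

```python
# Sketch: plot coefficient magnitudes for three Ridge models with different alphas.
import matplotlib.pyplot as plt
import mglearn
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha, marker in [(0.1, "s"), (1, "o"), (10, "^")]:   # triangles = largest alpha
    ridge = Ridge(alpha=alpha).fit(X_train, y_train)
    plt.plot(ridge.coef_, marker, label=f"alpha={alpha}")

plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.legend()
plt.show()
```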
What is the effect of the size of the training set?
- The training score for RR is always smaller than for OLS. Why?
- Up to about 380 training points, OLS doesn't learn anything!
- RR does significantly better than OLS when the dataset is small.
- With enough data points, OLS eventually catches up with RR.

A little research: What exactly happens with the Linear Regression test scores below 380 points? I modified the mglearn code to print the actual values of the Linear Regression test scores. They are:

Now is a good time to go back to the definition of R^2 and understand how it is possible for it to have such huge negative values. A negative R^2 means the model predicts worse than simply using the average of the data points, so in that case it is recommended to use the simple average as the estimate!
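For reference, the score method of scikit-learn regressors returns R^2 = 1 - sum_i (y_i - y_hat_i)^2 / sum_i (y_i - y_mean)^2, so any model whose squared errors exceed those of the constant "predict the mean" baseline gets a negative score, and there is no lower bound. A tiny sketch with made-up numbers:

```python
# Sketch: R^2 becomes (arbitrarily) negative when the predictions are worse
# than simply predicting the mean of the targets.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0])
y_bad = np.array([10.0, -5.0, 20.0])   # far worse than predicting the mean (2.0)

print(r2_score(y_true, y_bad))         # about -208.5
```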
3. Lasso regression (LaR)

Ridge: minimize RMSE + alpha * sum_j (w_j)^2. The penalty uses the Euclidean norm, a.k.a. the L2 norm.
Lasso: minimize RMSE + alpha * sum_j |w_j|. The penalty uses the sum of absolute values, a.k.a. the L1 norm.

This Wikipedia picture explains why LaR sets some of the coefficients w to exactly zero, rather than just making them smaller (as RR does). [1]

[1] Source: https://en.wikipedia.org/wiki/lasso_(statistics)#geometric_interpretation

For this reason, LaR leads to a simpler model than RR: it identifies redundant features, which do not contribute to explaining the data, and eliminates them. Another way to look at it: it selects and retains only the important features.

Boston dataset:
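A sketch of Lasso with its default alpha = 1 on the extended Boston data (same assumptions as in the earlier sketches); most coefficients end up exactly zero:

```python
# Sketch: Lasso with the default alpha=1 on the extended Boston dataset.
import numpy as np
import mglearn
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso = Lasso().fit(X_train, y_train)               # alpha=1 by default
print("train R^2:", lasso.score(X_train, y_train))
print("test  R^2:", lasso.score(X_test, y_test))
print("features used:", np.sum(lasso.coef_ != 0))   # only a handful are non-zero
```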
See alpha = 1 above (blue squares). What is your diagnosis? Massive underfitting! Let us try to get closer to the optimum:
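A sketch with a much smaller alpha; max_iter is raised so the coordinate-descent optimizer converges:

```python
# Sketch: Lasso with a smaller alpha (less regularization, more features kept).
import numpy as np
import mglearn
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print("train R^2:", lasso001.score(X_train, y_train))
print("test  R^2:", lasso001.score(X_test, y_test))
print("features used:", np.sum(lasso001.coef_ != 0))
```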
Note: The text says that Lasso with alpha = 0.01 does slightly better than Ridge with alpha = 0.1... but in the examples we see that they both have the test score 0.77. How can we find out if this is true? A: Simply include more digits when displaying the result (a sketch of this check is given at the end of this section). Conclusion: Not true! (at least not on our random partition).

Conclusions on linear regression algorithms:
- Use OLS only when the nr. of data points is >> the nr. of features. Try to have at least 50 times more points than features to be safe from overfitting, e.g. for 2 features use at least 100 data points. (Of course, the distribution of those points is also important!)
- In general, RR is the first choice, unless... the nr. of features is large (hundreds!) and we expect many of them to be irrelevant (but we don't know which ones). In this case, use LaR.
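Finally, the more-digits check mentioned in the note above could look like the sketch below. The scores are full-precision floats; only the format string decides how many decimals are printed.

```python
# Sketch: print the Ridge and Lasso test scores with more decimal digits,
# so apparent ties like "0.77 vs 0.77" can actually be compared.
import mglearn
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge = Ridge(alpha=0.1).fit(X_train, y_train)
lasso = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)

print(f"Ridge(alpha=0.1)  test R^2: {ridge.score(X_test, y_test):.6f}")
print(f"Lasso(alpha=0.01) test R^2: {lasso.score(X_test, y_test):.6f}")
```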