Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression


Megan Nash
5 years ago

1 Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression

2 Algorithms for Predictive Modelling: Contents
Regression
Classification
Auxiliary topics: estimation of prediction error, reduction of dimensionality

3 Predictive Modelling: Terminology
Given input data consisting of n observations, $\{(x_1, y_1), \ldots, (x_n, y_n)\}$:
1. find a model of the relationship between $Y$ and $X_1, X_2, \ldots, X_d$;
2. estimate the predictive performance of the model for new data, where the model takes $X_1, X_2, \ldots, X_d$ as input and predicts the unknown $Y$.
$X_i$: input variable (other names: independent variable, predictor, regressor, explanatory variable, carrier, factor, covariate).
$Y$: target variable (also: response).

4 Linear Methods for Regression
We seek $Y(X)$ assuming that the relationship is linear (this assumption simplifies the computations needed to fit the model). Many nonlinear problems can be modelled with linear regression by applying transformations to the variables.
We will discuss the following problems:
fitting the model to data;
verifying goodness of fit;
whether the model should include all the features $X_1$ through $X_d$ or only the best features (especially important for high-dimensional data).

5 Linear Methods for Classification
Regression can also be used for classification (logistic regression). The linearity assumption is also important in classification, e.g. Linear Discriminant Analysis (LDA) and separating hyperplanes (perceptron algorithm).

6 Theoretical Background: Statistical Decision Theory
Notation:
$X \in \mathbb{R}^d$: input variables (random variables)
$Y \in \mathbb{R}$: output variable (random variable)
$\Pr(X, Y)$: joint probability distribution
We look for a function $f(x)$ for predicting the value of $Y$.

7 Theoretical Background: Statistical Decision Theory
Criterion: the function should minimize the expected squared error:
$\mathrm{EPE}(f) = E\,(Y - f(X))^2$
Solution (the regression function):
$f(x) = E(Y \mid X = x)$

8 Theoretical Background: Statistical Decision Theory
Notice: if the criterion is to minimize the $L_1$ norm, $E\,|Y - f(X)|$, then the solution is the conditional median:
$f(x) = \mathrm{median}(Y \mid X = x)$
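The two criteria can be checked numerically at a single fixed x, where the conditional distribution of Y is just a sample of values. The sketch below is an illustration in Python, not part of the original slides; the sample `ys` and the search grid are made up. It searches a grid of constant predictions and compares the minimizers with the sample mean and median.

```python
from statistics import mean, median

def squared_loss(c, ys):
    """Sum of squared errors when every y is predicted by the constant c."""
    return sum((y - c) ** 2 for y in ys)

def absolute_loss(c, ys):
    """Sum of absolute errors (L1 criterion) for the constant prediction c."""
    return sum(abs(y - c) for y in ys)

# Made-up sample of Y values observed at one fixed x; 100.0 acts as an outlier.
ys = [1.0, 2.0, 3.0, 4.0, 100.0]

# Candidate constant predictions 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]

best_sq = min(candidates, key=lambda c: squared_loss(c, ys))
best_abs = min(candidates, key=lambda c: absolute_loss(c, ys))

print(best_sq, mean(ys))     # squared error is minimized at the mean
print(best_abs, median(ys))  # absolute error is minimized at the median
```

The outlier pulls the squared-error minimizer far from most of the sample, while the absolute-error minimizer stays at the middle observation.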

9 From Theory to Practice
The regression function $f(x) = E(Y \mid X = x)$ is based on a known joint probability distribution of $X$ and $Y$.
How can $f$ be estimated from data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$?
Different approaches:
Parametric: build a model of $f(x)$
Nonparametric

10 Linear Regression
We assume that $f(x) = E(Y \mid X = x)$ is a linear function of $X_1, X_2, \ldots, X_d$:
$f(X) = \beta_0 + \sum_{j=1}^{d} X_j \beta_j$
where $\beta = [\beta_0, \beta_1, \ldots, \beta_d]$ is the vector of unknown model coefficients.
As $X_j$ we can take:
quantitative input variables;
nonlinear transformations of inputs (e.g., log);
polynomial terms, e.g., $X_2 = X_1^2$, etc.;
numerically coded levels of a qualitative variable.
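As a small illustration of the four kinds of inputs listed above, the following Python sketch builds one row of model inputs from raw values. The function name, the raw variables (`height_cm`, `age_years`, `sex`) and the 0/1 coding are hypothetical choices made for this example only.

```python
import math

def design_row(height_cm, age_years, sex):
    """Build one row of model inputs X_1..X_4 from raw values (hypothetical names)."""
    return [
        height_cm,                    # quantitative input used as-is
        math.log(age_years),          # nonlinear transformation of an input
        height_cm ** 2,               # polynomial term
        1.0 if sex == "F" else 0.0,   # numerically coded level of a qualitative variable
    ]

row = design_row(120.0, 8.0, "F")
print(row)
```

The model stays linear in the coefficients even though some inputs are nonlinear functions of the raw variables.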

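11 Fitting the Model
Estimation of $\beta = [\beta_0, \beta_1, \ldots, \beta_d]$ based on $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ by minimization of the residual sum of squares $\mathrm{RSS}(\beta)$:
$\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \bigl(y_i - \beta_0 - \sum_{j=1}^{d} x_{ij} \beta_j\bigr)^2$
Solution:
$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
where $\mathbf{X}$ is the $n \times (d+1)$ matrix of inputs with a leading column of ones and $\mathbf{y}$ is the vector of observed targets.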
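A minimal sketch of the least-squares fit in plain Python, assuming the normal equations $(X^T X)\beta = X^T y$ are solved by Gaussian elimination. The helper names `solve` and `fit_linear` and the data are made up for illustration; statistical packages use more numerically robust methods.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_linear(rows, y):
    """Return [beta_0, ..., beta_d] minimizing the residual sum of squares."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 for the intercept
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)

# Made-up data generated exactly from y = 1 + 2*x1 - 3*x2 (no noise).
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
y = [1 + 2 * a - 3 * b for a, b in rows]

beta = fit_linear(rows, y)
print(beta)  # close to [1, 2, -3] up to rounding
```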

12 Verifying the Model
Goodness of fit must be verified before we attempt to use the model for prediction. How can we check whether the model fits the data well?
Regression procedures in software packages (SAS, SPSS, etc.) offer several tools for this: diagnostic plots, hypothesis tests of whether the parameters are significant, measures of residual error, etc. We illustrate these with an example.

13 Linear Regression Example
How does the weight of children depend on height and age?
Solution using the SAS procedure reg:
    proc reg data=kids;
      model weight=height age;          /* fit model w = f(h,a) */
      plot weight*age weight*height;    /* diagnostic plots for model verification */
      plot r.*p.;                       /* residual vs predicted values */
    run;

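14 Linear Regression Example
Solution: weight = b0 + b1 x height + b2 x age, with the coefficient values given by the Parameter Estimate column of the SAS output.
Parameter Estimates table (columns: Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|): for Intercept, height and age, Pr > |t| is < .0001.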

15 Linear Regression Example
Overall measures concerning the fit:
Root MSE: square root of the mean squared error of the regression.
R-Square: 63% of the variance in the data is explained by the regression model (a value close to 1 implies that the model is appropriate).
The SAS output also reports Dependent Mean, Adj R-Sq and Coeff Var.

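16 Linear Regression Example
$SS_{total} = \sum_{j=1}^{n} (y_j - \bar{y})^2$
$SS_{reg} = \sum_{j=1}^{n} (f(x_j) - \bar{y})^2$
$SS_{err} = \mathrm{RSS} = \sum_{j=1}^{n} (y_j - f(x_j))^2$
$R^2 = 1 - \frac{SS_{err}}{SS_{total}} = \frac{SS_{total} - SS_{err}}{SS_{total}}$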
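The sums of squares and R-Square can be computed directly from the observed and fitted values. The sketch below is an illustrative Python version with made-up data and a one-variable least-squares line; the function name `sums_of_squares` is hypothetical and the numbers are not those of the kids data set.

```python
def sums_of_squares(y, fitted):
    """Return (SS_total, SS_reg, SS_err, R^2) for observed y and fitted values."""
    y_bar = sum(y) / len(y)
    ss_total = sum((yj - y_bar) ** 2 for yj in y)
    ss_reg = sum((fj - y_bar) ** 2 for fj in fitted)
    ss_err = sum((yj - fj) ** 2 for yj, fj in zip(y, fitted))
    return ss_total, ss_reg, ss_err, 1 - ss_err / ss_total

# Made-up data for a single input variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

# Least-squares line for one input: slope = S_xy / S_xx.
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sum((a - x_bar) ** 2 for a in x)
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * a for a in x]

ss_total, ss_reg, ss_err, r2 = sums_of_squares(y, fitted)
print(ss_total, ss_reg, ss_err, r2)
```

For a least-squares fit with an intercept, SS_total splits into SS_reg plus SS_err, which is why the two forms of R-Square agree.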

17 Linear Regression Example
Testing whether the parameters of the model are significant:
t Value: test of the hypothesis $H_0: \beta_j = 0$.
p-value < 0.05: reject $H_0$, i.e. the model parameters are significant.
Parameter Estimates table (columns: Variable, DF, Parameter Estimate, Standard Error, t Value, Pr > |t|): for Intercept, height and age, Pr > |t| is < .0001.

18 Linear Regression Example
Diagnostic plots: visually verify the linearity assumption (weight vs age).

19 Linear Regression Example
Diagnostic plots: visually verify the linearity assumption (weight vs height).

20 Linear Regression Example
Diagnostic plots: residual vs predicted values.
Residual = $y_{observed} - y_{predicted}$
A trend in the shape of this plot may indicate model inadequacy.

21 Linear Regression: Finding Best Regressors (Features)
Problem: should the regression be based on all features X, or should only the best features be found?
    proc reg data=kids;
      model weight=height age / selection=forward;
      model weight=height age / selection=backward;
      model weight=height age / selection=stepwise;
    run;
Forward: parameters are added to the model one after another.
Backward: starting with the complete model, parameters are removed.
Stepwise: similar to forward, but a parameter can be removed after being added to the model.
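The forward strategy can be sketched in Python as a greedy loop. This is a simplified illustration, not the SAS algorithm: SAS decides entry with significance tests, whereas this sketch adds the feature that lowers RSS most and stops when the drop falls below an absolute threshold `min_gain` (an assumed parameter). All names and data are made up.

```python
from itertools import product

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def rss(rows, y, cols):
    """RSS of the least-squares fit using an intercept and the feature columns in cols."""
    X = [[1.0] + [r[c] for c in cols] for r in rows]
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2 for x, yi in zip(X, y))

def forward_selection(rows, y, min_gain=1e-6):
    """Greedily add the feature that lowers RSS most; stop when the drop is below min_gain."""
    remaining = list(range(len(rows[0])))
    selected = []
    current = rss(rows, y, selected)  # intercept-only model
    while remaining:
        best = min(remaining, key=lambda c: rss(rows, y, selected + [c]))
        new = rss(rows, y, selected + [best])
        if current - new < min_gain:
            break
        selected.append(best)
        remaining.remove(best)
        current = new
    return selected

# Made-up data: three 0/1 features; y depends on features 0 and 1 only.
rows = list(product([0.0, 1.0], repeat=3))
y = [1 + 2 * r[0] + 5 * r[1] for r in rows]

selected = forward_selection(rows, y)
print(selected)  # feature 1 enters first (largest RSS drop), then feature 0
```

Backward elimination would run the same loop in reverse, starting from all features and removing the one whose removal raises RSS least.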

22 Linear Regression: Finding Best Regressors
Forward Selection: Step 1. Variable height entered; the output reports R-Square and C(p) for this step.
Analysis of Variance table (columns: Source, DF, Sum of Squares, Mean Square, F Value, Pr > F; rows: Model, Error, Corrected Total): for Model, Pr > F is < .0001. The F test checks $H_0$ that the smaller model is appropriate.
Parameter table (columns: Variable, Parameter Estimate, Standard Error, Type II SS, F Value, Pr > F): for Intercept and height, Pr > F is < .0001.

23 Linear Regression: Finding Best Regressors
Forward Selection: Step 2. Variable age entered; the output reports R-Square and C(p) for this step.
Analysis of Variance table (columns: Source, DF, Sum of Squares, Mean Square, F Value, Pr > F; rows: Model, Error, Corrected Total): for Model, Pr > F is < .0001.
Parameter table (columns: Variable, Parameter Estimate, Standard Error, Type II SS, F Value, Pr > F): for Intercept, height and age, Pr > F is < .0001.

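24 Feature selection with regularization (Lasso, elastic net)
Methods inspired by ridge regression (a modified regression for high-dimensional data).
Ridge regression: we fit the model by minimizing RSS while limiting the size of the coefficients $\beta$:
$\hat{\beta}^{ridge} = \arg\min_{\beta} \mathrm{RSS}(\beta)$ subject to $\sum_{j=1}^{d} \beta_j^2 \le t$
(a constraint on the $L_2$ norm).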
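The constrained problem above is commonly solved in its equivalent penalized form, minimizing $\mathrm{RSS}(\beta) + \lambda \sum_j \beta_j^2$, where each bound $t$ corresponds to some $\lambda \ge 0$. The Python sketch below assumes that form, centers the data so the intercept is not penalized, and solves $(X^T X + \lambda I)\beta = X^T y$. The helper names and data are made up for illustration.

```python
from itertools import product

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge(rows, y, lam):
    """Return [beta_0, beta_1, ..., beta_d]; the intercept beta_0 is not penalized."""
    n, d = len(rows), len(rows[0])
    x_bar = [sum(r[j] for r in rows) / n for j in range(d)]
    y_bar = sum(y) / n
    Xc = [[r[j] - x_bar[j] for j in range(d)] for r in rows]  # centered inputs
    yc = [yi - y_bar for yi in y]                             # centered target
    A = [[sum(x[i] * x[j] for x in Xc) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]               # X^T X + lam * I
    b = [sum(x[i] * yi for x, yi in zip(Xc, yc)) for i in range(d)]
    slopes = solve(A, b)
    intercept = y_bar - sum(s * m for s, m in zip(slopes, x_bar))
    return [intercept] + slopes

# Made-up data: two uncorrelated 0/1 inputs, y = 1 + 2*x1 + 5*x2.
rows = list(product([0.0, 1.0], repeat=2))
y = [1 + 2 * a + 5 * b for a, b in rows]

ols = ridge(rows, y, lam=0.0)     # lam = 0 reproduces ordinary least squares
shrunk = ridge(rows, y, lam=1.0)  # both slopes shrink, neither becomes zero
print(ols, shrunk)
```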

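25 Ridge regression shrinks the parameters $\beta_1, \ldots, \beta_d$ proportionally and does not lead to selection of features.
Lasso: similar to ridge regression, with the constraint on $\beta$ formulated as an $L_1$ norm:
$\hat{\beta}^{lasso} = \arg\min_{\beta} \mathrm{RSS}(\beta)$ subject to $\sum_{j=1}^{d} |\beta_j| \le t$
Lasso leads to feature selection: some coefficients $\beta_1, \ldots, \beta_d$ are reduced to 0.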
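The difference between proportional shrinkage and selection is easiest to see in the special case of orthonormal input columns, where both estimators have closed forms in terms of the least-squares coefficients (see Hastie, Tibshirani, Friedman). The Python sketch below assumes that special case and the penalized form with parameter `lam`, whose scale depends on how the penalty is written. The coefficient values are made up.

```python
import math

def ridge_shrink(b, lam):
    """Ridge under orthonormal inputs: every coefficient is scaled by 1 / (1 + lam)."""
    return b / (1.0 + lam)

def lasso_shrink(b, lam):
    """Lasso under orthonormal inputs: soft thresholding, sign(b) * max(|b| - lam, 0)."""
    shrunk = max(abs(b) - lam, 0.0)
    return 0.0 if shrunk == 0.0 else math.copysign(shrunk, b)

# Made-up least-squares coefficients.
ols = [3.0, -0.5, 0.2, -2.0]
lam = 1.0

ridge_coefs = [ridge_shrink(b, lam) for b in ols]
lasso_coefs = [lasso_shrink(b, lam) for b in ols]

print(ridge_coefs)  # all coefficients halved, none exactly zero
print(lasso_coefs)  # small coefficients set exactly to zero
```

With general (correlated) inputs there is no such closed form for the lasso, and iterative algorithms are used instead.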

26 Modification of $\beta$ (2D example: $\beta_1, \beta_2$) (ref. Hastie, Tibshirani, Friedman)

27 Modification of $\beta$ (2D example: $\beta_1, \beta_2$) (ref. Hastie, Tibshirani, Friedman)

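28 Elastic net
Similar to Lasso, with the constraint on the parameters expressed as a weighted combination of the $L_1$ and $L_2$ norms:
$\min_{\beta} \mathrm{RSS}(\beta)$ subject to $\alpha \sum_{j=1}^{d} |\beta_j| + (1 - \alpha) \sum_{j=1}^{d} \beta_j^2 \le t$, with $0 \le \alpha \le 1$.
Lasso vs elastic net: consider a subset of related features. Lasso tends to select one feature from the subset; elastic net tends to select the whole subset (the whole subset is in or out).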

29 Ridge regression vs Lasso vs E-net: shapes of the constraint regions for $\beta$

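30 Regression vs nonparametric approach
Linear regression: $f(x) = E(Y \mid X = x)$, under the assumption that it is a linear function of $X_1, \ldots, X_d$.
Nearest neighbours method: $\hat{f}(x) = \mathrm{Ave}(y_i \mid x_i \in N_k(x))$, where $N_k(x)$ is the neighbourhood of $x$ containing the $k$ points closest to $x$. It attempts to estimate $f(x) = E(Y \mid X = x)$ directly from data.
Theorem. If $n, k \to \infty$ and $k/n \to 0$, then $\hat{f}(x) \to E(Y \mid X = x)$.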
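A minimal Python sketch of the nearest neighbours estimate, assuming Euclidean distance and a plain average over the k closest training points. The function name `knn_regress` and the data are made up for illustration.

```python
import math

def knn_regress(x, xs, ys, k):
    """Estimate f(x) as the average of y over the k training points closest to x."""
    nearest = sorted(range(len(xs)), key=lambda i: math.dist(x, xs[i]))[:k]
    return sum(ys[i] for i in nearest) / k

# Made-up one-dimensional training data; each input is a 1-tuple.
xs = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
ys = [0.0, 1.0, 2.0, 3.0, 10.0]

pred = knn_regress((1.2,), xs, ys, k=3)
print(pred)  # average of y over the neighbours at x = 1, 2 and 0
```

Small k gives a local but noisy estimate and large k a smoother but more biased one, which is the trade-off behind the condition that k grows while k/n goes to 0.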
