On Prediction
Jussi Hakanen, Post-doctoral Researcher
jussi.hakanen@jyu.fi
Learning outcomes
- To understand the basic principles of prediction
- To understand linear regression in prediction
- To be aware of the connection between least squares and optimization
Exercise
- Find out issues related to prediction in data mining
- Work in pairs or small groups
- Time: 10 minutes
TIES445 Data mining (guest lecture)
Summary of the exercise
- How to measure the quality of a prediction?
- How to avoid overlearning/overfitting? Underfitting?
- Supervised learning
- How to predict for missing data?
- Different applications (stock markets, biology, sport prediction, income in finance, energy and water consumption, …)
- Bias of the model
- Concept drift (inaccuracy of the prediction)
- Obtaining new knowledge from a given data set
Motivation
- How to estimate data: within the range of the dataset? Outside of it?
- Handling missing data and outliers
- Predictive vs. descriptive models
- Prediction of numerical values (cf. classification)
Concepts
- Predictor (or independent, or input) variables x = (x_1, …, x_N)^T (N ≥ 1)
- Response (or dependent, or output) variables y = (y_1, …, y_M)^T (M ≥ 1)
- Regression model/function: a model/function describing the prediction used
- Linear regression: the regression model/function is linear
Prediction
- A set of data for which the values of the predictor and response variables are known: P data points (x^j, y^j), j = 1, …, P, where x^j = (x_1^j, …, x_N^j)^T and y^j = (y_1^j, …, y_M^j)^T
- The idea is to use a prediction model to predict the response for predictor values whose response is unknown
- Note: P should be greater than or equal to N!
- Interpolation vs. extrapolation: can give misleading results if not interpreted carefully!
- Accuracy is important: different measures of accuracy can be used, e.g., to choose between different models and/or to choose values for parameters in the models; accuracy can sometimes be sacrificed for a simpler model
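The danger of extrapolation mentioned above can be illustrated with a small sketch (hypothetical data, NumPy assumed available): a polynomial that interpolates well inside the data range can be wildly wrong outside it.

```python
import numpy as np

# Extrapolation warning: a model that fits well inside the data range
# can be badly wrong outside it (hypothetical illustration).
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x)

# A degree-7 polynomial fits the observed range [0, 1] closely...
p = np.poly1d(np.polyfit(x, y, deg=7))

inside = abs(p(0.5) - np.sin(np.pi))    # error within the data range: small
outside = abs(p(2.0) - np.sin(4 * np.pi))  # error far outside: huge
print(inside, outside)
```

The interpolation error stays near zero, while the prediction at x = 2 is off by orders of magnitude.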
Regression analysis
- Used for prediction and forecasting
- Parametric vs. non-parametric regression: in parametric regression, the regression function depends on a finite number of unknown parameters; in non-parametric regression, the regression function belongs to a set of functions (which can be infinite dimensional)
- Linear vs. nonlinear regression: linearity is with respect to the parameters
Linear regression
- Model is linear w.r.t. the parameters, not necessarily linear w.r.t. the predictor variables: ŷ = a_0 + Σ_{i=1}^N a_i x_i, where ŷ is a predicted estimate of the mean value at x and a_0, …, a_N are parameters
- Oldest and most widely used approach due to its simplicity
- Typically the model used is not exact, so an error exists: y^j = ŷ^j + e^j for each data point x^j, j = 1, …, P
- In matrix terms: y = Xa + e
- How to select values for the parameters a?
Least squares
- Determining orbits of bodies around the Sun from astronomical observations (Legendre, 1805; Gauss, 1809)
- Idea: minimize the sum of the squared errors, for problems with P > N:
  Σ_{j=1}^P (e^j)^2 = Σ_{j=1}^P (y^j − Σ_{i=0}^N a_i x_i^j)^2
  min_a Σ_{j=1}^P (y^j − Σ_{i=0}^N a_i x_i^j)^2
  (here x_0^j = 1, so that a_0 is the intercept)
- An optimization problem!
- The parameter values minimizing the above can be shown to be a = (X^T X)^{-1} X^T y
- The direct solution requires X^T X to be invertible (problems if P is small or there are linear dependences between the x_i)
- Typically a is computed by numerical linear algebra
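As a minimal sketch, the least-squares fit can be computed with NumPy on synthetic data (the coefficients 0.7 and 3.0 below are made up for illustration); `numpy.linalg.lstsq` is used instead of forming the inverse (X^T X)^{-1} explicitly, which is the numerically preferable route mentioned on the slide.

```python
import numpy as np

# Synthetic data from a known linear model y = 0.7 + 3.0*x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 0.7 + 3.0 * x + rng.normal(scale=0.1, size=x.size)

# Design matrix X with a column of ones for the intercept a_0.
X = np.column_stack([np.ones_like(x), x])

# Solves min_a ||y - Xa||^2; equivalent to a = (X^T X)^{-1} X^T y
# but computed stably via an orthogonal factorization.
a, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a)  # close to [0.7, 3.0]
```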
Example
y = 0.6777 + 3.0166 x
Example (cont.)
y = 4.1579 − 0.0057 x + 0.3053 x^2
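A quadratic model like the one above is still *linear* regression, because it is linear in the parameters a_0, a_1, a_2: the same least-squares machinery applies once x^2 is added as a column of the design matrix. A sketch with synthetic data (the true coefficients below are made up, not those of the slide):

```python
import numpy as np

# Synthetic data from y = 4.0 - 0.01*x + 0.3*x^2 plus noise.
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 40)
y = 4.0 - 0.01 * x + 0.3 * x**2 + rng.normal(scale=0.05, size=x.size)

# Design matrix with columns 1, x, x^2: nonlinear in x, linear in a.
X = np.column_stack([np.ones_like(x), x, x**2])
a, *_ = np.linalg.lstsq(X, y, rcond=None)
print(a)  # close to [4.0, -0.01, 0.3]
```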
Notes
- The parameter values in linear regression can be interpreted as follows: if the value of predictor variable x_i is increased by one unit and the values of the other predictor variables remain the same, then a_i is the resulting change in the prediction ŷ = a_0 + Σ_{i=1}^N a_i x_i
Connection to optimization
- Least squares is an unconstrained optimization problem:
  min_a (1/2) Σ_{j=1}^P (f_j(a))^2 = min_a (1/2) ||f(a)||^2
- Function f_j(a) = y^j − h(a, x^j), where h(a, x) is the model used; e.g., h(a, x) = a_0 + Σ_{i=1}^N a_i x_i
- Gauss–Newton method: linearize f by a first-order Taylor expansion, f(a) ≈ f(a^h) + ∇f(a^h)^T (a − a^h), giving the iteration
  a^{h+1} = a^h − (∇f(a^h) ∇f(a^h)^T)^{-1} ∇f(a^h) f(a^h)
- Connection to Newton's method: the Hessian of (1/2)||f(a)||^2 is ∇f(a^h) ∇f(a^h)^T + Σ_{j=1}^P ∇^2 f_j(a^h) f_j(a^h), so Gauss–Newton is equivalent to Newton's method except for the omitted second-order term
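The Gauss–Newton iteration above can be sketched in a few lines. The model h(a, x) = a_0 · exp(a_1 · x) below is a hypothetical nonlinear example chosen for illustration (for the linear model of the slides, one Gauss–Newton step already solves the problem exactly):

```python
import numpy as np

# Synthetic data from the nonlinear model h(a, x) = a0 * exp(a1 * x).
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 30)
y = 2.0 * np.exp(1.5 * x) + rng.normal(scale=0.01, size=x.size)

def residuals(a):
    # f_j(a) = y^j - h(a, x^j)
    return y - a[0] * np.exp(a[1] * x)

def jacobian(a):
    # J[j, i] = d f_j / d a_i (note the minus sign from f = y - h)
    e = np.exp(a[1] * x)
    return np.column_stack([-e, -a[0] * x * e])

a = np.array([1.0, 1.0])  # starting guess
for _ in range(20):
    J, f = jacobian(a), residuals(a)
    # Gauss-Newton step: solve (J^T J) step = -J^T f (no explicit inverse).
    step = np.linalg.solve(J.T @ J, -J.T @ f)
    a = a + step
print(a)  # close to [2.0, 1.5]
```

Note that J^T J here plays the role of ∇f ∇f^T in the slide's notation: the second-order term Σ ∇²f_j f_j is dropped, which works well when the residuals at the solution are small.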
Function approximation
- Prediction can be used in optimization to approximate the objective function
- Typically used when evaluating the objective function is time consuming, e.g., if the model is a partial differential equation that takes a significant amount of time to solve numerically
- Reduces optimization time, since typically a large number of function evaluations are required
- Examples of approximation models: polynomial approximation, radial basis functions (RBFs), Kriging, support vector regression
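A minimal sketch of the surrogate idea, using polynomial approximation (the simplest of the model families listed above). The "expensive" objective here is just a stand-in function; in practice it could be a PDE solver:

```python
import numpy as np

def expensive_objective(x):
    # Stand-in for a costly simulation (hypothetical example).
    return np.sin(3 * x) + 0.5 * x**2

# Evaluate the expensive function at a small number of sample points only.
xs = np.linspace(-1, 1, 10)
ys = expensive_objective(xs)

# Fit a cheap degree-5 polynomial surrogate by least squares.
surrogate = np.poly1d(np.polyfit(xs, ys, deg=5))

# The surrogate can now be evaluated thousands of times during
# optimization at negligible cost, inside the sampled range [-1, 1].
x_new = 0.3
print(surrogate(x_new), expensive_objective(x_new))
```

As the earlier extrapolation warning suggests, such a surrogate is only trustworthy inside the region covered by the samples.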
Regularization
- Previously, no requirements were made on the parameter values (an unconstrained optimization problem); e.g., there may be a need to constrain the size of the parameters
- Tikhonov regularization (ridge regression): add a constraint that ||a||_2, the L2 norm of the parameter vector, is not greater than a given value; can be treated as an unconstrained optimization problem by adding a penalty term β||a||_2^2 to the objective function
- Lasso (least absolute shrinkage and selection operator): add a constraint that ||a||_1, the L1 norm of the parameter vector, is not greater than a given value; prefers solutions with fewer non-zeros
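For ridge regression the penalized problem still has a closed-form solution, a = (X^T X + βI)^{-1} X^T y; a minimal sketch on synthetic data (the coefficients and β value are made up for illustration; the lasso, by contrast, has no closed form and needs an iterative solver):

```python
import numpy as np

# Synthetic data from y = 1.0 + 2.0*x plus noise.
rng = np.random.default_rng(3)
x = np.linspace(0, 5, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=x.size)

X = np.column_stack([np.ones_like(x), x])
beta = 0.1  # regularization strength (hypothetical choice)

# Ridge solution: a = (X^T X + beta*I)^{-1} X^T y.
a_ridge = np.linalg.solve(X.T @ X + beta * np.eye(X.shape[1]), X.T @ y)
print(a_ridge)  # close to [1.0, 2.0], slightly shrunk toward zero
```

A side benefit: X^T X + βI is always invertible for β > 0, which fixes the invertibility problem of plain least squares when P is small or the predictors are linearly dependent.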
Conclusions
- What were the key points from your perspective? What do you remember best?
- Extrapolation and interpolation can be dangerous
- Regularization is important
Thank You!
Dr. Jussi Hakanen
Industrial Optimization Group, http://www.mit.jyu.fi/optgroup/
Department of Mathematical Information Technology
P.O. Box 35 (Agora), FI-40014 University of Jyväskylä
jussi.hakanen@jyu.fi, http://users.jyu.fi/~jhaka/en/