Chapter 1. STATISTICAL REGRESSION

H. Vereecken* and M. Herbst

Institut Agrosphäre, ICG-IV, Forschungszentrum Jülich GmbH, Leo-Brandt-Straße, Jülich, Germany

*Corresponding author

Many of the available and well-established PTFs for the prediction of soil hydraulic properties from continuous soil properties are based on statistical regressions (Pachepsky et al., 1982; Cosby et al., 1984; Rawls and Brakensiek, 1985, 1989; Puckett et al., 1985; Vereecken et al., 1989, 1990; Wösten et al., 1997, 1999; Scheinost et al., 1997). Statistical regression is concerned with the analysis and construction of dependence structures between dependent (response) variables, such as the parameters describing the moisture retention curve, and independent (predictor) variables, e.g., bulk density or textural information. Depending upon the objectives, the regression process will differ, but a general modeling approach can be constructed, as proposed by Draper and Smith (1966). They distinguish three phases. The first phase is the planning stage: the problem is defined, the objectives are specified, a priori knowledge is screened and existing data are gathered; it includes a preliminary data analysis. The second phase is the genuine model building, in which the regression models are developed and tested against the objectives in an iterative way. The third phase is the validation of the obtained models, including the stability of the parameters, prediction over the sample space and evaluation of model adequacy.

1. OBJECTIVES OF STATISTICAL REGRESSIONS

In general, three main objectives can be distinguished when using statistical regressions to model relations between two sets of variables: (1) prediction; (2) model specification; (3) parameter estimation. In using regression analysis for prediction purposes, the concern is mainly to obtain the best possible estimate of the response variable.
DEVELOPMENTS IN SOIL SCIENCE, VOLUME 30. © 2004 Elsevier B.V. All rights reserved.

Correct model specification and parameter accuracy are then of secondary importance. In model specification, one is mainly interested in the relative importance of individual predictor variables for the predicted responses. This implies that all relevant variables should be available in the database and that the

model contains the correct functional form of the predictor variables. Only then can the predictor variables be correctly assessed (Gunst and Mason, 1980). Applying regression methodology to estimate parameters requires that the model is correctly specified, that the predictions are accurate and that the data allow a good estimation. Limitations of the database and the inability to measure all relevant predictors constrain the estimation of the parameters.

The choice of the objective criterion or criteria to evaluate the developed model is determined by the objectives: the objective function (criterion) should be the quantitative expression of the modeling objective. A widely accepted objective criterion is the coefficient of determination (r²), which evaluates the ability of the model to explain the variation in the data.

Most of the statistically based PTFs are either multiple linear regression equations or polynomials of nth order. Multiple linear regression is a common statistical tool used for the prediction of the response variable y from a number of n predictor variables x_i. A multiple linear regression equation can be written as (Herbst and Diekkrüger, 2002):

y = a + \sum_{i=1}^{n} b_i x_i + \varepsilon,   i = 1, ..., n   (1)

with the constant a (intercept), the regression coefficients b_i and the error \varepsilon. A nonlinear regression equation based on a second-order polynomial has the following form:

y = a + \sum_{i=1}^{n} (b_i x_i + c_i x_i²) + \varepsilon,   i = 1, ..., n   (2)

where, besides the intercept a, two regression coefficients b_i and c_i have to be determined for every predictor variable x_i (Rawls and Brakensiek, 1985).
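Both equations (1) and (2) are linear in their coefficients and can be fitted by ordinary least squares. A minimal sketch in Python; the predictors, response and coefficient values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictors: sand content (%) and bulk density (g cm^-3);
# the response and all coefficients are invented for illustration.
sand = rng.uniform(5, 90, 100)
bd = rng.uniform(1.1, 1.7, 100)
y = 0.8 - 0.002 * sand - 0.25 * bd + rng.normal(0, 0.01, 100)

# Equation (1): y = a + sum_i b_i x_i + eps, fitted by ordinary least squares
X1 = np.column_stack([np.ones_like(sand), sand, bd])
coef1, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Equation (2): the second-order polynomial adds a squared term per predictor
X2 = np.column_stack([np.ones_like(sand), sand, sand**2, bd, bd**2])
coef2, *_ = np.linalg.lstsq(X2, y, rcond=None)

print(coef1)  # approximately [0.8, -0.002, -0.25]
```

Note that the polynomial model of equation (2) is still fitted with linear least squares, since it is linear in a, b_i and c_i.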
2. PRELIMINARY ANALYSIS OF SOIL DATA

Different techniques are available to analyze the available soil data, ranging from simple descriptive statistics (first and second moments of the distribution, range) over graphical techniques (scatter plots) to multivariate statistical methods.

2.1. Simple data analysis

Scatter plots, plotting different soil properties of interest against selected response variables, are extremely useful in detecting trends and extreme measurements. The latter can also be done by means of biplots. One should be cautious, however, about deleting so-called outliers at this stage, except when it is clear that these observations are incorrectly specified or measured. Scatter plots give information on the linear or nonlinear behavior of variables and on the kind of transformation to be performed to eliminate nonlinearity. Transformations of the response variable are to be considered when different possible predictor variables show the same nonlinear behavior with respect to the response.

Figure 1. Scatter plots of (a) the saturated water content versus bulk density and (b) the log-transformed van Genuchten parameter α versus sand content, from the data set used by Vereecken et al. (1989).

Figure 1 gives two examples of scatter plots (Vereecken, 1988). Figure 1a reveals the linear relationship between the saturated water content θ_s (response) and bulk density (predictor), while Figure 1b exhibits the positive correlation between the sand content and the log-transformed van Genuchten α (Van Genuchten, 1980). With statistical inference and hypothesis testing in mind, it is interesting to examine the distributions of the response variables. Examination of the sample distributions gives information about transformations to obtain distributions closer to the normal distribution, which is a precondition for the statistical regression techniques explained in Section 3.

Frequently used numerical tests to evaluate the normality of a distribution are the Kolmogorov-Smirnov and Shapiro-Wilk statistics. The Kolmogorov-Smirnov statistic is usually used for data sets with more than 50 observations, the Shapiro-Wilk statistic otherwise. The one-sample Kolmogorov-Smirnov test calculates the D-value, the maximum absolute difference between the cumulative sample distribution and the cumulative distribution of a normal population. A two-sample Kolmogorov-Smirnov test is used to test whether two samples are drawn from the same population. Rather than measuring differences in the means and variances of the populations, the Kolmogorov-Smirnov statistic measures differences in shape. The test statistic is a function of the sample size and can be either one- or two-tailed, testing the null hypothesis that the sample data are random samples from a normal distribution. Critical values for a specific level of significance can be looked up in tables in order to decide whether or not the null hypothesis is to be rejected.
For sample sets with fewer than 50 observations the Shapiro-Wilk statistic should be applied. This statistic is the ratio of the best estimator of the variance to the usual corrected sum-of-squares estimator of the variance. The value of W ranges from 0 to 1, with small values leading to rejection of the null hypothesis that the sample is drawn from a normally distributed population.

Information about the shape of a distribution can be obtained from its third and fourth moments. The third moment measures the skewness S of the distribution:

S = [n / ((n − 1)(n − 2))] \sum_{i=1}^{n} (X_i − X̄)³ / s³   (3)

where s is the standard deviation, X_i is the observation value, X̄ is the mean value and n is the number of observations. A population is said to be positively or right skewed if the tail extends toward the largest values; a negatively or left skewed population has the tail toward the smallest values (Figure 2). For normally distributed data, the skewness equals zero.

Figure 2. Histogram (bars) and cumulative relative frequency (solid line) of van Genuchten's α (cm⁻¹) and the log-transformed α from the data set used by Vereecken et al. (1989), n = 182 observations.

The fourth moment of the distribution measures the heaviness of the tails. A heavily tailed distribution has a positive kurtosis; a flat distribution with short tails has a negative kurtosis:

K = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] \sum_{i=1}^{n} (X_i − X̄)⁴ / s⁴ − 3(n − 1)² / ((n − 2)(n − 3))   (4)

The measure of kurtosis K is zero for a normally distributed population.

An alternative to the numerical tests concerning the distribution of response variables is the use of graphical aids. Basically there are three types of data plots: (1) the stem and leaf plot; (2) the box or schematic plot; (3) the normal probability plot.
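Equations (3) and (4) translate directly into code; a sketch with synthetic data:

```python
import numpy as np

def skewness(x):
    # Equation (3): sample skewness S
    n = len(x)
    s = x.std(ddof=1)
    return n / ((n - 1) * (n - 2)) * np.sum(((x - x.mean()) / s) ** 3)

def kurtosis(x):
    # Equation (4): sample kurtosis K (zero for a normal population)
    n = len(x)
    s = x.std(ddof=1)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return lead * np.sum(((x - x.mean()) / s) ** 4) - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

rng = np.random.default_rng(2)
normal = rng.normal(size=5000)
skewed = rng.lognormal(size=5000)          # long right tail
print(skewness(normal), skewness(skewed))  # near zero vs. clearly positive
```

These are the same bias-corrected sample estimators used by most statistical packages.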

They often give a better overall view of the extent of non-normal behavior of the response variables; the examination of the distribution of variables is mainly based on the analysis of such plots.

The example in Table 1 reveals the strong increase in the W-value for the log-transformed α and n compared to the original variables (see also Figure 2).

Table 1. The W-value, skewness and kurtosis of the parameters θ_r, θ_s, α, n, ln(α) and ln(n) of the moisture retention characteristic of the data set used by Vereecken et al. (1989). Soil hydraulic properties are based on the Van Genuchten model (θ − θ_r)/(θ_s − θ_r) = (1 + |αh|^n)^(−m), with residual water content θ_r, saturated water content θ_s, inverse of the bubbling pressure α, shape parameters n and m, and pressure head h (Van Genuchten, 1980), with m = 1; N = 182 observations.

The distribution of θ_r resembles the normal distribution to the least extent, having the lowest W-value and high values of skewness and kurtosis. Various transformations, however, did not result in a much higher W-value. The high probabilities of rejecting the null hypothesis are likely a result of the sensitivity of the Shapiro-Wilk test to observations in the tails of the distributions.

2.2. Multivariate methods

Principal component analysis is a powerful tool for analyzing the structure of data matrices and the interdependence between variables. Its success depends upon the existence of correlations among at least some of the original variables. New variables, called principal components, are formed such that they are orthogonal to each other and uncorrelated. The first principal component explains the largest part of the variation; the second explains part of the remaining variation, and so on until all variation has been accounted for. In total, the number of components equals the number of variables.
Principal component analysis is mainly applied to the standardized data matrix X_s or to the correlation matrix R, which can be written as:

X_d = X − I X_m^T
X_s = X_d D^{−1/2}
C = X_d^T X_d / (N − 1)
R = D^{−1/2} C D^{−1/2}   (5)

where X is the data matrix, D is the diagonal matrix of the variances, I is the unit column vector, X_m^T is the row vector of the means of the different variables, X_d is the matrix of mean-corrected scores, C is the variance-covariance matrix and N is the number of observations. The choice between the X_s and R matrix depends upon the units the variables are measured in. Typically, when the variables have been measured in the same units, the variance-covariance matrix is preferred. Even then, some authors (Jollife, 1986) prefer working with the correlation matrix because it allows direct comparison between analysis results obtained from different sets of variables. Davis (1973), however, states that in geologic studies on the granulometric composition of materials, where the relative magnitudes of the variables are important, it is better to use the variance-covariance matrix.

Derivation of the principal components is based on the solution of the eigenvalue problem of the correlation matrix or the X_s matrix, subject to different constraints. For the correlation matrix, this can be expressed mathematically as:

(R − λ_j I) X_j = 0   (6)

with |R − λ_j I| = 0 for non-zero X_j vectors and X_j^T X_j = 1. In matrix notation this becomes:

(R − D_e I) U = 0   (7)

where the columns of U contain the eigenvectors X_j of R and D_e is a diagonal matrix containing the eigenvalues of R, which are equal to the variances of the respective principal components, such that:

R = U D_e U^{−1}   (8)

Because R is a symmetric matrix, U^T = U^{−1} and the previous equation becomes:

R = U D_e U^T   (9)

The orientation of the different principal component axes can be found from the columns of the rotation matrix U.
The non-standardized principal component scores Z are calculated as:

Z = X_s U   (10)

Standardizing these scores, so that each component explains the same amount of variability, is done as follows:

Z_s = X_s U D_e^{−1/2}   (11)

Principal component analysis resulting in standardized scores is sometimes given the name principal factor analysis. The component loading matrix, containing the correlation

of the original variables with the principal components, is calculated as:

F = U D_e^{1/2}   (12)

Reproduction of the original matrix of standardized scores can be obtained from:

X_s = Z_s F^T = Z_s (Z_s^T X_s)   (13)

Using equation (13), it is possible to reconstruct the original variables in a plot whose x- and y-axes are given by the first and second principal factors and the factor scores. The length of a reconstructed original variable is a measure of the success of its reconstruction, and the position and length of the reconstructed original variables are a measure of their correlation. These graphs, called biplots, enable the user to analyze the relations between the objects, described in the axes generated by the principal components, and the original variables.

Figure 3 is a biplot of the principal component analysis carried out for the data set of Vereecken et al. (1989). This plot of the reconstructed original variables and the individual observations on the first two principal factors exhibits, e.g., that the vectors representing the clay percentage and θ_r are very close to one another, confirming their positive correlation (see also Table 2). The values of ln(α), ln(n) and the sand fraction are positively correlated, indicating that the larger the amount of sand in the soil, the higher ln(α) and ln(n) become. The values of θ_s and the bulk density, pointing in opposite directions but lying on the same line, are strongly negatively correlated. The carbon content does not seem to be strongly correlated with any of the parameters. The first two principal factors together explain 64% of the variability.

3. MODEL BUILDING

The regression models are constructed in an iterative way. The most important step is the selection of the variables to enter the equation. This can be done either using a priori knowledge and hypothetical reasoning, or by trial and error using special regression techniques.
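The chain from equation (5) to equation (10) can be sketched as an eigendecomposition of the correlation matrix. The soil data below are synthetic, with correlations built in so that the first component dominates:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 182
# Hypothetical soil data: clay (%), sand (%) and bulk density, with
# built-in correlations so that the first component dominates
clay = rng.uniform(5, 60, n)
sand = 90 - clay + rng.normal(0, 5, n)
bd = 1.6 - 0.004 * clay + rng.normal(0, 0.05, n)
X = np.column_stack([clay, sand, bd])

# Equation (5): mean-correct, standardize, and form the correlation matrix R
Xd = X - X.mean(axis=0)
Xs = Xd / X.std(axis=0, ddof=1)
R = Xs.T @ Xs / (n - 1)

# Equations (6)-(9): eigendecomposition R = U D_e U^T
eigvals, U = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # sort components by variance
eigvals, U = eigvals[order], U[:, order]

Z = Xs @ U                                 # equation (10): component scores
explained = eigvals / eigvals.sum()
print(explained)  # fraction of variance per principal component
```

Because R is symmetric, `eigh` returns orthonormal eigenvectors, so U^T = U^{−1} holds as in equation (9).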
Routinely applied techniques to select variables are (Gunst and Mason, 1980): (1) all possible regression equations; (2) backward elimination; (3) forward selection; (4) the stepwise method; (5) the stagewise method. A discussion of each of these techniques can be found in many statistics books. When using these methods to construct regression models, care should be taken not to eliminate potential predictor variables. This is one of the reasons why it is important to check, during the data analysis, the mathematical relationship between the potential predictor variables and the response (e.g., exponential or square-root dependencies).

Once an acceptable model is obtained, it is examined. The general examination of a regression model incorporates the following topics: (1) verification of the error assumptions; (2) assessment of the goodness-of-fit of the equation; (3) examination of model misspecification;

Figure 3. Plot of the reconstructed original variables and the individual observations on the first two principal factors. The length of the reconstructed variables is tripled for clarity. log(α) = log_e(α); log(n) = log_e(n); BD = dry bulk density; C% = percent organic carbon. Data points are represented by the following symbols: U = heavy clay, E = clay, A = clay silt loam, L = sandy silt loam, P = light sand loam, S = loamy sand, Z = sand, according to the Belgian textural classes (Verheye and Ameryckx, 1984).

(4) determination of confidence intervals on the estimated regression coefficients; (5) determination of confidence intervals on the estimated response variables; (6) detection of outliers.

A special problem related to the estimation of parameters and their confidence intervals is multicollinearity. Multicollinearity is the problem of redundant information in the predictor set, and is defined as an approximate linear dependence between predictor variables; an extreme form is exact linear dependence. Multicollinearity can be examined pairwise by means of a correlation analysis, while multivariable collinearities can be detected by examination of the eigenvalues and eigenvectors of the X_s^T X_s matrix (Section 2.2). Small eigenvalues reveal multicollinearity, while the large scores of the variables for the corresponding eigenvectors identify the

Table 2. Correlation matrix of the predictor and response variables (θ_s, θ_r, log(α), log(n), clay, sand, silt, BD, C%) and the corresponding significance levels for the data set used by Vereecken et al. (1989); soil hydraulic properties are based on the Van Genuchten model with m = 1, n = 182 observations. Clay (< 2 μm), sand (> 50 μm), silt (2-50 μm); BD is the dry bulk density (g cm⁻³) and C% the carbon content.

variables involved. Multicollinearity has a deleterious effect on least squares parameter estimation, and regression methodology is very sensitive to this problem. Gunst and Mason (1980) list four aspects of regression analysis that are adversely affected by multicollinearity: (1) the numerical values of the estimated coefficients; (2) the variance-covariance matrix (variance inflation); (3) test statistics; (4) predicted responses.

Different strategies are available to handle multicollinearity. The first is to delete one of the involved parameters; the difficulty is often to decide which parameter to delete, especially when a good prediction is important. Another possibility is to use biased regression estimators, deleting the eigenvectors that define the multicollinearity. Principal component regression is based on the elimination of eigenvectors with small eigenvalues, resulting in a biased estimate. Another type of principal component regression is given by Freund and Littell (1986), where all predictor variables are transformed to principal components; because these are all uncorrelated, there is no need for a variable selection technique. Eigenvalue regression is conceptually the same as the first type of principal component regression, except that the eigenvalues and eigenvectors are extracted differently and the criteria for deleting multicollinear variables differ. Yet another alternative is ridge regression, where the effect of the eigenvectors defining the multicollinearity is strongly reduced by adding a small constant K to the diagonal elements of the X_s^T X_s matrix; the key problem of this alternative is to determine the value of this ridge parameter. A correlation matrix helps to choose the predictor variables that should be used for the regression.
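The eigenvalue diagnostic and the ridge estimator can be sketched together. The sand-silt pair below is synthetic and deliberately near-collinear; the ridge constant K = 0.1 is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
sand = rng.uniform(10, 80, n)
silt = 70 - 0.8 * sand + rng.normal(0, 2, n)   # nearly linearly dependent on sand
Xs = np.column_stack([sand, silt])
Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0, ddof=1)

# Small eigenvalues of Xs^T Xs reveal multicollinearity; a large ratio of
# largest to smallest eigenvalue (condition number) flags the problem.
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)
condition_number = eigvals.max() / eigvals.min()

# Ridge regression: add a small constant K to the diagonal of Xs^T Xs
y = Xs @ np.array([0.3, -0.1]) + rng.normal(0, 0.05, n)  # synthetic response
K = 0.1
beta_ridge = np.linalg.solve(Xs.T @ Xs + K * np.eye(2), Xs.T @ y)
print(condition_number)
```

In practice K is often chosen by inspecting a ridge trace, i.e., plotting the coefficients against a range of K values.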
Table 2 shows the correlation matrix of the Van Genuchten parameters (response variables), with parameter m = 1, and the textural information, bulk density and carbon content (predictor variables). A significant correlation between the response variables ln(α) and ln(n) can be found. This can be explained by the fact that soils having a clearly defined air entry value, like coarsely textured soils, and thus a relatively large ln(α), are in general characterized by a narrow pore size distribution, represented by a large ln(n) value; the opposite is true for the finer textured soils. Among the predictor variables a clear correlation exists, e.g., between the silt and the sand content, indicating that these parameters should not be combined in a regression equation, whereas the high correlation coefficient between θ_s and bulk density supports the use of bulk density for the prediction of θ_s.

3.1. Model fit

Evaluation of the goodness-of-fit of a model is based on the partitioning of the total sum of squares (TSS), representing the total variability in the database, expressed in an analysis of variance table. The TSS can be partitioned into three components:

TSS = SSM + SSR + SSE   (14)

where SSR is the sum of squares attributable to the model and TSS is equal to:

TSS = Y^T Y   (15)

where Y is the column vector representing the dependent variable and Y^T its transpose. SSM is the sum of squares attributable to the mean:

SSM = n^{−1} (I^T Y)²   (16)

where I^T is the transposed unit column vector. SSE, the residual sum of squares, is written as:

SSE = e^T e   (17)

where e is the error vector and e^T its transpose. In parallel with the TSS, the total degrees of freedom (d_f) can be partitioned: SSM has 1 d_f, SSR has p d_f, SSE has n − p − 1 d_f and TSS has n d_f, where p is the number of predictor variables. Most analysis packages give the corrected TSS in the form of an ANOVA table, expressed as:

CSS = TSS − SSM   (18)

Dividing each of the right-hand side terms by its degrees of freedom yields the respective mean squares. Ratios of mean squares, called F-ratios, can be used to test hypotheses regarding the model parameters. An important F-test is:

F_r = MSR / MSE   (19)

F_r is used to check whether the model is capable of explaining the variation in the data. The F-test is designed to perform simultaneous hypothesis tests on the model parameters, while the t-test is used for individual hypothesis tests on the parameters.

An important value for evaluating the goodness-of-fit is the coefficient of determination R², a measure of the amount of variability explained by the model. A possible way to calculate it is:

R² = SSR / (SSR + SSE) = SSR / (TSS − SSM)   (20)

According to Kvålseth (1985), there is no unanimous agreement on the mathematical description of R² and at least eight different equations are in use. For some nonlinear models, certain expressions for the coefficient of determination yield values greater than one, as is the case for the equation mentioned above. To account for the number of variables in the model, an adjusted coefficient of determination is used, which can be written as:

R²_adj = 1 − (1 − R²)(n − 1)/(n − m − 1)   (21)

where n is the number of observations and m is the number of variables.
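The sum-of-squares partition and the derived statistics of equations (14) to (21) can be computed directly from a fitted model; a sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 2
# Synthetic data: intercept plus two predictors, small noise
X = np.column_stack([np.ones(n), rng.uniform(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(0, 0.1, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

TSS = y @ y                          # equation (15)
SSM = y.sum() ** 2 / n               # equation (16)
SSE = e @ e                          # equation (17)
SSR = TSS - SSM - SSE                # equation (14), rearranged

R2 = SSR / (SSR + SSE)               # equation (20)
R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)   # equation (21), with m = p
F_r = (SSR / p) / (SSE / (n - p - 1))           # equation (19): MSR / MSE
print(R2, R2_adj, F_r)
```

With the degrees of freedom as given above, F_r follows an F(p, n − p − 1) distribution under the null hypothesis that all regression coefficients are zero.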

3.2. Poor model specification

Two main causes of poor model specification exist: the omission of variables, and a poor formulation of the functional expression of some or all variables in the model. Preliminary indications concerning the latter type of misspecification can be found in scatter plots relating the model variables (response, predictor) to one another (Figure 1). Necessary transformations of variables should be performed at the beginning of the statistical model building, otherwise important predictors can be lost. Once a model is built, misspecification of both types can be deduced from two types of residual plots: plots of residuals versus predictor variables (RVP plots) and partial residual plots (PR plots). Raw, studentized or standardized residuals can be used, but most frequently the plots involve the raw residuals. RVP plots provide information regarding the functional form of the predictor variables and the need for extra variables such as cross products or quadratic terms. PR plots also give information about the correct functional form of the predictors, but in addition assess the nonlinearity in a predictor variable and the importance of a predictor variable in the presence of others. These plots can only be used when the model is linear in the parameters.

A numerical technique to assess model inadequacy is the lack-of-fit test (Draper and Smith, 1966). This test can be performed if repeated measurements are available, which make it possible to calculate the sum of squared errors due to pure error and to partition the total sum of squared errors into two components:

SSE = SSE_p + SSE_l   (22)

where SSE_p is the pure-error component and SSE_l the component due to lack of fit.
The test whether this lack of fit is significant is performed with the ratio:

F = MSE_l / MSE_p   (23)

which is an F-test, with MSE_p equal to:

MSE_p = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} − Ȳ_i)² / (\sum_{i=1}^{k} n_i − k)   (24)

where k is the number of different points in the prediction range where measurements have been made and n_i is the number of repeated observations at a given point of the prediction range. The MSE (mean squared error) is obtained from the regression analysis; the difference between the MSE and the MSE_p gives the MSE due to lack of fit. The ratio given in equation (23) is then compared against the 100(1 − a)% point of an F-distribution. If this ratio is not significant, there is no reason to doubt the adequacy of the model. In the opposite case, there is a considerable bias term, and attempts should be made to discover where and how the inadequacy is generated. Model misspecification due to the

exclusion of important variables may not be detected with this technique, owing to the fact that both MSE_p and MSE_l are biased. To be able to use the lack-of-fit test, repeated measurements must be available. Although real repeated measurements (identical sets of predictor values) are often not available, the pure-error component can be estimated by treating the hydraulic parameters for the same site and horizon as repeated measurements.

3.3. Confidence intervals on estimated soil property values

Two types of intervals related to the response variables can be distinguished: (1) the confidence interval on the expected responses; and (2) the prediction interval for future responses. For the first case, the confidence interval can be written as:

Ŷ ± t_{v, 1−a/2} \sqrt{s u_o^T S u_o}   (25)

where u_o is the column vector for a specific set of values of the predictor variables, S the variance-covariance matrix, s the mean squared residual, v the degrees of freedom (n − p − 1) and 1 − a/2 the confidence level, with a being the level of significance. The limits of the prediction interval for a future observation are:

Ŷ ± t_{v, 1−a/2} \sqrt{s (1 + u_o^T S u_o)}   (26)

3.4. Outlier detection

The purpose of outlier detection is to identify observations with extremely large residuals, which do not fit the pattern of the remaining data. Even with this definition, the identification of outliers remains a subjective and risky business and requires careful examination. To detect outliers, the majority of plots and tests make use of studentized residuals. These studentized residuals behave more like standard normal deviates than either raw or standardized residuals, because they are divided by their standard error. Even so, raw and standardized residuals are still frequently used; it is best to combine all three types of residuals in an analysis.
Commonly used graphical methods are scatter plots of the original response variables against the predictor variables, relative frequency histograms of studentized residuals, plots of residuals against predictors or fitted responses, and plots of residuals against deleted residuals detected as outliers. A deleted residual is defined as:

r_{(−i)} = Y_i − u_i^T b̂_{(−i)}   (27)

where u_i^T is the ith row of the data matrix and b̂_{(−i)} the estimated regression coefficient vector for n − 1 observations, with the ith case removed. The distribution of these residuals is that of a Student t with n − p − 2 degrees of freedom; their values can be obtained from a transformation of the raw residuals.

Most of the statistical measures developed to detect outliers are based on evaluating the effect of the deletion of an observation on the estimated regression coefficient vector. These statistics measure the distance between the regression coefficient vector estimated from the full observation set and the coefficient vector with one observation deleted. The concept is based on the multivariate confidence region for the regression coefficient vector, written as:

(b̂ − b)^T X^T X (b̂ − b) / ((p + 1) MSE) ≤ F_{1−a}   with (p + 1, n − p − 1) degrees of freedom   (28)

A measure of the distance or closeness between b̂ and b̂_{(−i)} is:

D_i = (b̂ − b̂_{(−i)})^T X^T X (b̂ − b̂_{(−i)}) / ((p + 1) MSE)   (29)

This equation defines the size of the confidence region containing b̂_{(−i)} by finding the F_{1−a}(p + 1, n − p − 1) value corresponding to the D_i value and evaluating the (1 − a) confidence level. In practice, b̂_{(−i)} should stay close to b̂; if it does not, the observation should be rejected. Different simplifications of the previous equation have been introduced:

D_i = h_{ii} r²_{(−i)} / ((p + 1) MSE)   or   D_i = (t_i² / (p + 1)) (h_{ii} / (1 − h_{ii}))   (30)

with h_{ii} the diagonal elements of X(X^T X)^{−1} X^T, measuring the influence of each observation, t_i the ith studentized residual and r_{(−i)} the deleted residual. Separate examination of t_i, h_{ii}/(1 − h_{ii}) and D_i is useful in deciding whether or not an observation is to be deleted. The ratio h_{ii}/(1 − h_{ii}) denotes the influence that the associated response Y_i has on the determination of b̂. Even with all these techniques and graphs, one has to be careful in deleting observations, because this often amounts to forcing a linear relationship through data exhibiting nonlinear behavior or interaction effects. This is especially true when not enough data are available over the complete range of interest; using forced equations for prediction purposes can lead to serious mispredictions.
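The quantities in equation (30), often referred to as Cook's distance, can be computed from the hat matrix; a sketch with a synthetic data set in which one gross outlier has been planted:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 1
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.2, n)  # synthetic linear relation
y[0] += 3.0                                # plant one gross outlier

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
MSE = e @ e / (n - p - 1)

# Leverages h_ii: diagonal of the hat matrix X (X^T X)^-1 X^T
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

t = e / np.sqrt(MSE * (1 - h))             # studentized residuals
D = t**2 / (p + 1) * h / (1 - h)           # equation (30), second form
print(np.argmax(D))  # index of the most influential observation
```

Inspecting t_i, h_ii/(1 − h_ii) and D_i separately, as recommended above, guards against deleting high-leverage but well-fitting observations.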
4. VALIDATION OF REGRESSION MODELS

The validation of PTFs developed by means of regression analysis is frequently overlooked. This is especially the case when variable selection techniques have been used to construct the model. These techniques display a penchant for capitalizing on chance variation in specific data sets (Green and Caroll, 1978), introducing too many independent variables. The absence of validation is mainly due to the lack of additional data to perform the validation on. Basically, regression models can be validated in two ways: the model can be evaluated statistically, e.g., via the estimated regression coefficients or by cross-validation; in addition, models can be evaluated practically with respect to the further use of the

estimated responses as input in other models. Only the statistical validation will be considered here.

A commonly applied method is the double cross-validation designed by Green and Caroll (1978). Its advantages are that no additional data are needed and its simplicity. The method is designed to evaluate the stability of the estimated regression coefficients and the predictive power of the equation. In a first step, the complete set of observations is randomly split into two halves. On each half, a regression analysis is performed using the variables retained for the complete set of observations, as found in the modeling procedure. Then the regression model obtained from the first half is applied to the second half, and vice versa. For each case, the simple correlation between observed and estimated values is calculated.

Table 3 gives the results of a stepwise regression analysis to predict θ_s from bulk density and clay content (Vereecken et al., 1989). This regression model is adequate, because the F-value of the lack-of-fit test is clearly smaller than the critical F-value at 95%. The double cross-validation procedure shows a maximum coefficient of determination of 87%.

Table 3. Results of the stepwise regression analysis and cross-validation for the water content at saturation θ_s (response variable). Columns: retained variables and estimated regression coefficients at a = 5%, partial R²_adj, model R²_adj, F-value for lack of fit, critical F-value at 95%. BD is the dry bulk density (g cm⁻³), Clay the clay content (%) and (+) denotes double cross-validation.

A procedure comparable to the technique described above is the jackknife method (Pachepsky and Rawls, 2003), for which a secondary data set is necessary. It is used to validate the model developed from the prediction data set by predicting the values of the independent data set.
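The double cross-validation procedure can be sketched as follows; the θ_s data and regression coefficients below are synthetic stand-ins for the Vereecken et al. (1989) model, invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 182
# Synthetic stand-in for the theta_s ~ bulk density + clay relation
bd = rng.uniform(1.1, 1.7, n)
clay = rng.uniform(0, 55, n)
theta_s = 0.95 - 0.35 * bd + 0.001 * clay + rng.normal(0, 0.02, n)

def fit(i):
    X = np.column_stack([np.ones(len(i)), bd[i], clay[i]])
    coef, *_ = np.linalg.lstsq(X, theta_s[i], rcond=None)
    return coef

def predict(coef, i):
    return np.column_stack([np.ones(len(i)), bd[i], clay[i]]) @ coef

idx = rng.permutation(n)
half1, half2 = np.array_split(idx, 2)

# Fit on one half, predict the other, and vice versa
for train, test in [(half1, half2), (half2, half1)]:
    pred = predict(fit(train), test)
    r = np.corrcoef(pred, theta_s[test])[0, 1]
    rmse = np.sqrt(np.mean((pred - theta_s[test]) ** 2))
    print(round(r, 3), round(rmse, 4))
```

Comparing the two correlation coefficients, and the two sets of fitted coefficients, indicates the stability of the regression across the random split.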
For both cross-validation approaches, the mean absolute error, mean square error, root mean square error, and graphical plots of predicted versus measured values are used to assess the quality of the predictions (Müller et al., 2001).

5. SUMMARY

We described the use of statistical regression techniques to develop pedotransfer functions (PTFs) that estimate hydraulic properties from basic soil properties. In this chapter, PTFs are considered as regression models with soil data as predictor variables and hydraulic properties as response variables. Three basic steps, usually applied in an iterative procedure, are presented: analysis of the soil data, the model-building step and the

model validation. Different methods for analyzing the soil data are presented. They range from simple scatter plots, which provide, e.g., information on the type of relationship between variables and the existence of outliers, to multivariate statistical analyses that allow the soil data to be examined in a holistic manner. Analysis of the statistical distribution of predictor and response variables may provide important information for the model-building step. Particularly useful is the application of principal component analysis to examine the linear dependence between variables. Because of the inherent correlation between predictor variables used in PTFs, care needs to be taken to avoid the problem of multicollinearity. It can be avoided by transforming the predictor variables into independent variables using principal component analysis.

The second step in developing a PTF is the building of the regression model. Available methods include, e.g., backward and forward regression techniques and stepwise and stagewise methods. Once a first acceptable model is identified, six basic topics need to be checked: verification of the error assumptions, goodness of fit of the model, identification of model misspecification, examination of confidence intervals on the estimated regression coefficients, examination of confidence intervals on the response variables, and finally outlier detection. The model-building step is an iterative procedure that may require many iterations to find the best PTF model.

The last step is the validation of the regression model or PTF. This step is often overlooked, but it is essential in establishing confidence in the developed model. Two basic but completely different methods are available: functional validation and statistical validation. Functional validation examines the variability in the outcome of a simulation model (e.g., of the water balance or solute transport in soils) for a specific application caused by the uncertainty in the PTF.
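The PCA-based remedy for multicollinearity mentioned in the summary can be illustrated concretely. The sand/clay values below are synthetic and hypothetical; the point is that rotating correlated predictors onto their principal components yields mutually uncorrelated regression inputs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated predictors (e.g., clay and sand content, which
# are inherently negatively correlated); illustrative values only.
n = 150
clay = rng.uniform(5, 45, n)
sand = 90 - 1.2 * clay + rng.normal(0, 4, n)
X = np.column_stack([clay, sand])

# Centre and scale the predictors, then rotate onto the principal axes:
# the columns of Z are mutually uncorrelated, so a regression on Z is
# free of multicollinearity.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # components by decreasing variance
Z = Xc @ eigvecs[:, order]

print(np.corrcoef(Xc, rowvar=False)[0, 1])  # strong negative correlation
print(np.corrcoef(Z, rowvar=False)[0, 1])   # numerically zero
```

Regressing the response on the columns of Z (principal component regression) removes the coefficient instability caused by the original correlation, at the cost of coefficients that must be back-rotated for interpretation in terms of the original soil variables.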
In this chapter, we focused on statistical validation, which checks the prediction level (e.g., the coefficient of determination) and the stability of the estimated regression coefficients using the double cross-validation technique.

REFERENCES

Cosby, B.J., Hornberger, G.M., Clapp, R.B., Ginn, T.R., 1984. A statistical exploration of the relationships of soil moisture characteristics to the physical properties of soil. Water Resour. Res. 20.
Davis, J.C. Statistics and Data Analysis in Geology. John Wiley and Sons, New York.
Draper, N.R., Smith, H., 1966. Applied Regression Analysis. John Wiley, New York.
Freund, R.J., Littell, R. SAS System for Linear Models. SAS Institute Inc., Cary, NC.
Green, P.E., Caroll, J.D., 1978. Analyzing Multivariate Data. John Wiley, New York.
Gunst, F.R., Mason, R.L. Regression Analysis and Its Applications: A Data Oriented Approach. Marcel Dekker Inc., New York.
Herbst, M., Diekkrüger, B. The influence of the spatial structure of soil properties on water balance modeling in a microscale catchment. Physics and Chemistry of the Earth, Part B 27.
Jollife, I.T. Principal Component Analysis. Springer Verlag, New York.
Kvålseth, T.O., 1985. Cautionary note about R². The American Statistician 39.
Müller, T.G., Pierce, F.J., Schabenberger, O., Warncke, D.D., 2001. Map quality for site-specific management. Soil Sci. Soc. Am. J. 65.
Pachepsky, Y., Rawls, W.J., 2003. Soil structure and pedotransfer functions. Eur. J. Soil Sci. 54.
Pachepsky, Y., Shcherbakov, R.A., Varallyay, G., Raijkai, K., 1982. Statistical analysis of water retention relations with other physical properties of soils. Pochvovedenie 2, 42-52 (in Russian, English abstract).
Puckett, W.E., Dane, J.H., Hajek, B.F., 1985. Physical and mineralogical data to determine soil hydraulic properties. Soil Sci. Soc. Am. J. 49.
Rawls, W.J., Brakensiek, D.L., 1985. Prediction of soil water properties for hydrologic modelling. In: Jones, E.B., Ward, T.J. (Eds.), Proceedings of the Symposium on Watershed Management in the Eighties, April 30-May 1, 1985, Denver, CO. Am. Soc. Civil Engng, New York, NY.
Rawls, W.J., Brakensiek, D.L., 1989. Estimation of soil water retention and hydraulic properties. In: Morel-Seytoux, H.J. (Ed.), Unsaturated Flow in Hydrological Modeling: Theory and Practice. Kluwer Academic Publishers, Dordrecht.
Scheinost, A.C., Sinowski, W., Auerswald, K., 1997. Regionalization of soil water retention curves in a highly variable soilscape. I. Developing a new pedotransfer function. Geoderma 78.
Van Genuchten, M.T., 1980. A closed-form equation for predicting the hydraulic conductivity of unsaturated soils. Soil Sci. Soc. Am. J. 44.
Vereecken, H. Pedotransfer functions for the generation of hydraulic properties for Belgian soils. Doctoral thesis No. 171, Faculty of Agricultural Sciences, Katholieke Universiteit Leuven.
Vereecken, H., Feyen, J., Maes, J., Darius, P., 1989. Estimating the soil moisture retention characteristic from texture, bulk density and carbon content. Soil Sci. 148.
Vereecken, H., Maes, J., Feyen, J., 1990. Estimating unsaturated hydraulic conductivity from easily measured soil properties. Soil Sci. 149.
Verheye, W., Ameryckx, J. Mineral fractions and classification of soil texture. Pedologie 2.
Wösten, J.H.M., 1997. Pedotransfer functions to evaluate soil quality. In: Gregorich, E.G., Carter, M.R. (Eds.), Soil Quality for Crop Production and Ecosystem Health. Developments in Soil Science, Vol. 25. Elsevier, Amsterdam.
Wösten, J.H.M., Lilly, A., Nemes, A., Le Bas, C., 1999. Development and use of a database of hydraulic properties of European soils. Geoderma 90.


More information

Applied Regression Analysis

Applied Regression Analysis Applied Regression Analysis Chapter 3 Multiple Linear Regression Hongcheng Li April, 6, 2013 Recall simple linear regression 1 Recall simple linear regression 2 Parameter Estimation 3 Interpretations of

More information

Tutorial 5: Power and Sample Size for One-way Analysis of Variance (ANOVA) with Equal Variances Across Groups. Acknowledgements:

Tutorial 5: Power and Sample Size for One-way Analysis of Variance (ANOVA) with Equal Variances Across Groups. Acknowledgements: Tutorial 5: Power and Sample Size for One-way Analysis of Variance (ANOVA) with Equal Variances Across Groups Anna E. Barón, Keith E. Muller, Sarah M. Kreidler, and Deborah H. Glueck Acknowledgements:

More information

We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model.

We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model. Statistical Methods in Business Lecture 5. Linear Regression We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model.

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

Investigating Models with Two or Three Categories

Investigating Models with Two or Three Categories Ronald H. Heck and Lynn N. Tabata 1 Investigating Models with Two or Three Categories For the past few weeks we have been working with discriminant analysis. Let s now see what the same sort of model might

More information

Correlation and Simple Linear Regression

Correlation and Simple Linear Regression Correlation and Simple Linear Regression Sasivimol Rattanasiri, Ph.D Section for Clinical Epidemiology and Biostatistics Ramathibodi Hospital, Mahidol University E-mail: sasivimol.rat@mahidol.ac.th 1 Outline

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Ch. 1: Data and Distributions

Ch. 1: Data and Distributions Ch. 1: Data and Distributions Populations vs. Samples How to graphically display data Histograms, dot plots, stem plots, etc Helps to show how samples are distributed Distributions of both continuous and

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

STAT 501 EXAM I NAME Spring 1999

STAT 501 EXAM I NAME Spring 1999 STAT 501 EXAM I NAME Spring 1999 Instructions: You may use only your calculator and the attached tables and formula sheet. You can detach the tables and formula sheet from the rest of this exam. Show your

More information

NAG C Library Chapter Introduction. g01 Simple Calculations on Statistical Data

NAG C Library Chapter Introduction. g01 Simple Calculations on Statistical Data NAG C Library Chapter Introduction g01 Simple Calculations on Statistical Data Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Summary Statistics... 2 2.2 Statistical Distribution

More information

Supplementary material (Additional file 1)

Supplementary material (Additional file 1) Supplementary material (Additional file 1) Contents I. Bias-Variance Decomposition II. Supplementary Figures: Simulation model 2 and real data sets III. Supplementary Figures: Simulation model 1 IV. Molecule

More information

Revised: 2/19/09 Unit 1 Pre-Algebra Concepts and Operations Review

Revised: 2/19/09 Unit 1 Pre-Algebra Concepts and Operations Review 2/19/09 Unit 1 Pre-Algebra Concepts and Operations Review 1. How do algebraic concepts represent real-life situations? 2. Why are algebraic expressions and equations useful? 2. Operations on rational numbers

More information

Chapter 4. Regression Models. Learning Objectives

Chapter 4. Regression Models. Learning Objectives Chapter 4 Regression Models To accompany Quantitative Analysis for Management, Eleventh Edition, by Render, Stair, and Hanna Power Point slides created by Brian Peterson Learning Objectives After completing

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS NOTES FROM PRE- LECTURE RECORDING ON PCA PCA and EFA have similar goals. They are substantially different in important ways. The goal

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

Linear Regression for Air Pollution Data

Linear Regression for Air Pollution Data UNIVERSITY OF TEXAS AT SAN ANTONIO Linear Regression for Air Pollution Data Liang Jing April 2008 1 1 GOAL The increasing health problems caused by traffic-related air pollution have caught more and more

More information

Chapter 13. Multiple Regression and Model Building

Chapter 13. Multiple Regression and Model Building Chapter 13 Multiple Regression and Model Building Multiple Regression Models The General Multiple Regression Model y x x x 0 1 1 2 2... k k y is the dependent variable x, x,..., x 1 2 k the model are the

More information