Chapter 1. STATISTICAL REGRESSION

H. Vereecken* and M. Herbst

Institut Agrosphäre, ICG-IV, Forschungszentrum Jülich GmbH, Leo-Brandt-Straße, Jülich, Germany

*Corresponding author

Many of the available and well-established PTFs for the prediction of soil hydraulic properties from continuous soil properties are based on statistical regressions (Pachepsky et al., 1982; Cosby et al., 1984; Rawls and Brakensiek, 1985, 1989; Puckett et al., 1985; Vereecken et al., 1989, 1990; Wösten et al., 1997, 1999; Scheinost et al., 1997). Statistical regression is concerned with the analysis and construction of dependence structures between dependent (response) variables, such as the parameters describing the moisture retention curve, and independent (predictor) variables, e.g., bulk density or textural information. Depending upon the objectives, the regression process will differ, but a general modeling approach can be constructed, as proposed by Draper and Smith (1966). They distinguish three phases. The first phase is the planning stage: the problem is defined, the objectives are specified, a priori knowledge is screened and existing data are gathered; it includes a preliminary data analysis. The second phase is the genuine model building, in which the regression models are developed and tested against the objectives in an iterative way. The third phase is the validation of the obtained models, including the stability of the parameters, prediction over the sample space and evaluation of model adequacy.

1. OBJECTIVES OF STATISTICAL REGRESSIONS

In general, three main objectives can be distinguished when using statistical regressions to model relations between two sets of variables: (1) prediction; (2) model specification; (3) parameter estimation. In using regression analysis for prediction purposes, the concern is mainly to obtain the best possible estimate of the response variable.
DEVELOPMENTS IN SOIL SCIENCE, VOLUME 30. © 2004 Elsevier B.V. All rights reserved.

Correct model specification and parameter accuracy are then of secondary importance. In model specification, one is mainly interested in the relative importance of individual predictor variables for the predicted responses. This implies that all relevant variables should be available in the database and that the

model contains the correct functional form of the predictor variables. Only then can the predictor variables be correctly assessed (Gunst and Mason, 1980). Applying regression methodology to estimate parameters requires that the model is correctly specified, that the predictions are accurate and that the data allow a good estimation. Limitations of the database and the inability to measure all relevant predictors constrain the estimation of the parameters.

The choice of the objective criterion or criteria to evaluate the developed model is determined by the objectives: the objective function (criterion) should be the quantitative expression of the modeling objective. A widely accepted objective criterion is the coefficient of determination (r²), which evaluates the ability of the model to explain the variation in the data.

Most of the statistically based PTFs are either multiple linear regression equations or polynomials of nth order. Multiple linear regression is a common statistical tool used for the prediction of the response variable y from a number of n predictor variables x_i. A multiple linear regression equation can be written as (Herbst and Diekkrüger, 2002):

y = a + \sum_{i=1}^{n} b_i x_i + \varepsilon,   i = 1, ..., n   (1)

with the constant a (intercept), the regression coefficients b_i and the error \varepsilon. A nonlinear regression equation based on a second-order polynomial has the following form:

y = a + \sum_{i=1}^{n} (b_i x_i + c_i x_i²) + \varepsilon,   i = 1, ..., n   (2)

where, besides the intercept a, two regression coefficients b_i and c_i have to be determined for every predictor variable x_i (Rawls and Brakensiek, 1985).
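Both equations (1) and (2) are linear in their coefficients and can be fitted by ordinary least squares. A minimal sketch in Python; the predictors, response and coefficient values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical predictors: sand content (%) and bulk density (g cm^-3);
# the response and all coefficients are invented for illustration.
sand = rng.uniform(5, 90, 100)
bd = rng.uniform(1.1, 1.7, 100)
y = 0.8 - 0.002 * sand - 0.25 * bd + rng.normal(0, 0.01, 100)

# Equation (1): y = a + sum_i b_i x_i + eps, fitted by ordinary least squares
X1 = np.column_stack([np.ones_like(sand), sand, bd])
coef1, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Equation (2): the second-order polynomial adds a squared term per predictor
X2 = np.column_stack([np.ones_like(sand), sand, sand**2, bd, bd**2])
coef2, *_ = np.linalg.lstsq(X2, y, rcond=None)

print(coef1)  # approximately [0.8, -0.002, -0.25]
```

Note that the polynomial model of equation (2) is still fitted with linear least squares, since it is linear in a, b_i and c_i.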
2. PRELIMINARY ANALYSIS OF SOIL DATA

Different techniques are available to analyze the available soil data, ranging from simple descriptive statistics (first and second moments of the distribution, range) over graphical techniques (scatter plots) to multivariate statistical methods.

2.1. Simple data analysis

Scatter plots, plotting different soil properties of interest against selected response variables, are extremely useful in detecting trends and extreme measurements. The latter can also be done by means of biplots. One should be cautious, however, about deleting so-called outliers at this stage, except when it is clear that these observations are incorrectly specified or measured. Scatter plots give information on the linear or nonlinear behavior of variables and on the kind of transformation to be performed to eliminate nonlinearity. Transformations of the response variable are to be considered when different possible predictor variables show the same nonlinear behavior with respect to the response.

Figure 1. Scatter plots of (a) the saturated water content versus bulk density and (b) the log-transformed van Genuchten parameter α versus sand content, from the data set used by Vereecken et al. (1989).

Figure 1 gives two examples of scatter plots (Vereecken, 1988). Figure 1a reveals the linear relationship between the saturated water content θ_s (response) and bulk density (predictor), while Figure 1b exhibits the positive correlation between the sand content and the log-transformed van Genuchten α (Van Genuchten, 1980). With statistical inference and hypothesis testing in mind, it is interesting to examine the distributions of the response variables. Examination of the sample distributions gives information about transformations to obtain distributions closer to the normal distribution, which is a precondition for the statistical regression techniques explained in Section 3.

Frequently used numerical tests to evaluate the normality of a distribution are the Kolmogorov-Smirnov and Shapiro-Wilk statistics. The Kolmogorov-Smirnov statistic is usually used for data sets with more than 50 observations, the Shapiro-Wilk statistic otherwise. The one-sample Kolmogorov-Smirnov test calculates the D-value, the maximum absolute difference between the cumulative sample distribution and the cumulative distribution of a normal population. A two-sample Kolmogorov-Smirnov test is used to test whether two samples are drawn from the same population. Rather than measuring differences in the means and variances of the populations, the Kolmogorov-Smirnov statistic measures differences in shape. The test statistic is a function of the sample size and can be either one- or two-tailed, testing the null hypothesis that the sample data are random samples from a normal distribution. Critical values for a specific level of significance can be looked up in tables in order to decide whether or not the null hypothesis is to be rejected.
For sample sets with fewer than 50 observations the Shapiro-Wilk statistic should be applied. This statistic is the ratio of the best estimator of the variance to the usual corrected sum-of-squares estimator of the variance. The value of W ranges from 0 to 1, with small values leading to rejection of the null hypothesis that the sample is drawn from a normally distributed population.

Information about the shape of a distribution can be obtained from its third and fourth moments. The third moment measures the skewness S of the distribution:

S = [n / ((n − 1)(n − 2))] \sum_{i=1}^{n} (X_i − X̄)³ / s³   (3)

where s is the standard deviation, X_i is the observation value, X̄ is the mean value and n is the number of observations. A population is said to be positively or right skewed if the tail extends toward the largest values; a negatively or left skewed population has the tail toward the smallest values (Figure 2). For normally distributed data, the skewness equals zero.

Figure 2. Histogram (bars) and cumulative relative frequency (solid line) of van Genuchten's α (cm⁻¹) and the log-transformed α from the data set used by Vereecken et al. (1989), n = 182 observations.

The fourth moment of the distribution measures the heaviness of the tails. A heavily tailed distribution has a positive kurtosis; a flat distribution with short tails has a negative kurtosis:

K = [n(n + 1) / ((n − 1)(n − 2)(n − 3))] \sum_{i=1}^{n} (X_i − X̄)⁴ / s⁴ − 3(n − 1)² / ((n − 2)(n − 3))   (4)

The measure of kurtosis K is zero for a normally distributed population.

An alternative to the numerical tests concerning the distribution of response variables is the use of graphical aids. Basically there are three types of data plots: (1) the stem and leaf plot; (2) the box or schematic plot; (3) the normal probability plot.
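Equations (3) and (4) translate directly into code; a sketch with synthetic data:

```python
import numpy as np

def skewness(x):
    # Equation (3): sample skewness S
    n = len(x)
    s = x.std(ddof=1)
    return n / ((n - 1) * (n - 2)) * np.sum(((x - x.mean()) / s) ** 3)

def kurtosis(x):
    # Equation (4): sample kurtosis K (zero for a normal population)
    n = len(x)
    s = x.std(ddof=1)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    return lead * np.sum(((x - x.mean()) / s) ** 4) - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

rng = np.random.default_rng(2)
normal = rng.normal(size=5000)
skewed = rng.lognormal(size=5000)          # long right tail
print(skewness(normal), skewness(skewed))  # near zero vs. clearly positive
```

These are the same bias-corrected sample estimators used by most statistical packages.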

They often give a better overall view of the extent of non-normal behavior of the response variables; the examination of the distribution of variables is mainly based on the analysis of such plots.

The example in Table 1 reveals the strong increase in the W-value for the log-transformed α and n compared to the original variables (see also Figure 2).

Table 1. The W-value, skewness and kurtosis of the parameters θ_r, θ_s, α, n, ln(α) and ln(n) of the moisture retention characteristic of the data set used by Vereecken et al. (1989). Soil hydraulic properties are based on the Van Genuchten model (θ − θ_r)/(θ_s − θ_r) = (1 + |αh|^n)^(−m), with residual water content θ_r, saturated water content θ_s, inverse of the bubbling pressure α, shape parameters n and m, and pressure head h (Van Genuchten, 1980), with m = 1; N = 182 observations.

The distribution of θ_r resembles the normal distribution to the least extent, having the lowest W-value and high values of skewness and kurtosis. Various transformations, however, did not result in a much higher W-value. The high probabilities of rejecting the null hypothesis are likely a result of the sensitivity of the Shapiro-Wilk test to observations in the tails of the distributions.

2.2. Multivariate methods

Principal component analysis is a powerful tool for analyzing the structure of data matrices and the interdependence between variables. Its success depends upon the existence of correlations among at least some of the original variables. New variables, called principal components, are formed such that they are orthogonal to each other and uncorrelated. The first principal component explains the largest part of the variation; the second explains part of the remaining variation, and so on until all variation has been accounted for. In total, the number of components equals the number of variables.
Principal component analysis is mainly applied to the standardized data matrix X_s or to the correlation matrix R, which can be written as:

X_d = X − I X_m^T
X_s = X_d D^{−1/2}
C = X_d^T X_d / (N − 1)
R = D^{−1/2} C D^{−1/2}   (5)

where X is the data matrix, D is the diagonal matrix of the variances, I is the unit column vector, X_m^T is the row vector of the means of the different variables, X_d is the matrix of mean-corrected scores, C is the variance-covariance matrix and N is the number of observations. The choice between the X_s and R matrix depends upon the units the variables are measured in. Typically, when the variables have been measured in the same units, the variance-covariance matrix is preferred. Even then, some authors (Jollife, 1986) prefer working with the correlation matrix because it allows direct comparison between analysis results obtained from different sets of variables. Davis (1973), however, states that in geologic studies on the granulometric composition of materials, where the relative magnitudes of the variables are important, it is better to use the variance-covariance matrix.

Derivation of the principal components is based on the solution of the eigenvalue problem of the correlation matrix or the X_s matrix, subject to different constraints. For the correlation matrix, this can be expressed mathematically as:

(R − λ_j I) X_j = 0   (6)

with |R − λ_j I| = 0 for non-zero X_j vectors and X_j^T X_j = 1. In matrix notation this becomes:

(R − D_e I) U = 0   (7)

where the columns of U contain the eigenvectors X_j of R and D_e is a diagonal matrix containing the eigenvalues of R, which are equal to the variances of the respective principal components, such that:

R = U D_e U^{−1}   (8)

Because R is a symmetric matrix, U^T = U^{−1} and the previous equation becomes:

R = U D_e U^T   (9)

The orientation of the different principal component axes can be found from the columns of the rotation matrix U.
The non-standardized principal component scores Z are calculated as:

Z = X_s U   (10)

Standardizing these scores, so that each component explains the same amount of variability, is done as follows:

Z_s = X_s U D_e^{−1/2}   (11)

Principal component analysis resulting in standardized scores is sometimes given the name principal factor analysis. The component loading matrix, containing the correlation

of the original variables with the principal components, is calculated as:

F = U D_e^{1/2}   (12)

Reproduction of the original matrix of standardized scores can be obtained from:

X_s = Z_s F^T = Z_s (Z_s^T X_s)   (13)

Using equation (13), it is possible to reconstruct the original variables in a plot whose x- and y-axes are given by the first and second principal factors and the factor scores. The length of a reconstructed original variable is a measure of the success of its reconstruction, and the position and length of the reconstructed original variables are a measure of their correlation. These graphs, called biplots, enable the user to analyze the relations between the objects, described in the axes generated by the principal components, and the original variables.

Figure 3 is a biplot of the principal component analysis carried out for the data set of Vereecken et al. (1989). This plot of the reconstructed original variables and the individual observations on the first two principal factors exhibits, e.g., that the vectors representing the clay percentage and θ_r are very close to one another, confirming their positive correlation (see also Table 2). The values of ln(α), ln(n) and the sand fraction are positively correlated, indicating that the larger the amount of sand in the soil, the higher ln(α) and ln(n) become. The values of θ_s and the bulk density, pointing in opposite directions but lying on the same line, are strongly negatively correlated. The carbon content does not seem to be strongly correlated with any of the parameters. The first two principal factors together explain 64% of the variability.

3. MODEL BUILDING

The regression models are constructed in an iterative way. The most important step is the selection of the variables to enter the equation. This can be done either using a priori knowledge and hypothetical reasoning, or by trial and error using special regression techniques.
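The chain from equation (5) to equation (10) can be sketched as an eigendecomposition of the correlation matrix. The soil data below are synthetic, with correlations built in so that the first component dominates:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 182
# Hypothetical soil data: clay (%), sand (%) and bulk density, with
# built-in correlations so that the first component dominates
clay = rng.uniform(5, 60, n)
sand = 90 - clay + rng.normal(0, 5, n)
bd = 1.6 - 0.004 * clay + rng.normal(0, 0.05, n)
X = np.column_stack([clay, sand, bd])

# Equation (5): mean-correct, standardize, and form the correlation matrix R
Xd = X - X.mean(axis=0)
Xs = Xd / X.std(axis=0, ddof=1)
R = Xs.T @ Xs / (n - 1)

# Equations (6)-(9): eigendecomposition R = U D_e U^T
eigvals, U = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # sort components by variance
eigvals, U = eigvals[order], U[:, order]

Z = Xs @ U                                 # equation (10): component scores
explained = eigvals / eigvals.sum()
print(explained)  # fraction of variance per principal component
```

Because R is symmetric, `eigh` returns orthonormal eigenvectors, so U^T = U^{−1} holds as in equation (9).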
Routinely applied techniques to select variables are (Gunst and Mason, 1980): (1) all possible regression equations; (2) backward elimination; (3) forward selection; (4) the stepwise method; (5) the stagewise method. A discussion of each of these techniques can be found in many statistics books. When using these methods to construct regression models, care should be taken not to eliminate potential predictor variables. This is one of the reasons why it is important to check, during the data analysis, the mathematical relationship between the potential predictor variables and the response (e.g., exponential or square-root dependencies).

Once an acceptable model is obtained, it is examined. The general examination of a regression model incorporates the following topics: (1) verification of the error assumptions; (2) assessment of the goodness-of-fit of the equation; (3) examination of model misspecification;

Figure 3. Plot of the reconstructed original variables and the individual observations on the first two principal factors. The length of the reconstructed variables is tripled for clarity. log(α) = log_e(α); log(n) = log_e(n); BD = dry bulk density; C% = percent organic carbon. Data points are represented by the following symbols: U = heavy clay, E = clay, A = clay silt loam, L = sandy silt loam, P = light sand loam, S = loamy sand, Z = sand, according to the Belgian textural classes (Verheye and Ameryckx, 1984).

(4) determination of confidence intervals on the estimated regression coefficients; (5) determination of confidence intervals on the estimated response variables; (6) detection of outliers.

A special problem related to the estimation of parameters and their confidence intervals is multicollinearity. Multicollinearity is the problem of redundant information in the predictor set, and is defined as an approximate linear dependence between predictor variables; an extreme form is exact linear dependence. Multicollinearity can be examined pairwise by means of a correlation analysis, while multivariable collinearities can be detected by examination of the eigenvalues and eigenvectors of the X_s^T X_s matrix (Section 2.2). Small eigenvalues reveal multicollinearity, while the large scores of the variables for the corresponding eigenvectors identify the

Table 2. Correlation matrix of the predictor and response variables (θ_s, θ_r, log(α), log(n), clay, sand, silt, BD, C%) and the corresponding significance levels for the data set used by Vereecken et al. (1989); soil hydraulic properties are based on the Van Genuchten model with m = 1, n = 182 observations. Clay (< 2 μm), sand (> 50 μm), silt (2-50 μm); BD is the dry bulk density (g cm⁻³) and C% the carbon content.

variables involved. Multicollinearity has a deleterious effect on least squares parameter estimation, and regression methodology is very sensitive to this problem. Gunst and Mason (1980) list four aspects of regression analysis that are adversely affected by multicollinearity: (1) the numerical values of the estimated coefficients; (2) the variance-covariance matrix (variance inflation); (3) test statistics; (4) predicted responses.

Different strategies are available to handle multicollinearity. The first is to delete one of the involved parameters; the difficulty is often to decide which parameter to delete, especially when a good prediction is important. Another possibility is to use biased regression estimators, deleting the eigenvectors that define the multicollinearity. Principal component regression is based on the elimination of eigenvectors with small eigenvalues, resulting in a biased estimate. Another type of principal component regression is given by Freund and Littell (1986), where all predictor variables are transformed to principal components; because these are all uncorrelated, there is no need for a variable selection technique. Eigenvalue regression is conceptually the same as the first type of principal component regression, except that the eigenvalues and eigenvectors are extracted differently and the criteria for deleting multicollinear variables differ. Yet another alternative is ridge regression, where the effect of the eigenvectors defining the multicollinearity is strongly reduced by adding a small constant K to the diagonal elements of the X_s^T X_s matrix; the key problem of this alternative is to determine the value of this ridge parameter. A correlation matrix helps to choose the predictor variables that should be used for the regression.
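The eigenvalue diagnostic and the ridge estimator can be sketched together. The sand-silt pair below is synthetic and deliberately near-collinear; the ridge constant K = 0.1 is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
sand = rng.uniform(10, 80, n)
silt = 70 - 0.8 * sand + rng.normal(0, 2, n)   # nearly linearly dependent on sand
Xs = np.column_stack([sand, silt])
Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0, ddof=1)

# Small eigenvalues of Xs^T Xs reveal multicollinearity; a large ratio of
# largest to smallest eigenvalue (condition number) flags the problem.
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)
condition_number = eigvals.max() / eigvals.min()

# Ridge regression: add a small constant K to the diagonal of Xs^T Xs
y = Xs @ np.array([0.3, -0.1]) + rng.normal(0, 0.05, n)  # synthetic response
K = 0.1
beta_ridge = np.linalg.solve(Xs.T @ Xs + K * np.eye(2), Xs.T @ y)
print(condition_number)
```

In practice K is often chosen by inspecting a ridge trace, i.e., plotting the coefficients against a range of K values.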
Table 2 shows the correlation matrix of the Van Genuchten parameters (response variables), with parameter m = 1, and the textural information, bulk density and carbon content (predictor variables). A significant correlation between the response variables ln(α) and ln(n) can be found. This can be explained by the fact that soils having a clearly defined air entry value, like coarsely textured soils, and thus a relatively large ln(α), are in general characterized by a narrow pore size distribution, represented by a large ln(n) value; the opposite is true for the finer textured soils. Among the predictor variables a clear correlation exists, e.g., between the silt and the sand content, indicating that these parameters should not be combined in a regression equation, whereas the high correlation coefficient between θ_s and bulk density supports the use of bulk density for the prediction of θ_s.

3.1. Model fit

Evaluation of the goodness-of-fit of a model is based on the partitioning of the total sum of squares (TSS), representing the total variability in the database, expressed in an analysis of variance table. The TSS can be partitioned into three components:

TSS = SSM + SSR + SSE   (14)

where SSR is the sum of squares attributable to the model and TSS is equal to:

TSS = Y^T Y   (15)

where Y is the column vector representing the dependent variable and Y^T its transpose. SSM is the sum of squares attributable to the mean:

SSM = n^{−1} (I^T Y)²   (16)

where I^T is the transposed unit column vector. SSE, the residual sum of squares, is written as:

SSE = e^T e   (17)

where e is the error vector and e^T its transpose. In parallel with the TSS, the total degrees of freedom (d_f) can be partitioned: SSM has 1 d_f, SSR has p d_f, SSE has n − p − 1 d_f and TSS has n d_f, where p is the number of predictor variables. Most analysis packages give the corrected TSS in the form of an ANOVA table, expressed as:

CSS = TSS − SSM   (18)

Dividing each of the right-hand side terms by its degrees of freedom yields the respective mean squares. Ratios of mean squares, called F-ratios, can be used to test hypotheses regarding the model parameters. An important F-test is:

F_r = MSR / MSE   (19)

F_r is used to check whether the model is capable of explaining the variation in the data. The F-test is designed to perform simultaneous hypothesis tests on the model parameters, while the t-test is used for individual hypothesis tests on the parameters.

An important value for evaluating the goodness-of-fit is the coefficient of determination R², a measure of the amount of variability explained by the model. A possible way to calculate it is:

R² = SSR / (SSR + SSE) = SSR / (TSS − SSM)   (20)

According to Kvålseth (1985), there is no unanimous agreement on the mathematical description of R² and at least eight different equations are in use. For some nonlinear models, certain expressions for the coefficient of determination yield values greater than one, as is the case for the equation mentioned above. To account for the number of variables in the model, an adjusted coefficient of determination is used, which can be written as:

R²_adj = 1 − (1 − R²)(n − 1)/(n − m − 1)   (21)

where n is the number of observations and m is the number of variables.
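The sum-of-squares partition and the derived statistics of equations (14) to (21) can be computed directly from a fitted model; a sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 60, 2
# Synthetic data: intercept plus two predictors, small noise
X = np.column_stack([np.ones(n), rng.uniform(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(0, 0.1, n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta

TSS = y @ y                          # equation (15)
SSM = y.sum() ** 2 / n               # equation (16)
SSE = e @ e                          # equation (17)
SSR = TSS - SSM - SSE                # equation (14), rearranged

R2 = SSR / (SSR + SSE)               # equation (20)
R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)   # equation (21), with m = p
F_r = (SSR / p) / (SSE / (n - p - 1))           # equation (19): MSR / MSE
print(R2, R2_adj, F_r)
```

With the degrees of freedom as given above, F_r follows an F(p, n − p − 1) distribution under the null hypothesis that all regression coefficients are zero.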

3.2. Poor model specification

Two main causes of poor model specification exist: the omission of variables, and a poor formulation of the functional expression of some or all variables in the model. Preliminary indications concerning the latter type of misspecification can be found in scatter plots relating the model variables (response, predictor) to one another (Figure 1). Necessary transformations of variables should be performed at the beginning of the statistical model building, otherwise important predictors can be lost. Once a model is built, misspecification of both types can be deduced from two types of residual plots: plots of residuals versus predictor variables (RVP plots) and partial residual plots (PR plots). Raw, studentized or standardized residuals can be used, but most frequently the plots involve the raw residuals. RVP plots provide information regarding the functional form of the predictor variables and the need for extra variables such as cross products or quadratic terms. PR plots also give information about the correct functional form of the predictors, but in addition assess the nonlinearity in a predictor variable and the importance of a predictor variable in the presence of others. These plots can only be used when the model is linear in the parameters.

A numerical technique to assess model inadequacy is the lack-of-fit test (Draper and Smith, 1966). This test can be performed if repeated measurements are available, which make it possible to calculate the sum of squared errors due to pure error and to partition the total sum of squared errors into two components:

SSE = SSE_p + SSE_l   (22)

where SSE_p is the pure-error component and SSE_l the component due to lack of fit.
The test whether this lack of fit is significant is performed with the ratio:

F = MSE_l / MSE_p   (23)

which is an F-test, with MSE_p equal to:

MSE_p = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} − Ȳ_i)² / (\sum_{i=1}^{k} n_i − k)   (24)

where k is the number of different points in the prediction range where measurements have been made and n_i is the number of repeated observations at a given point of the prediction range. The MSE (mean squared error) is obtained from the regression analysis; the difference between the MSE and the MSE_p gives the MSE due to lack of fit. The ratio given in equation (23) is then compared against the 100(1 − a)% point of an F-distribution. If this ratio is not significant, there is no reason to doubt the adequacy of the model. In the opposite case, there is a considerable bias term, and attempts should be made to discover where and how the inadequacy is generated. Model misspecification due to the

exclusion of important variables may not be detected with this technique, owing to the fact that both MSE_p and MSE_l are biased. To be able to use the lack-of-fit test, repeated measurements must be available. Although real repeated measurements (identical sets of predictor values) are often not available, the pure-error component can be estimated by treating the hydraulic parameters for the same site and horizon as repeated measurements.

3.3. Confidence intervals on estimated soil property values

Two types of intervals related to the response variables can be distinguished: (1) the confidence interval on the expected responses; and (2) the prediction interval for future responses. For the first case, the confidence interval can be written as:

Ŷ ± t_{v, 1−a/2} \sqrt{s u_o^T S u_o}   (25)

where u_o is the column vector for a specific set of values of the predictor variables, S the variance-covariance matrix, s the mean squared residual, v the degrees of freedom (n − p − 1) and 1 − a/2 the confidence level, with a being the level of significance. The limits of the prediction interval for a future observation are:

Ŷ ± t_{v, 1−a/2} \sqrt{s (1 + u_o^T S u_o)}   (26)

3.4. Outlier detection

The purpose of outlier detection is to identify observations with extremely large residuals, which do not fit the pattern of the remaining data. Even with this definition, the identification of outliers remains a subjective and risky business and requires careful examination. To detect outliers, the majority of plots and tests make use of studentized residuals. These studentized residuals behave more like standard normal deviates than either raw or standardized residuals, because they are divided by their standard error. Even so, raw and standardized residuals are still frequently used; it is best to combine all three types of residuals in an analysis.
Commonly used graphical methods are scatter plots of the original response variables against the predictor variables, relative frequency histograms of studentized residuals, plots of residuals against predictors or fitted responses, and plots of residuals against deleted residuals detected as outliers. A deleted residual is defined as:

r_{(−i)} = Y_i − u_i^T b̂_{(−i)}   (27)

where u_i^T is the ith row of the data matrix and b̂_{(−i)} the estimated regression coefficient vector for n − 1 observations, with the ith case removed. The distribution of these residuals is that of a Student t with n − p − 2 degrees of freedom; their values can be obtained from a transformation of the raw residuals.

Most of the statistical measures developed to detect outliers are based on evaluating the effect of the deletion of an observation on the estimated regression coefficient vector. These statistics measure the distance between the regression coefficient vector estimated from the full observation set and the coefficient vector with one observation deleted. The concept is based on the multivariate confidence region for the regression coefficient vector, written as:

(b̂ − b)^T X^T X (b̂ − b) / ((p + 1) MSE) ≤ F_{1−a}   with (p + 1, n − p − 1) degrees of freedom   (28)

A measure of the distance or closeness between b̂ and b̂_{(−i)} is:

D_i = (b̂ − b̂_{(−i)})^T X^T X (b̂ − b̂_{(−i)}) / ((p + 1) MSE)   (29)

This equation defines the size of the confidence region containing b̂_{(−i)} by finding the F_{1−a}(p + 1, n − p − 1) value corresponding to the D_i value and evaluating the (1 − a) confidence level. In practice, b̂_{(−i)} should stay close to b̂; if it does not, the observation should be rejected. Different simplifications of the previous equation have been introduced:

D_i = h_{ii} r²_{(−i)} / ((p + 1) MSE)   or   D_i = (t_i² / (p + 1)) (h_{ii} / (1 − h_{ii}))   (30)

with h_{ii} the diagonal elements of X(X^T X)^{−1} X^T, measuring the influence of each observation, t_i the ith studentized residual and r_{(−i)} the deleted residual. Separate examination of t_i, h_{ii}/(1 − h_{ii}) and D_i is useful in deciding whether or not an observation is to be deleted. The ratio h_{ii}/(1 − h_{ii}) denotes the influence that the associated response Y_i has on the determination of b̂. Even with all these techniques and graphs, one has to be careful in deleting observations, because this often amounts to forcing a linear relationship through data exhibiting nonlinear behavior or interaction effects. This is especially true when not enough data are available over the complete range of interest; using forced equations for prediction purposes can lead to serious mispredictions.
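The quantities in equation (30), often referred to as Cook's distance, can be computed from the hat matrix; a sketch with a synthetic data set in which one gross outlier has been planted:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 50, 1
x = rng.uniform(0, 10, n)
y = 1.0 + 0.5 * x + rng.normal(0, 0.2, n)  # synthetic linear relation
y[0] += 3.0                                # plant one gross outlier

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
MSE = e @ e / (n - p - 1)

# Leverages h_ii: diagonal of the hat matrix X (X^T X)^-1 X^T
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

t = e / np.sqrt(MSE * (1 - h))             # studentized residuals
D = t**2 / (p + 1) * h / (1 - h)           # equation (30), second form
print(np.argmax(D))  # index of the most influential observation
```

Inspecting t_i, h_ii/(1 − h_ii) and D_i separately, as recommended above, guards against deleting high-leverage but well-fitting observations.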
4. VALIDATION OF REGRESSION MODELS

The validation of PTFs developed by means of regression analysis is frequently overlooked. This is especially the case when variable selection techniques have been used to construct the model. These techniques display a penchant for capitalizing on chance variation in specific data sets (Green and Caroll, 1978), introducing too many independent variables. The absence of validation is mainly due to the lack of additional data to perform the validation on. Basically, regression models can be validated in two ways: the model can be evaluated statistically, e.g., via the estimated regression coefficients or by cross-validation; in addition, models can be evaluated practically with respect to the further use of the

estimated responses as input in other models. Only the statistical validation will be considered here.

A commonly applied method is the double cross-validation designed by Green and Caroll (1978). Its advantages are that no additional data are needed and its simplicity. The method is designed to evaluate the stability of the estimated regression coefficients and the predictive power of the equation. In a first step, the complete set of observations is randomly split into two halves. On each half, a regression analysis is performed using the variables retained for the complete set of observations, as found in the modeling procedure. Then the regression model obtained from the first half is applied to the second half, and vice versa. For each case, the simple correlation between observed and estimated values is calculated.

Table 3 gives the results of a stepwise regression analysis to predict θ_s from bulk density and clay content (Vereecken et al., 1989). This regression model is adequate, because the F-value of the lack-of-fit test is clearly smaller than the critical F-value at 95%. The double cross-validation procedure shows a maximum coefficient of determination of 87%.

Table 3. Results of the stepwise regression analysis and cross-validation for the water content at saturation θ_s (response variable). Columns: retained variables and estimated regression coefficients at a = 5%, partial R²_adj, model R²_adj, F-value for lack of fit, critical F-value at 95%. BD is the dry bulk density (g cm⁻³), Clay the clay content (%) and (+) denotes double cross-validation.

A procedure comparable to the technique described above is the jackknife method (Pachepsky and Rawls, 2003), for which a secondary data set is necessary. It is used to validate the model developed from the prediction data set by predicting the values of the independent data set.
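The double cross-validation procedure can be sketched as follows; the θ_s data and regression coefficients below are synthetic stand-ins for the Vereecken et al. (1989) model, invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 182
# Synthetic stand-in for the theta_s ~ bulk density + clay relation
bd = rng.uniform(1.1, 1.7, n)
clay = rng.uniform(0, 55, n)
theta_s = 0.95 - 0.35 * bd + 0.001 * clay + rng.normal(0, 0.02, n)

def fit(i):
    X = np.column_stack([np.ones(len(i)), bd[i], clay[i]])
    coef, *_ = np.linalg.lstsq(X, theta_s[i], rcond=None)
    return coef

def predict(coef, i):
    return np.column_stack([np.ones(len(i)), bd[i], clay[i]]) @ coef

idx = rng.permutation(n)
half1, half2 = np.array_split(idx, 2)

# Fit on one half, predict the other, and vice versa
for train, test in [(half1, half2), (half2, half1)]:
    pred = predict(fit(train), test)
    r = np.corrcoef(pred, theta_s[test])[0, 1]
    rmse = np.sqrt(np.mean((pred - theta_s[test]) ** 2))
    print(round(r, 3), round(rmse, 4))
```

Comparing the two correlation coefficients, and the two sets of fitted coefficients, indicates the stability of the regression across the random split.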
For both cross-validation approaches, the mean absolute error, mean square error, root mean square error, and graphical plots of predicted versus measured values are used to assess the quality of the predictions (Müller et al., 2001).

5. SUMMARY

We described the use of statistical regression techniques to develop pedotransfer functions (PTFs) that estimate hydraulic properties from basic soil properties. In this chapter, PTFs are considered as regression models with soil data as predictor variables and hydraulic properties as response variables. Three basic steps, usually applied in an iterative procedure, are presented: analysis of the soil data, the model-building step and the

model validation. Different methods for analyzing the soil data are presented. They range from simple scatter plots, which provide, e.g., information on the type of relationship between variables and the existence of outliers, to multivariate statistical analyses that allow the soil data to be examined in a holistic manner. Analysis of the statistical distribution of predictor and response variables may provide important information for the model-building step. Particularly useful is the application of principal component analysis to examine the linear dependence between variables. Because of the inherent correlation between predictor variables used in PTFs, care needs to be taken to avoid the problem of multicollinearity. It can be avoided by transforming the predictor variables into independent variables using principal component analysis.

The second step in developing a PTF is the building of the regression model. Available methods include, e.g., backward and forward regression techniques and stepwise and stagewise methods. Once a first acceptable model is identified, six basic topics need to be checked: verification of the error assumptions, goodness of fit of the model, identification of model misspecification, examination of confidence intervals on the estimated regression coefficients, examination of confidence intervals on the response variables, and finally outlier detection. The model-building step is an iterative procedure that may require many iterations to find the best PTF model.

The last step is the validation of the regression model or PTF. This step is often overlooked, but it is essential in establishing confidence in the developed model. Two basic but completely different methods are available: functional validation and statistical validation. Functional validation examines the variability in the outcome of a simulation model (e.g., of the water balance or solute transport in soils) for a specific application caused by the uncertainty in the PTF.
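The PCA-based remedy for multicollinearity mentioned in the summary can be illustrated concretely. The sand/clay values below are synthetic and hypothetical; the point is that rotating correlated predictors onto their principal components yields mutually uncorrelated regression inputs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical correlated predictors (e.g., clay and sand content, which
# are inherently negatively correlated); illustrative values only.
n = 150
clay = rng.uniform(5, 45, n)
sand = 90 - 1.2 * clay + rng.normal(0, 4, n)
X = np.column_stack([clay, sand])

# Centre and scale the predictors, then rotate onto the principal axes:
# the columns of Z are mutually uncorrelated, so a regression on Z is
# free of multicollinearity.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]        # components by decreasing variance
Z = Xc @ eigvecs[:, order]

print(np.corrcoef(Xc, rowvar=False)[0, 1])  # strong negative correlation
print(np.corrcoef(Z, rowvar=False)[0, 1])   # numerically zero
```

Regressing the response on the columns of Z (principal component regression) removes the coefficient instability caused by the original correlation, at the cost of coefficients that must be back-rotated for interpretation in terms of the original soil variables.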
In this chapter, we focused on statistical validation, which checks the prediction level (e.g., the coefficient of determination) and the stability of the estimated regression coefficients using the double cross-validation technique.

REFERENCES

Cosby, B.J., Hornberger, G.M., Clapp, R.B., Ginn, T.R., 1984. A statistical exploration of the relationships of soil moisture characteristics to the physical properties of soil. Water Resour. Res. 20.
Davis, J.C. Statistics and Data Analysis in Geology. John Wiley and Sons, New York.
Draper, N.R., Smith, H., 1966. Applied Regression Analysis. John Wiley, New York.
Freund, R.J., Littell, R. SAS System for Linear Models. SAS Institute Inc., Cary, NC.
Green, P.E., Caroll, J.D., 1978. Analyzing Multivariate Data. John Wiley, New York.
Gunst, F.R., Mason, R.L. Regression Analysis and Its Applications: A Data Oriented Approach. Marcel Dekker Inc., New York.
Herbst, M., Diekkrüger, B. The influence of the spatial structure of soil properties on water balance modeling in a microscale catchment. Physics and Chemistry of the Earth, Part B 27.
Jollife, I.T. Principal Component Analysis. Springer Verlag, New York.
Kvålseth, T.O., 1985. Cautionary note about R². The American Statistician 39.
Müller, T.G., Pierce, F.J., Schabenberger, O., Warncke, D.D., 2001. Map quality for site-specific management. Soil Sci. Soc. Am. J. 65.
Pachepsky, Y., Rawls, W.J., 2003. Soil structure and pedotransfer functions. Eur. J. Soil Sci. 54.
Pachepsky, Y., Shcherbakov, R.A., Varallyay, G., Raijkai, K., 1982. Statistical analysis of water retention relations with other physical properties of soils. Pochvovedenie 2, 42-52 (in Russian, English abstract).
Puckett, W.E., Dane, J.H., Hajek, B.F., 1985. Physical and mineralogical data to determine soil hydraulic properties. Soil Sci. Soc. Am. J. 49.
Rawls, W.J., Brakensiek, D.L., 1985. Prediction of soil water properties for hydrologic modelling. In: Jones, E.B., Ward, T.J. (Eds.), Proceedings of the Symposium on Watershed Management in the Eighties, April 30-May 1, 1985, Denver, CO. Am. Soc. Civil Engng, New York, NY.
Rawls, W.J., Brakensiek, D.L., 1989. Estimation of soil water retention and hydraulic properties. In: Morel-Seytoux, H.J. (Ed.), Unsaturated Flow in Hydrological Modeling: Theory and Practice. Kluwer Academic Publishers, Dordrecht.
Scheinost, A.C., Sinowski, W., Auerswald, K., 1997. Regionalization of soil water retention curves in a highly variable soilscape. I. Developing a new pedotransfer function. Geoderma 78.
Van Genuchten, M.T., 1980. A closed-form equation for predicting the hydraulic conductivity of unsaturated soils. Soil Sci. Soc. Am. J. 44.
Vereecken, H. Pedotransfer functions for the generation of hydraulic properties for Belgian soils. Doctoral thesis No. 171, Faculty of Agricultural Sciences, Katholieke Universiteit Leuven.
Vereecken, H., Feyen, J., Maes, J., Darius, P., 1989. Estimating the soil moisture retention characteristic from texture, bulk density and carbon content. Soil Sci. 148.
Vereecken, H., Maes, J., Feyen, J., 1990. Estimating unsaturated hydraulic conductivity from easily measured soil properties. Soil Sci. 149.
Verheye, W., Ameryckx, J. Mineral fractions and classification of soil texture. Pedologie 2.
Wösten, J.H.M., 1997. Pedotransfer functions to evaluate soil quality. In: Gregorich, E.G., Carter, M.R. (Eds.), Soil Quality for Crop Production and Ecosystem Health. Developments in Soil Science, Vol. 25. Elsevier, Amsterdam.
Wösten, J.H.M., Lilly, A., Nemes, A., Le Bas, C., 1999. Development and use of a database of hydraulic properties of European soils. Geoderma 90.


More information

Applied Regression Analysis

Applied Regression Analysis Applied Regression Analysis Chapter 3 Multiple Linear Regression Hongcheng Li April, 6, 2013 Recall simple linear regression 1 Recall simple linear regression 2 Parameter Estimation 3 Interpretations of

More information

Tutorial 5: Power and Sample Size for One-way Analysis of Variance (ANOVA) with Equal Variances Across Groups. Acknowledgements:

Tutorial 5: Power and Sample Size for One-way Analysis of Variance (ANOVA) with Equal Variances Across Groups. Acknowledgements: Tutorial 5: Power and Sample Size for One-way Analysis of Variance (ANOVA) with Equal Variances Across Groups Anna E. Barón, Keith E. Muller, Sarah M. Kreidler, and Deborah H. Glueck Acknowledgements:

More information

We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model.

We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model. Statistical Methods in Business Lecture 5. Linear Regression We like to capture and represent the relationship between a set of possible causes and their response, by using a statistical predictive model.

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

Investigating Models with Two or Three Categories

Investigating Models with Two or Three Categories Ronald H. Heck and Lynn N. Tabata 1 Investigating Models with Two or Three Categories For the past few weeks we have been working with discriminant analysis. Let s now see what the same sort of model might

More information

Correlation and Simple Linear Regression

Correlation and Simple Linear Regression Correlation and Simple Linear Regression Sasivimol Rattanasiri, Ph.D Section for Clinical Epidemiology and Biostatistics Ramathibodi Hospital, Mahidol University E-mail: sasivimol.rat@mahidol.ac.th 1 Outline

More information

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that

This model of the conditional expectation is linear in the parameters. A more practical and relaxed attitude towards linear regression is to say that Linear Regression For (X, Y ) a pair of random variables with values in R p R we assume that E(Y X) = β 0 + with β R p+1. p X j β j = (1, X T )β j=1 This model of the conditional expectation is linear

More information

Ch. 1: Data and Distributions

Ch. 1: Data and Distributions Ch. 1: Data and Distributions Populations vs. Samples How to graphically display data Histograms, dot plots, stem plots, etc Helps to show how samples are distributed Distributions of both continuous and

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

STAT 501 EXAM I NAME Spring 1999

STAT 501 EXAM I NAME Spring 1999 STAT 501 EXAM I NAME Spring 1999 Instructions: You may use only your calculator and the attached tables and formula sheet. You can detach the tables and formula sheet from the rest of this exam. Show your

More information

NAG C Library Chapter Introduction. g01 Simple Calculations on Statistical Data

NAG C Library Chapter Introduction. g01 Simple Calculations on Statistical Data NAG C Library Chapter Introduction g01 Simple Calculations on Statistical Data Contents 1 Scope of the Chapter... 2 2 Background to the Problems... 2 2.1 Summary Statistics... 2 2.2 Statistical Distribution

More information

Supplementary material (Additional file 1)

Supplementary material (Additional file 1) Supplementary material (Additional file 1) Contents I. Bias-Variance Decomposition II. Supplementary Figures: Simulation model 2 and real data sets III. Supplementary Figures: Simulation model 1 IV. Molecule

More information

Revised: 2/19/09 Unit 1 Pre-Algebra Concepts and Operations Review

Revised: 2/19/09 Unit 1 Pre-Algebra Concepts and Operations Review 2/19/09 Unit 1 Pre-Algebra Concepts and Operations Review 1. How do algebraic concepts represent real-life situations? 2. Why are algebraic expressions and equations useful? 2. Operations on rational numbers

More information

Chapter 4. Regression Models. Learning Objectives

Chapter 4. Regression Models. Learning Objectives Chapter 4 Regression Models To accompany Quantitative Analysis for Management, Eleventh Edition, by Render, Stair, and Hanna Power Point slides created by Brian Peterson Learning Objectives After completing

More information

Applied Statistics and Econometrics

Applied Statistics and Econometrics Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple

More information

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS NOTES FROM PRE- LECTURE RECORDING ON PCA PCA and EFA have similar goals. They are substantially different in important ways. The goal

More information

Analysing data: regression and correlation S6 and S7

Analysing data: regression and correlation S6 and S7 Basic medical statistics for clinical and experimental research Analysing data: regression and correlation S6 and S7 K. Jozwiak k.jozwiak@nki.nl 2 / 49 Correlation So far we have looked at the association

More information

Linear Regression for Air Pollution Data

Linear Regression for Air Pollution Data UNIVERSITY OF TEXAS AT SAN ANTONIO Linear Regression for Air Pollution Data Liang Jing April 2008 1 1 GOAL The increasing health problems caused by traffic-related air pollution have caught more and more

More information

Chapter 13. Multiple Regression and Model Building

Chapter 13. Multiple Regression and Model Building Chapter 13 Multiple Regression and Model Building Multiple Regression Models The General Multiple Regression Model y x x x 0 1 1 2 2... k k y is the dependent variable x, x,..., x 1 2 k the model are the

More information