Chapter 1 STATISTICAL REGRESSION. H. Vereecken* and M. Herbst
Institut Agrosphäre, ICG-IV, Forschungszentrum Jülich GmbH, Leo-Brandt-Straße, Jülich, Germany. *Corresponding author.

Many of the available and well-established PTFs for the prediction of soil hydraulic properties from continuous soil properties are based on statistical regressions (Pachepsky et al., 1982; Cosby et al., 1984; Rawls and Brakensiek, 1985, 1989; Puckett et al., 1985; Vereecken et al., 1989, 1990; Wösten et al., 1997, 1999; Scheinost et al., 1997). Statistical regression is concerned with the analysis and construction of dependence structures between dependent (response) variables, such as parameters describing the moisture retention curve, and independent (predictor) variables, e.g., bulk density or textural information. Depending upon the objectives, the process of regression will differ, but it is possible to construct a general modeling approach, as proposed by Draper and Smith (1966). They distinguish the following three phases. The first phase encompasses the planning stage: the problem is defined, the objectives are specified, the a priori knowledge is screened and eventually existing data are gathered; it includes a preliminary data analysis. The second phase is the genuine model building, with the development of the regression models and their testing against the objectives in an iterative way. The third phase is the validation of the obtained models, including the stability of the parameters, prediction over the sample space and evaluation of the model adequacy.

1. OBJECTIVES OF STATISTICAL REGRESSIONS

In general, three main objectives can be distinguished when using statistical regressions to model relations between two sets of variables: (1) prediction; (2) model specification; (3) parameter estimation. In using regression analysis for prediction purposes, the concern is mainly to obtain the best possible estimation of the response variable.
DEVELOPMENTS IN SOIL SCIENCE, Volume 30. © 2004 Elsevier B.V. All rights reserved.

Correct model specification and parameter accuracy are then of secondary importance. In model specification, one is mainly interested in the relative importance of individual predictor variables on the predicted responses. This implies that all variables should be available in the database and that the
model contains the correct functional form of the predictor variables. Only then can the predictor variables be correctly assessed (Gunst and Mason, 1980). Application of regression methodology in order to estimate parameters requires that the model is correctly specified, the predictions are accurate and the data allow a good estimation. Limitations of the database and the inability to measure all relevant predictors constrain the estimation of the parameters. The choice of the objective criterion or criteria to evaluate the developed model is determined by the objectives. The objective function (criterion) should be the quantitative expression of the modeling objective. A widely accepted objective criterion is the coefficient of determination (r²), which evaluates the performance of the model in explaining the variation in the data. Most of the statistically based PTFs are either multiple linear regression equations or polynomials of nth order. Multiple linear regression is a common statistical tool used for the prediction of the response variable y from a number of n predictor variables x_i. A multiple linear regression equation can be written as (Herbst and Diekkrüger, 2002):

y = a + Σ_{i=1}^{n} b_i x_i + ε    (1)

with the constant a (intercept), the regression coefficients b_i and the error ε. A nonlinear regression equation based on a second-order polynomial has the following form:

y = a + Σ_{i=1}^{n} (b_i x_i + c_i x_i²) + ε    (2)

where, besides the intercept a, for every predictor variable x_i two regression coefficients b_i and c_i have to be determined (Rawls and Brakensiek, 1985).
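A regression of the form of equation (1) can be fitted by ordinary least squares. The following minimal sketch uses synthetic data; the predictor names (sand, clay, bulk density) and all coefficient values are assumptions chosen for illustration, not values from this chapter.

```python
import numpy as np

# Fit the multiple linear regression of equation (1),
# y = a + sum_i b_i * x_i + e, by ordinary least squares.
rng = np.random.default_rng(0)
n_obs = 100
sand = rng.uniform(5, 90, n_obs)      # hypothetical % sand
clay = rng.uniform(2, 50, n_obs)      # hypothetical % clay
bd = rng.uniform(1.1, 1.7, n_obs)     # hypothetical bulk density, g cm^-3

# Synthetic response generated from known coefficients plus noise.
y = 0.8 - 0.001 * sand - 0.002 * clay - 0.25 * bd + rng.normal(0, 0.01, n_obs)

# Design matrix with an intercept column; lstsq returns [a, b1, b2, b3].
X = np.column_stack([np.ones(n_obs), sand, clay, bd])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b = coef[0], coef[1:]
print("intercept:", round(a, 3), "slopes:", np.round(b, 4))
```

A second-order polynomial as in equation (2) is fitted the same way, by adding squared-predictor columns to the design matrix.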
2. PRELIMINARY ANALYSIS OF SOIL DATA

Different techniques are available to analyze the available soil data, ranging from simple descriptive statistics (first and second moments of the distribution, range) over graphical techniques (scatter plots) to multivariate statistical methods.

2.1. Simple data analysis

Scatter plots, plotting different soil properties of interest against selected response variables, are extremely useful in detecting trends and extreme measurements. The latter can also be done by means of biplots. One should be cautious, however, about deleting so-called outliers at this stage, except when it is clear that these observations are incorrectly specified or measured. Scatter plots give information with regard to the linear or nonlinear behavior of variables and about the kind of transformation to be performed to eliminate nonlinearity. Transformations of the response variable are to be considered when different possible predictor variables show the same nonlinear behavior with respect to the response.
Figure 1. Scatter plots of (a) the saturated water content versus bulk density and (b) the log-transformed van Genuchten parameter α versus sand content, from the data set used by Vereecken et al. (1989).

Figure 1 gives two examples of scatter plots (Vereecken, 1988). Figure 1a reveals the linear relationship between the saturated water content θ_s (response) and bulk density (predictor), while Figure 1b exhibits the positive correlation between the sand content and the log-transformed van Genuchten α (Van Genuchten, 1980). With statistical inference and hypothesis testing in mind, it is interesting to examine the distributions of the response variables. Examination of sample distributions gives information about transformations to obtain distributions more similar to the normal distribution, which is a precondition for the statistical regression techniques explained in Section 3. Frequently used numerical tests to evaluate the normality of a distribution are the Kolmogorov–Smirnov and Shapiro–Wilk statistics. The Kolmogorov–Smirnov statistic is usually used for data sets including more than 50 observations, while in other cases the Shapiro–Wilk statistic is used. The one-sample Kolmogorov–Smirnov test calculates the D-value, which is the maximum absolute difference between the cumulative sample distribution and the cumulative distribution of a normal population. A two-sample Kolmogorov–Smirnov test is used to test whether two samples are drawn from the same population. Rather than measuring differences in the means and variances of the populations, the Kolmogorov–Smirnov statistic measures differences in shape. The test statistic is a function of the sample size and can be either one- or two-tailed, testing the null hypothesis that the sample data are random samples from a normal distribution. Critical values for a specific level of significance can be looked up in special tables in order to decide whether or not the null hypothesis is to be rejected.
For sample sets smaller than 50 observations, the Shapiro–Wilk statistic should be applied. This statistic is the ratio of the best estimator of the variance to the usual corrected sum of squares estimator of the variance. The value of W ranges from 0 to 1, with small values leading to rejection of the null hypothesis that the sample is drawn from a normally distributed population.
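The one-sample Kolmogorov–Smirnov D-value described above can be sketched directly: it is the maximum absolute difference between the empirical CDF and a normal CDF. In this sketch the mean and standard deviation are estimated from the sample itself, which strictly makes it the Lilliefors variant of the test (its critical values come from different tables); the sample data are synthetic.

```python
import math
import numpy as np

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_d_statistic(sample):
    """One-sample Kolmogorov-Smirnov D: the maximum absolute difference
    between the empirical CDF and a normal CDF fitted to the sample."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    mu, sigma = x.mean(), x.std(ddof=1)
    cdf = np.array([normal_cdf(v, mu, sigma) for v in x])
    ecdf_hi = np.arange(1, n + 1) / n   # ECDF just after each point
    ecdf_lo = np.arange(0, n) / n       # ECDF just before each point
    return max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))

rng = np.random.default_rng(1)
d_normal = ks_d_statistic(rng.normal(0, 1, 200))        # near-normal sample
d_lognormal = ks_d_statistic(rng.lognormal(0, 1, 200))  # skewed sample
print(d_normal, d_lognormal)  # the skewed sample gives the larger D
```

The computed D is then compared against the tabulated critical value for the chosen significance level.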
Information about the shape of a distribution can be obtained from its third and fourth moments. The third moment measures the skewness S of the distribution:

S = n / [(n − 1)(n − 2)] · Σ_{i=1}^{n} (X_i − X̄)³ / s³    (3)

where s is the standard deviation, X_i is the observation value, X̄ is the mean value and n is the number of observations. A population is said to be positively or right skewed if the tail occurs in the largest values; a negatively or left skewed population has the tail in the smallest values (Figure 2). For normally distributed data, the skewness equals zero.

Figure 2. Histogram (bars) and cumulative relative frequency (solid line) of van Genuchten's α (cm⁻¹) and the log-transformed α, from the data set used by Vereecken et al. (1989), n = 182 observations.

The fourth moment of the distribution measures the heaviness of the tails of a distribution. A heavily tailed distribution has a positive kurtosis; a flat distribution with short tails has a negative kurtosis:

K = n(n + 1) / [(n − 1)(n − 2)(n − 3)] · Σ_{i=1}^{n} (X_i − X̄)⁴ / s⁴ − 3(n − 1)² / [(n − 2)(n − 3)]    (4)

The measure of kurtosis K is zero for a normally distributed population. An alternative to the numerical tests concerning the distribution of response variables is the use of graphical aids. Basically, there are three types of data plots: (1) the stem and leaf plot; (2) the box or schematic plot; (3) the normal probability plot.
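Equations (3) and (4) translate directly into code. The sketch below applies them to a synthetic right-skewed sample and to its log transform; the data are assumptions for illustration only.

```python
import numpy as np

def sample_skewness(x):
    """Skewness S of equation (3): n/((n-1)(n-2)) * sum(((x_i - mean)/s)^3)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = x.std(ddof=1)
    return n / ((n - 1) * (n - 2)) * np.sum(((x - x.mean()) / s) ** 3)

def sample_kurtosis(x):
    """Kurtosis K of equation (4); zero for a normally distributed population."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = x.std(ddof=1)
    term = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * np.sum(((x - x.mean()) / s) ** 4)
    return term - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

rng = np.random.default_rng(2)
x = rng.lognormal(0.0, 0.5, 5000)        # a right-skewed sample
print(sample_skewness(x), sample_kurtosis(x))   # both positive
print(sample_skewness(np.log(x)))               # near zero after log transform
```

The drop in skewness after the log transform mirrors the behavior of α and ln(α) in Figure 2.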
They often give a better overall view of the extent of non-normal behavior of the response variables. The examination of the distribution of variables is mainly based on the analysis of these plots. The example in Table 1 reveals the strong increase in the W-value for the log-transformed α and n values compared to the original variables (see also Figure 2).

Table 1. The W-value, the skewness and the kurtosis of the parameters θ_r, θ_s, α, n, ln(α) and ln(n) of the moisture retention characteristic of the data set used by Vereecken et al. (1989). The soil hydraulic properties are based on the Van Genuchten model (θ − θ_r)/(θ_s − θ_r) = (1 + |αh|^n)^(−m), with residual water content θ_r, saturated water content θ_s, inverse of the bubbling pressure α, shape parameters n and m, and pressure head h (Van Genuchten, 1980), with m = 1; N = 182 observations.

The distribution of θ_r resembles the normal distribution to the least extent, having the lowest W-value and high values of skewness and kurtosis. Various transformations, however, did not result in a much higher W-value. The high probabilities of rejecting the null hypothesis are likely a result of the sensitivity of the Shapiro–Wilk test towards observations in the tails of distributions.

2.2. Multivariate methods

Principal component analysis is a powerful tool for analyzing the structure of data matrices and the interdependence between variables. Its success depends upon the existence of correlations among at least some of the original variables. New variables, called principal components, are formed such that they are orthogonal to each other and uncorrelated. The first principal component explains the largest part of the variation. The second one explains a part of the remaining variation, and so on until all variation has been accounted for. In total, the number of components is the same as the total number of variables.
Principal component analysis is mainly applied on the standardized data matrix X_s or on the correlation matrix R, which can be written as:

X_d = X − I X_m^T
X_s = X_d D^{−1/2}
C = X_d^T X_d / (N − 1)
R = D^{−1/2} C D^{−1/2}    (5)
where X is the data matrix, D is the diagonal matrix of the variances, I is the unit column vector, X_m^T is the row vector of means of the different variables, X_d is the matrix of mean-corrected scores, C is the variance–covariance matrix and N is the number of observations. The choice between the X_s and R matrices depends upon the units the variables are measured in. Typically, when the variables have been measured in the same units, the variance–covariance matrix is preferred. Even then, some authors (Jollife, 1986) prefer working with the correlation matrix because of the opportunity of direct comparison between the analysis results obtained from different sets of variables. Davis (1973), however, states that in geologic studies on the granulometric composition of materials, where the relative magnitudes of variables are important, it is better to use the variance–covariance matrix. Derivation of the principal components is based on the solution of the eigenvalue problem of the correlation matrix or the X_s matrix, subjected to different constraints. For the correlation matrix, this can be expressed mathematically as:

(R − λ_j I) X_j = 0    (6)

with |R − λ_j I| = 0 for non-zero X_j vectors and X_j^T X_j = 1. In matrix notation this becomes:

R U = U D_e    (7)

where the columns of U contain the eigenvectors X_j of R and D_e is a diagonal matrix containing the eigenvalues of R, which are equal to the variances of the respective principal components, such that:

R = U D_e U^{−1}    (8)

Because R is a symmetric matrix, U^{−1} = U^T and the previous equation becomes:

R = U D_e U^T    (9)

The orientation of the different principal component axes can be found from the columns of the rotation matrix U.
The non-standardized principal component scores Z are calculated as:

Z = X_s U    (10)

Standardizing these scores, so that each component explains the same amount of variability, is done as follows:

Z_s = X_s U D_e^{−1/2}    (11)

Principal component analysis resulting in standardized scores is sometimes given the name principal factor analysis. The component loading matrix, containing the correlation
of the original variables with the principal components, is calculated as:

F = U D_e^{1/2}    (12)

Reproduction of the original matrix of standardized scores can be obtained from:

X_s = Z_s F^T = Z_s (Z_s^T X_s)    (13)

Using equation (13), it is possible to reconstruct the original variables in a plot whose x- and y-axes are given by the first and second principal factors and the factor scores. The length of these reconstructed original variables is a measure of the success of the reconstruction. The position and length of the reconstructed original variables are a measure of their correlation. These graphs, called biplots, enable the user to analyze the relations existing between objects, described in the axes generated by the principal components, and the original variables. Figure 3 is a biplot of the principal component analysis carried out for the data set of Vereecken et al. (1989). This is a plot of the reconstructed original variables and the individual observations on the first two principal factors, exhibiting, e.g., that the vectors representing the clay percentage and θ_r are very close to one another, confirming their positive correlation (see also Table 2). Values of ln(α), ln(n) and the sand fraction are positively correlated, indicating that the larger the amount of sand in the soil, the higher ln(α) and ln(n) become. The values of θ_s and the bulk density, pointing in opposite directions but lying on the same line, are strongly negatively correlated. Carbon content does not seem to be strongly correlated with any of the parameters. The first two principal factors together explain 64% of the variability.

3. MODEL BUILDING

The regression models are constructed in an iterative way. The most important step is the selection of the variables to enter the equation. This can be done either using a priori knowledge and hypothetical reasoning or by trial and error using special regression techniques.
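The sequence of equations (5)–(12) can be sketched with a plain eigendecomposition. The soil-like variables and their correlations below are synthetic assumptions, chosen only so that the first component captures most of the variance.

```python
import numpy as np

# Principal component analysis following equations (5)-(12): standardize
# the data matrix, eigendecompose the correlation matrix R, and form the
# scores Z, standardized scores Z_s and loadings F.
rng = np.random.default_rng(3)
n = 200
sand = rng.normal(40, 15, n)
clay = 60 - 0.6 * sand + rng.normal(0, 5, n)   # strongly correlated with sand
bd = rng.normal(1.4, 0.15, n)                  # roughly independent
X = np.column_stack([sand, clay, bd])

Xd = X - X.mean(axis=0)                  # mean-corrected scores X_d
Xs = Xd / X.std(axis=0, ddof=1)          # standardized scores X_s
R = Xs.T @ Xs / (n - 1)                  # correlation matrix, eq. (5)

eigvals, U = np.linalg.eigh(R)           # eigenvalues D_e, eigenvectors U
order = np.argsort(eigvals)[::-1]        # sort by explained variance
eigvals, U = eigvals[order], U[:, order]

Z = Xs @ U                               # principal component scores, eq. (10)
Zs = Z / np.sqrt(eigvals)                # standardized scores, eq. (11)
F = U * np.sqrt(eigvals)                 # component loadings, eq. (12)

explained = eigvals / eigvals.sum()
print("variance explained:", np.round(explained, 3))
```

Because sand and clay are nearly collinear here, the first component dominates, which is exactly the structure a biplot of the first two factors would reveal.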
Routinely applied techniques to select variables are (Gunst and Mason, 1980): (1) all possible regression equations; (2) backward elimination; (3) forward selection; (4) the stepwise method; (5) the stagewise method. A discussion of each of these techniques can be found in many statistics books. Whenever using these methods to construct regression models, care should be taken not to eliminate potential predictor variables. This is one of the reasons why it is important to check the mathematical relationship between the potential predictor variables and the response during the data analysis (e.g., exponential or square-root dependencies). Once an acceptable model is obtained, it is examined. General examination of a regression model incorporates the following topics: (1) verification of the error assumptions; (2) assessment of the goodness-of-fit of the equation; (3) examination of model misspecification;
(4) determination of confidence intervals on estimated regression coefficients; (5) determination of confidence intervals on estimated response variables; (6) detection of outliers.

Figure 3. Plot of the reconstructed original variables and the individual observations on the first two principal factors. The length of the reconstructed variables is tripled for clarity. log(α) = log_e(α); log(n) = log_e(n); BD = dry bulk density; C% = percent organic carbon. Data points are represented by the following symbols: U = heavy clay, E = clay, A = clay silt loam, L = sandy silt loam, P = light sand loam, S = loamy sand, Z = sand, according to the Belgian textural classes (Verheye and Ameryckx, 1984).

A special problem related to the estimation of parameters and their confidence intervals is multicollinearity. Multicollinearity is the problem of redundant information in the predictor set and is defined as an approximate linear dependence between predictor variables. An extreme form of multicollinearity is exact linear dependence. Multicollinearity can be examined pairwise by means of a correlation analysis, while multivariable collinearities can be detected by examination of the eigenvalues and eigenvectors of the X_s^T X_s matrix (Section 2.2). Small eigenvalues reveal multicollinearity, while the large scores of the variables for the corresponding eigenvectors identify the variables involved.

Table 2. Correlation matrix of predictor and response variables and the corresponding significance levels of the data set used by Vereecken et al. (1989); soil hydraulic properties are based on the Van Genuchten model with m = 1; n = 182 observations. Variables: θ_s, θ_r, log(α), log(n), Clay, Sand, Silt, BD and C%, where Clay (<2 μm), Silt (2–50 μm) and Sand (>50 μm) are the textural fractions, BD is the dry bulk density (g cm⁻³) and C% the carbon content.

Multicollinearity has a deleterious effect on the least squares parameter estimation, and regression methodology is very sensitive to this problem. Gunst and Mason (1980) sum up four different facets of the regression analysis that are adversely affected by multicollinearity: (1) the numerical values of estimated coefficients; (2) the variance–covariance matrix (variance inflation); (3) test statistics; (4) predicted responses. Different strategies are available to handle multicollinearity. The first one is to delete one of the parameters involved. The difficulty is often to decide which parameter to delete, especially when a good prediction is important. Another possibility is to use regression estimators that are biased for multicollinearity, through deletion of those eigenvectors defining the multicollinearity. Principal component regression is based on the elimination of eigenvectors with small eigenvalues, resulting in a biased estimate. Another type of principal component regression is given by Freund and Littell (1986), where all predictor variables are transformed to principal components. Because these are all uncorrelated, no need exists to use some kind of variable selection technique. Eigenvalue regression is conceptually the same as the first type of principal component regression, except that the eigenvalues and eigenvectors are extracted differently and the criteria for deleting multicollinear variables differ. Yet another alternative is ridge regression, where the effect of the eigenvectors defining multicollinearity is strongly reduced by the addition of a small constant K to the diagonal elements of the X_s^T X_s matrix. The key problem of this alternative is to determine the value of this ridge parameter. A correlation matrix helps to choose the predictor variables that should be used for the regression.
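The eigenvalue diagnosis and the ridge remedy described above can be sketched as follows. The near-exact dependence sand + silt + clay ≈ 100 is a plausible textural example, but the data, the coefficient values and the ridge constant K = 1 are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 150
sand = rng.normal(40, 15, n)
silt = rng.normal(35, 10, n)
clay = 100 - sand - silt + rng.normal(0, 0.5, n)  # near-exact linear dependence
y = 0.4 - 0.002 * sand + 0.003 * clay + rng.normal(0, 0.02, n)

# Standardize predictors; the eigenvalues of Xs^T Xs reveal multicollinearity.
X = np.column_stack([sand, silt, clay])
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals = np.linalg.eigvalsh(Xs.T @ Xs)
print("eigenvalues:", np.round(eigvals, 3))   # one near-zero eigenvalue
print("condition number:", round(eigvals.max() / eigvals.min(), 1))

# Ridge regression: add a small constant K to the diagonal of Xs^T Xs.
K = 1.0
yc = y - y.mean()
b_ols = np.linalg.solve(Xs.T @ Xs, Xs.T @ yc)
b_ridge = np.linalg.solve(Xs.T @ Xs + K * np.eye(3), Xs.T @ yc)
print("OLS coefficients:  ", np.round(b_ols, 3))
print("ridge coefficients:", np.round(b_ridge, 3))  # shrunk, more stable
```

The inflated OLS coefficients shrink towards more stable values under the ridge penalty, at the cost of a small bias.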
Table 2 shows the correlation matrix of the Van Genuchten parameters (response variables), with parameter m = 1, and textural information, bulk density and carbon content (predictor variables). A significant correlation between the response variables ln(α) and ln(n) can be found, which can be explained by the fact that soils having a clearly defined air entry value, like coarsely textured soils, and thus a relatively large ln(α), are in general characterized by a narrow pore size distribution, represented by a large ln(n) value. The opposite is true for the finer textured soils. Among the predictor variables a clear correlation exists, e.g., between the silt and the sand content, indicating that these parameters should not be combined in one regression equation, whereas the high correlation coefficient between θ_s and bulk density suggests the use of bulk density for the prediction of θ_s.

3.1. Model fit

Evaluation of the goodness-of-fit of a model is based on the partitioning of the total sum of squares (TSS), representing the total variability in the database, expressed in an analysis of variance table. The TSS can be partitioned into three components:

TSS = SSM + SSR + SSE    (14)

where SSR is the sum of squares attributable to the model, and TSS is equal to:

TSS = Y^T Y    (15)
where Y is the column vector representing the dependent variable and Y^T is its transpose. SSM is the sum of squares attributable to the mean:

SSM = n^{−1} (I^T Y)²    (16)

where I^T is the transposed unit column vector. SSE, the residual sum of squares, is written as:

SSE = e^T e    (17)

where e is the error vector and e^T is its transpose. In parallel with the TSS, the total degrees of freedom (d_f) can be partitioned as: SSM: 1 d_f; SSR: p d_f; SSE: n − p − 1 d_f; and TSS: n d_f. Most analysis packages give the corrected TSS in the form of an ANOVA table, expressed as:

CSS = TSS − SSM    (18)

Dividing each of the right-hand-side terms by its degrees of freedom results in the respective mean squares. Ratios of mean squares, called F-ratios, can be used to test hypotheses regarding the model parameters. An important F-test is:

F_r = MSR / MSE    (19)

F_r is used to check whether the model is capable of explaining the variation in the data. The F-test is designed to perform simultaneous hypothesis tests on model parameters, while the t-test is used for individual hypothesis tests on the parameters. An important value to evaluate the goodness-of-fit is the coefficient of determination R². It is a measure of the amount of variability explained by the model. A possible way to calculate it is:

R² = SSR / (SSR + SSE) = SSR / (TSS − SSM)    (20)

According to Kvålseth (1985), there is no unanimous agreement on the mathematical description of R², and at least eight different equations are in use. For some nonlinear models, certain expressions for the coefficient of determination yield values greater than one, as is the case for the equation mentioned above. To account for the number of variables in the model, an adjusted coefficient of determination is used, which can be written as:

R²_adj = 1 − (1 − R²)(n − 1)/(n − m − 1)    (21)

where n is the number of observations and m is the number of variables.
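The ANOVA partition of equations (14)–(21) can be sketched for a least-squares fit; the data below are synthetic and the coefficient values are assumptions.

```python
import numpy as np

# ANOVA partition for a linear model fitted by least squares, following
# equations (14)-(21).
rng = np.random.default_rng(5)
n, m = 80, 2                              # observations, predictor variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(0, 0.2, n)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ coef                          # residual vector

TSS = y @ y                               # eq. (15)
SSM = y.sum() ** 2 / n                    # eq. (16)
SSE = e @ e                               # eq. (17)
SSR = TSS - SSM - SSE                     # from the partition, eq. (14)

MSR = SSR / m                             # SSR has m degrees of freedom
MSE = SSE / (n - m - 1)                   # SSE has n - m - 1 degrees of freedom
F_r = MSR / MSE                           # eq. (19)

R2 = SSR / (TSS - SSM)                    # eq. (20)
R2_adj = 1 - (1 - R2) * (n - 1) / (n - m - 1)   # eq. (21)
print(round(F_r, 1), round(R2, 3), round(R2_adj, 3))
```

A large F_r relative to the tabulated F critical value indicates that the model explains the variation in the data; R²_adj is always slightly below R².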
3.2. Poor model specification

Two main causes of the poor specification of a model exist: the omission of variables and/or the poor formulation of the functional expression of some or all variables in the model. Preliminary indications concerning the latter type of misspecification can be found from scatter plots relating model variables (response, predictor) to one another (Figure 1). At the beginning of the statistical model building, the necessary transformations of variables should be performed; otherwise important predictors can be lost. Once a model is built, misspecification of both types can be deduced from two types of residual plots: plots of residuals versus predictor variables (RVP plots) and partial residual plots (PR plots). Raw residuals, studentized residuals or standardized residuals can be used, but most frequently the plots involve the raw residuals. RVP plots provide information regarding the functional form of the predictor variables and the need for extra variables such as cross products or quadratic terms. PR plots also give information about the correct functional form of predictors, but in addition assess the nonlinearity in a predictor variable and the importance of a predictor variable in the presence of others. These plots can only be used when the model is linear in the parameters. A numerical analytical technique to assess model inadequacy is the lack of fit test (Draper and Smith, 1966). This test can be performed if repeated measurements are available. The availability of repeated measurements makes it possible to calculate the sum of squared errors due to pure error and to partition the total sum of squared errors into two components:

SSE = SSE_p + SSE_l    (22)

where SSE_p is the pure error component and SSE_l the component due to lack of fit.
The test of whether this lack of fit is significant is performed with the following ratio:

F = MSE_l / MSE_p    (23)

which is an F-test, with MSE_p equal to:

MSE_p = [Σ_{i=1}^{k} Σ_{j=1}^{n_i} (Y_ij − Ȳ_i)²] / (Σ_{i=1}^{k} n_i − k)    (24)

where k is the number of different points in the prediction range where measurements have been made and n_i is the number of repeated observations at a given point of the prediction range. The MSE (mean squared error) is obtained from the regression analysis. The difference between the MSE and the MSE_p gives the MSE due to lack of fit. The ratio given in equation (23) is then compared against the 100(1 − a)% point of an F-distribution. If this ratio is not significant, there is no reason to doubt the adequacy of the model. In the opposite case, there is a considerable bias term and attempts should be made to discover where and how the inadequacy is generated. Model misspecification due to the
exclusion of important variables may not be detected with this technique, owing to the fact that both MSE_p and MSE_l are then biased. In order to use the lack of fit test, repeated measurements should be available. Although often no real repeated measurements (identical sets of predictor values) are available, the error component can be estimated by considering the hydraulic parameters for the same site and horizon as repeated measurements.

3.3. Confidence intervals on estimated soil properties

Two types of intervals related to the response variables can be distinguished: (1) the confidence interval on the expected responses; and (2) the prediction interval for future responses. For the first case, the confidence interval can be written as:

Ŷ ± t_{ν, 1−a/2} · s · √(u_o^T S u_o)    (25)

where u_o is the column vector for a specific set of values of the predictor variables, S the variance–covariance matrix, s the root mean squared residual, ν the degrees of freedom (n − p − 1) and 1 − a/2 the confidence level, with a being the level of significance. The limits for the prediction interval for a future observation are:

Ŷ ± t_{ν, 1−a/2} · s · √(1 + u_o^T S u_o)    (26)

3.4. Outlier detection

The purpose of the detection of outliers is to identify observations having extremely large residuals, which do not fit the pattern of the remaining data. Identification of outliers remains, even with the above definition, a very subjective and risky business and asks for careful examination. To detect outliers, the majority of plots and tests make use of studentized residuals. These studentized residuals behave more like standard normal deviates than either raw or standardized residuals, because they are divided by their standard error. Even so, raw and standardized residuals are still frequently used. It is best to combine all three types of residuals in an analysis.
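The interval formulas of equations (25) and (26) can be evaluated numerically. In this hedged sketch the data are synthetic, S is taken to be (X^T X)^{−1} so that s²·u_o^T S u_o is the variance of the fitted response, and, since ν is large here, the t quantile is approximated by the normal quantile.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(6)
n = 80
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(0, 0.2, n)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ coef
nu = n - X.shape[1]                      # degrees of freedom n - p - 1
s = np.sqrt(e @ e / nu)                  # root mean squared residual
S = np.linalg.inv(X.T @ X)               # (unscaled) variance-covariance matrix

u_o = np.array([1.0, 0.0, 0.0])          # predictor values for a new point
y_hat = u_o @ coef
t = NormalDist().inv_cdf(0.975)          # approx. t_(nu, 0.975) for large nu

half_conf = t * s * np.sqrt(u_o @ S @ u_o)      # eq. (25): expected response
half_pred = t * s * np.sqrt(1 + u_o @ S @ u_o)  # eq. (26): future response
print("confidence half-width:", round(half_conf, 3))
print("prediction half-width:", round(half_pred, 3))
```

The prediction interval is always wider than the confidence interval, because it includes the residual variance of a single future observation.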
Commonly used graphical methods are scatter plots of the original response variables against predictor variables, relative frequency histograms of studentized residuals, plots of residuals against predictors or fitted responses, and plots of residuals against deleted residuals detected as outliers. A deleted residual is defined as:

r_(−i) = Y_i − u_i^T b̂_(−i)    (27)

where u_i^T is the ith row of the data matrix and b̂_(−i) is the regression coefficient vector estimated from the n − 1 observations with the ith case removed. The deleted residuals follow a Student t distribution with n − p − 2 degrees of freedom, and their values can be obtained from a transformation of the raw residuals.
Most of the statistical measures developed to detect outliers are based on evaluating the effect of the deletion of an observation on the estimated regression coefficient vector. These statistics measure the distance between the regression coefficient vector estimated from the full observation set and the coefficient vector estimated with one observation deleted. The concept is based on the multivariate confidence region for the regression coefficient vector, written as:

(b̂ − b)^T X^T X (b̂ − b) / [(p + 1) MSE] ≤ F_{1−a}, with (p + 1, n − p − 1) degrees of freedom    (28)

A measure for the distance or closeness between b̂ and b̂_(−i) is:

D_i = (b̂ − b̂_(−i))^T X^T X (b̂ − b̂_(−i)) / [(p + 1) MSE]    (29)

The equation above defines the size of the confidence region containing b̂_(−i), by finding the F_{1−a}(p + 1, n − p − 1) value corresponding to the D_i value and evaluating the (1 − a) confidence level. In practice, b̂_(−i) should stay close to b̂; if that is not the case, the observation should be rejected. Different simplifications of the previous equation have been introduced:

D_i = h_ii r²_(−i) / [(p + 1) MSE]  or  D_i = [t_i² / (p + 1)] · [h_ii / (1 − h_ii)]    (30)

with h_ii the diagonal elements of X(X^T X)^{−1} X^T, measuring the influence of each observation, t_i the ith studentized residual and r_(−i) the deleted residual. Separate examination of t_i, h_ii/(1 − h_ii) and D_i is useful in deciding whether or not an observation is to be deleted. The ratio h_ii/(1 − h_ii) denotes the influence that the associated response Y_i has on the determination of b̂. Even with all these techniques and graphs, one has to be careful in deleting observations, because this often leads to forcing a linear relationship through data exhibiting nonlinear behavior or showing interaction effects. This is especially true when not enough data are available over the complete range of interest. Using forced equations for prediction purposes can lead to serious mispredictions.
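The diagnostics above can be sketched in a few lines: leverages from the hat matrix, studentized and deleted residuals, and the two simplified forms of equation (30), which agree algebraically. The single-predictor data and the planted outlier are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(0, 0.3, n)
y[0] += 5.0                               # plant one gross outlier

X = np.column_stack([np.ones(n), x])
p = X.shape[1] - 1                        # number of predictor variables
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ coef
MSE = e @ e / (n - p - 1)

H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
h = np.diag(H)                            # leverages h_ii
t = e / np.sqrt(MSE * (1 - h))            # studentized residuals
r_del = e / (1 - h)                       # deleted residuals r_(-i)

D = t ** 2 / (p + 1) * h / (1 - h)        # eq. (30), second form
D_alt = h * r_del ** 2 / ((p + 1) * MSE)  # eq. (30), first form
print("largest D at observation", int(np.argmax(D)))
```

The planted outlier produces by far the largest D_i, flagging it for the careful examination the text recommends before any deletion.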
4. VALIDATION OF REGRESSION MODELS

Validation of PTFs developed by means of regression analysis is frequently overlooked. This is especially the case when variable selection techniques have been used to construct the model. These techniques display a penchant for capitalizing on chance variation (Green and Caroll, 1978) in specific data sets, introducing too many independent variables. The absence of any validation is mainly due to the lack of additional data to perform the validation on. Basically, the regression models can be validated in two ways. The model can be statistically evaluated, e.g., with the estimated regression coefficients or by cross-validation. In addition, models can be practically evaluated with respect to the further use of the
estimated responses as input in other models. Only the statistical validation will be considered here. A commonly applied method is the double cross-validation as designed by Green and Caroll (1978). The advantages of this method are its simplicity and that no additional data are needed. The method is designed to evaluate the stability of the estimated regression coefficients and the prediction level of the equation. In a first step, the complete set of observations is randomly split into two halves. On each of the halves, a regression analysis is performed using the variables retained for the complete set of observations, as found in the modeling procedure. Then the regression model obtained from the first half is applied to the second half, and vice versa. For each case, the simple correlation between observed and estimated values is calculated. Table 3 gives the results of a stepwise regression analysis to predict θ_s from bulk density and clay content (Vereecken et al., 1989). This regression model is adequate, because the F-value of the lack of fit test is clearly smaller than the critical F-value at 95%. The double cross-validation procedure shows a maximum coefficient of determination of 87%.

Table 3. Results of the stepwise regression analysis and cross-validation for the water content at saturation θ_s (response variable): retained variables and estimated regression coefficients at a = 5%, partial R²_adj, model R²_adj, F-value for lack of fit and critical F-value at 95%. BD is the dry bulk density (g cm⁻³), Clay the clay content (%) and (+) denotes the double cross-validation.

A procedure comparable to the technique described above is the jackknife method (Pachepsky and Rawls, 2003), for which a secondary data set is necessary. It is used to validate the model developed from the prediction data set by predicting the values of the independent data set.
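The double cross-validation procedure can be sketched as follows. The predictor names and the generating equation for θ_s are hypothetical, loosely inspired by the use of bulk density and clay content above, and are not the published regression.

```python
import numpy as np

# Double cross-validation after Green and Caroll (1978): split the data
# in two random halves, fit the retained model on each half, predict the
# other half, and correlate observed with predicted values.
def fit(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(8)
n = 200
bd = rng.uniform(1.1, 1.7, n)             # hypothetical bulk density
clay = rng.uniform(2, 50, n)              # hypothetical clay content
theta_s = 0.95 - 0.35 * bd + 0.001 * clay + rng.normal(0, 0.02, n)

X = np.column_stack([np.ones(n), bd, clay])
idx = rng.permutation(n)
a, b = idx[: n // 2], idx[n // 2:]        # random split into two halves

coef_a, coef_b = fit(X[a], theta_s[a]), fit(X[b], theta_s[b])
# Cross-predict: the model from one half applied to the other half.
r_ab = np.corrcoef(theta_s[b], X[b] @ coef_a)[0, 1]
r_ba = np.corrcoef(theta_s[a], X[a] @ coef_b)[0, 1]
print("cross-validation correlations:", round(r_ab, 3), round(r_ba, 3))
```

Similar correlations from the two halves indicate stable regression coefficients; a marked drop in either correlation would signal a model that capitalizes on chance variation.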
For both cross-validation approaches, the mean absolute error, the mean square error, the root mean square error and graphical plots of predicted versus measured values are used to assess the quality of the predictions (Müller et al., 2001).

5. SUMMARY

We described the use of statistical regression techniques to develop pedotransfer functions (PTFs) that estimate hydraulic properties from basic soil properties. In this section, PTFs are considered as regression models with soil data as predictor variables and hydraulic properties as response variables. Three basic steps, usually applied in an iterative procedure, are presented: analysis of the soil data, the model-building step and the
model validation. Different methods of analyzing the soil data are presented. They range from simple scatter plots, which provide, e.g., information on the type of relationship between variables and the existence of outliers, to multivariate statistical analyses that allow the soil data to be examined in a holistic manner. Analysis of the statistical distribution of predictor and response variables may provide important information for the model-building step. The application of principal component analysis to examine the linear dependence between variables is extremely useful. Because of the inherent correlation between the predictor variables used in PTFs, care needs to be taken to avoid the problem of multicollinearity; it can be circumvented by transforming the predictor variables into independent variables using principal component analysis. The second step in developing a PTF consists of building the regression model. Available methods include, e.g., backward and forward regression techniques and stepwise and stagewise methods. Once a first acceptable model is identified, six basic topics need to be checked: verification of the error assumptions, goodness of fit of the model, identification of model misspecification, examination of confidence intervals on the estimated regression coefficients, examination of confidence intervals on the response variables and, finally, outlier detection. The model-building step is an iterative procedure requiring many iterations to find the best PTF model. The last step consists of the validation of the regression model or PTF. This step is often overlooked, but it is essential in establishing confidence in the developed model. Two basic but completely different methods are available: functional validation of the models and statistical validation. Functional validation examines the variability of the outcome of a simulation model (e.g., of the water balance or of solute transport in soils) for a specific application caused by the uncertainty in the PTF.
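The remedy for multicollinearity recapitulated above, regressing on uncorrelated principal component scores rather than on the correlated predictors themselves, can be illustrated with a short sketch. All data below are synthetic and purely illustrative; no values are taken from the chapter.

```python
# Principal component regression -- a minimal sketch of transforming
# correlated predictors into independent component scores via PCA.
# All data are synthetic and illustrative.
import numpy as np

rng = np.random.default_rng(seed=7)
n = 150

# Two strongly correlated predictors (think of two texture fractions).
x1 = rng.normal(30.0, 8.0, n)
x2 = 0.9 * x1 + rng.normal(0.0, 2.0, n)   # nearly collinear with x1
y = 0.5 + 0.02 * x1 + 0.01 * x2 + rng.normal(0.0, 0.1, n)

# Centre and scale the predictors, then obtain the principal
# components from the singular value decomposition.
X = np.column_stack([x1, x2])
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
scores = Xs @ Vt.T                        # uncorrelated component scores

# The raw predictors are strongly correlated (multicollinearity),
# while the component scores are orthogonal by construction.
r_raw = np.corrcoef(Xs.T)[0, 1]
r_pc = np.corrcoef(scores.T)[0, 1]

# Regress y on the scores (here both components are retained).
A = np.column_stack([np.ones(n), scores])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"correlation of predictors: {r_raw:.2f}; of PC scores: {r_pc:.1e}")
```

Because the scores are orthogonal, each regression coefficient can be estimated independently of the others, which stabilizes the estimates; in practice one would also decide how many components to retain.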
In this chapter, we focused on statistical validation, which aims at checking the validity of the prediction level (e.g., the coefficient of determination) and the stability of the estimated regression coefficients using the double cross-validation technique.

REFERENCES

Cosby, B.J., Hornberger, G.M., Clapp, R.B., Ginn, T.R., 1984. A statistical exploration of the relationships of soil moisture characteristics to the physical properties of soil. Water Resour. Res. 20.
Davis, J.C., Statistics and Data Analysis in Geology. John Wiley and Sons, New York.
Draper, N.R., Smith, H., 1966. Applied Regression Analysis. John Wiley, New York.
Freund, R.J., Littell, R., SAS System for Linear Models. SAS Institute Inc., Cary, NC.
Green, P.E., Caroll, J.D., 1978. Analyzing Multivariate Data. John Wiley, New York.
Gunst, F.R., Mason, R.L., Regression Analysis and Its Applications: A Data Oriented Approach. Marcel Dekker Inc., New York.
Herbst, M., Diekkrüger, B., The influence of the spatial structure of soil properties on water balance modeling in a microscale catchment. Physics and Chemistry of the Earth, Part B 27.
Jollife, I.T., Principal Component Analysis. Springer Verlag, New York.
Kvålseth, T.O., 1985. Cautionary note about R². The American Statistician 39.
Müller, T.G., Pierce, F.J., Schabenberger, O., Warncke, D.D., 2001. Map quality for site-specific management. Soil Sci. Soc. Am. J. 65.
Pachepsky, Y., Rawls, W.J., 2003. Soil structure and pedotransfer functions. Eur. J. Soil Sci. 54.
Pachepsky, Y., Shcherbakov, R.A., Varallyay, G., Raijkai, K., 1982. Statistical analysis of water retention relations with other physical properties of soils. Pochvovedenie 2, 42-52 (in Russian, English abstract).
Puckett, W.E., Dane, J.H., Hajek, B.F., 1985. Physical and mineralogical data to determine soil hydraulic properties. Soil Sci. Soc. Am. J. 49.
Rawls, W.J., Brakensiek, D.L., 1985. Prediction of soil water properties for hydrologic modelling. In: Jones, E.B., Ward, T.J. (Eds.), Proceedings of the Symposium on Watershed Management in the Eighties, April 30-May 1, 1985, Denver, CO. Am. Soc. Civil Engng, New York, NY.
Rawls, W.J., Brakensiek, D.L., 1989. Estimation of soil water retention and hydraulic properties. In: Morel-Seytoux, H.J. (Ed.), Unsaturated Flow in Hydrological Modeling: Theory and Practice. Kluwer Academic Publishers, Dordrecht.
Scheinost, A.C., Sinowski, W., Auerswald, K., 1997. Regionalization of soil water retention curves in a highly variable soilscape. I. Developing a new pedotransfer function. Geoderma 78.
Van Genuchten, M.T., 1980. A closed-form equation for predicting the hydraulic conductivity of unsaturated soils. Soil Sci. Soc. Am. J. 44, 892-898.
Vereecken, H., Pedotransfer functions for the generation of hydraulic properties for Belgian soils. Thesis, Doctoraatsproefschrift Nr. 171, Fakulteit der Landbouwwetenschappen, Katholieke Universiteit Leuven.
Vereecken, H., Feyen, J., Maes, J., Darius, P., 1989. Estimating the soil moisture retention characteristic from texture, bulk density and carbon content. Soil Sci. 148.
Vereecken, H., Maes, J., Feyen, J., 1990. Estimating unsaturated hydraulic conductivity from easily measured soil properties. Soil Sci. 149.
Verheye, W., Ameryckx, J., Mineral fractions and classification of soil texture. Pedologie 2.
Wösten, J.H.M., 1997. Pedotransfer functions to evaluate soil quality. In: Gregorich, E.G., Carter, M.R. (Eds.), Soil Quality for Crop Production and Ecosystem Health. Developments in Soil Science, Vol. 25. Elsevier, Amsterdam.
Wösten, J.H.M., Lilly, A., Nemes, A., Le Bas, C., 1999. Development and use of a database of hydraulic properties of European soils. Geoderma 90.
UNIVERSITY OF TEXAS AT SAN ANTONIO Linear Regression for Air Pollution Data Liang Jing April 2008 1 1 GOAL The increasing health problems caused by traffic-related air pollution have caught more and more
More informationChapter 13. Multiple Regression and Model Building
Chapter 13 Multiple Regression and Model Building Multiple Regression Models The General Multiple Regression Model y x x x 0 1 1 2 2... k k y is the dependent variable x, x,..., x 1 2 k the model are the
More information