Factor analysis and multiple linear regression modeling


Regional Characterization of Water Quality (Proceedings of the Baltimore Symposium, May 1989). IAHS Publ. no. 182, 1989.

Factor analysis and multiple linear regression modelling

Dr. K. S. V. Basivi Reddy
Principal, Kakatiya Institute of Technology & Science, Warangal, India 506 015

Dr. M. Panduranga Rao
Professor of Engineering Geology, Regional Engineering College, Warangal, India 506 004

ABSTRACT

Water quality data obtained from the Warangal urban agglomeration, which is a hard rock terrain, were subjected to multiple regression analysis. Fifty-eight wells were monitored for 5 seasons and each sample was analysed for 13 parameters. Factor analytic studies of the body of data obtained from the Warangal water quality analysis show that only a few factors adequately represent the traits that define the water quality. Sodium, chloride, ionisation, hardness and total dissolved solids (TDS) are grouped under one factor representing salinity, mineralisation of waters and pollution. Another factor is represented by potassium, calcium and magnesium, reflecting the calcium-magnesium dominant nature of Warangal waters. The third factor is covered by sulphates and fluorides, indicating permanent hardness. The fourth factor is covered by pH and bicarbonates and carbonates, reflecting alkalinity and temporary hardness. The fifth factor is nitrate, signifying man-made pollution. This analysis has been used to suggest models for predicting water quality. On the whole, it appears that sodium, either independently or in association with potassium, magnesium, calcium and bicarbonates, is the causal variable for the determination of total dissolved solids. It is also seen that chlorides, either independently or in association with bicarbonates and magnesium, appear to be important variables contributing to TDS.

With the help of these models it is possible to predict the water quality in any water given one predictor value, such as specific conductivity, which in turn clearly indicates TDS.

SIGNIFICANCE

The extensive collection of data on the chemical quality of groundwaters of the Warangal urban agglomeration comprises 13 different properties, which are mostly intercorrelated, making interpretation complex. Factor analysis is a technique which tries to interpret intercorrelated variables to yield meaningful conclusions. The basic assumption in factor analysis is that if the tests in a battery are intercorrelated, they can be transformed suitably to yield uncorrelated derived constructs known as factors. It is possible to interpret these factors for meaningful application. The analysis is, therefore, applied to the 58 samples collected in each of the five seasons: June 1981, October 1981, December 1981, February 1982 and May 1982. These were duly processed for 13 parameters throughout the factor analysis.

THE FACTOR MODEL

The factor analysis technique provides a mathematical model which can be used to expedite the computation of multiple regression statistics by deriving a number of variables. The principal objective of factor analysis is to develop a parsimonious description of the observed variables, and to discover the fundamental or basic traits among them. The technique consists in accounting for the tests and their intercorrelations to determine the minimum number of uncorrelated dimensions, yielding factors which convey all the essential information given by the original set of variables. These dimensions, or factors, in turn help in the identification of basic traits or other general concepts.

There are several variations in the method of solving the factor problem. The method of principal components, based on the following model, is mostly advocated for data reduction jobs (Cooley, W. W. and Lohnes, P. R., 1971). Any standardised test score Z of an individual i can be considered as a linear combination of several underlying factors by a model of the type

$$Z_{ji} = a_{j1} f_{1i} + a_{j2} f_{2i} + a_{j3} f_{3i} + \dots + a_{jp} f_{pi} \qquad (j = 1, 2, \dots, p)$$

where
$Z_{ji}$ = standard score of individual i on test j,
$a_{jp}$ = loading of test j on factor p,
$f_{pi}$ = the amount of the uncorrelated trait measured by factor p which is possessed by individual i.

This method is based on the contention that 13 variables can be represented in a 13-dimensional test space and that the locus of uniform frequency density is essentially a hyperellipsoid. The axes of this ellipsoid correspond to the principal components; each axis thus defines a factor or basic dimension of all the variables which are correlated with each other. A special feature of the solution is that the first principal component is a linear combination of all variables which extracts the maximum of the total variance; the second principal component is orthogonal to the first and extracts the maximum of the residual variance; and so on until all the variance is extracted. In other words, the sum of the variances of all the principal components is equal to the sum of the variances of the original variables. If it is possible to find a smaller set of linear combinations or components of the original variables which account for most of the variance, then considerable parsimony is achieved. In this work the procedure suggested by Harman is adopted.

THE DATA

The data employed for applying the factor analysis techniques are obtained from the investigation described earlier for different seasons. The primary intention in applying factor analysis is to study the chemical parameters involved in the above investigation, to coalesce the abstract properties of the waters and to identify the basic parameters influencing the water quality. The following are the variables considered.

Table 1: List of parameters examined

Variable No.   Parameter
1              pH
2              Sodium
3              Potassium
4              Magnesium
5              Calcium
6              Chloride
7              Nitrate
8              Bicarbonates + Carbonates
9              Sulphate
10             Fluoride
11             Hardness
12             Sum of ions
13             Total Dissolved Solids

ANALYSIS

A computer programme was prepared utilising the standard subroutines on Eigen vector and Varimax rotation techniques. The data were fed to the Integra 1001 system available at Computer Maintenance Corporation Ltd., Hyderabad, and the following results have been tabulated from the computer output:

i) the intercorrelation matrix, means and standard deviations of the 13 variables for the 5 seasons;
ii) the eigenvalues and eigenvectors for the 5 seasons;
iii) the factor matrix for the 5 seasons; and
iv) the rotated factor matrix for the 5 seasons.

The tabular statements of these four characteristics for all five seasons run into 20 tables and are lengthy. Owing to space restrictions these are not included in this paper.

The coefficients of the rotated factor matrix indicate the correlations of the variables with the respective factors and provide a basis for identifying their names. Generally the name selected is governed by the largest correlations with the factor under consideration. The following factors are identified in the case of the data obtained for June 1981:

Factor I: Variables 2, 6, 11, 12, 13
Factor II: Variables 3, 4, 5
Factor III: Variables 9, 10
Factor IV: Variables 1, 8
Factor V: Variable 7

The traits with significant coefficients for Factor I are: Sodium (.923), Chloride (.939), Hardness (.719), Sum of ions (.926) and Total Dissolved Solids (.934).

Factor I: Sodium, chloride, sum of ions, hardness and TDS are grouped under this factor. Sodium and chloride reflect the salinity, while the sum of ions, hardness and TDS indicate the extent of mineralisation in the waters. Chloride is also indicative of pollution.

Factor II: Potassium, calcium and magnesium are grouped under this factor. Calcium and magnesium reflect the Ca-Mg dominant nature of the waters.

Factor III: Sulphate and fluorides are grouped under this factor. Sulphate is an indication of permanent hardness of waters.

Factor IV: pH and HCO3 + CO3 are grouped under this factor, reflecting alkalinity and temporary hardness of waters.

Factor V: Nitrate has been named in this factor, signifying man-made pollution.

Similar analysis was done for the seasons October 1981, December 1981, February 1982 and May 1982. A summary of the results indicates that most of the waters can be considered to have 5 different factors. Generally, Factors I and II indicate salinity and mineralisation, Factor III indicates permanent hardness, Factor IV indicates temporary hardness and Factor V indicates man-made pollution. Therefore, it can be concluded that the Warangal waters have salinity, hardness as well as man-made pollution for all the seasons in almost all the wells.

REGRESSION

A regression problem considers the frequency distribution of one variable when another is held fixed at each of several levels. A correlation problem considers the joint variation of two measurements, neither of which is restricted by the experiment. Correlation is a process by which the degree of association between samples of two variables X and Y is defined. The correlation coefficient is a mathematical definition of that association. The end product of the process of correlation is the correlation coefficient; it is not an equation.

1. MULTIPLE LINEAR REGRESSION

The regression model assumes that some variable Y responds to changes in other X variables. The Y variable is the quantity under study and is known as the 'response' or 'dependent' variable. The X variables are those which exhibit a causal effect on the value of Y and are known as the 'explanatory' or 'independent' variables. The model is expressed by

$$Y = b_0 + b_1 x_1 + \dots + b_k x_k$$

where $b_0, b_1, \dots, b_k$ are the least squares estimators of the unknown parameters; $b_0$ is the intercept, while $b_1, \dots, b_k$ are the regression coefficients. They are chosen in such a way as to minimise the sum of the squared residuals, or deviations from the estimated line. The major issues in the development of this model are:

a) The identification of those variables which have significant and separate effects on the dependent variable.
b) The model must not only provide a good statistical fit to the present-day data but must also be of a logical and meaningful form.
c) The variables must be meaningful in explaining the behaviour of the dependent variable.

With such an equation developed, it is possible to predict future levels of the dependent variable given future predictor indicators.

The adequacy of the model can be tested by the analysis of variance approach. The total sum of squares is decomposed into the regression sum of squares and the error sum of squares:

$$SST = SSR + SSE$$

The multiple correlation coefficient R is defined through

$$R^2 = SSR / SST$$

This indicates the degree of association between the independent variables and the dependent variable. It varies between 0 and 1; the closer it is to 1, the better the fit. The significance of R is that its square, $R^2$, is the decimal fraction of the variation of Y accounted for by the independent variables $x_i$; i.e. if $R^2 = 0.941$ then 94.1% of the total variance in the data is explained by the model.
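The least squares fit and the variance decomposition described above can be sketched for the simplest case of a single predictor, Y = b0 + b1 x. This is only an illustration: the function name `fit_simple_regression` and the sodium and TDS values are invented for the example and are not the Warangal measurements, and the authors' own computations used a multi-predictor FORTRAN package that is not reproduced here.

```python
def fit_simple_regression(x, y):
    """Least squares fit of y = b0 + b1*x, with the SST = SSR + SSE decomposition."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Corrected sums of squares and cross products
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # regression coefficient
    b0 = y_mean - b1 * x_mean      # intercept
    fitted = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_mean) ** 2 for yi in y)               # total sum of squares
    ssr = sum((fi - y_mean) ** 2 for fi in fitted)          # regression sum of squares
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares
    return {"b0": b0, "b1": b1, "SST": sst, "SSR": ssr, "SSE": sse, "R2": ssr / sst}


# Hypothetical sodium and TDS values, for illustration only
sodium = [40.0, 75.0, 110.0, 150.0, 200.0, 260.0]
tds = [560.0, 690.0, 840.0, 980.0, 1150.0, 1420.0]

fit = fit_simple_regression(sodium, tds)
print("TDS = %.2f Na + %.1f" % (fit["b1"], fit["b0"]))
print("SST = %.1f, SSR + SSE = %.1f" % (fit["SST"], fit["SSR"] + fit["SSE"]))
print("R2 = %.3f" % fit["R2"])
```

For a least squares fit with an intercept the two printed sums of squares agree up to rounding, which is the identity SST = SSR + SSE used in the text.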

2. 'F' TEST

The regression sum of squares can be used to give some indication of whether or not the model is an adequate explanation of the true situation. One test is the F ratio,

$$F = \frac{SSR / k}{SSE / (n - k - 1)}$$

with k and n - k - 1 degrees of freedom. The tabulated value of F at the prescribed confidence level can be read from F tables. The calculated F must exceed the tabulated F, in which case the variation in Y is explained by the model and is not due to chance.

3. 't' TEST

The t-statistic indicates the significance (or not) of the regression coefficient of each independent variable. Independent variables which have a t value less than the tabulated t value at the relevant degrees of freedom do not have a significant relationship with the dependent variable and therefore contribute nothing to the equation. If t3 calculated for a parameter is 2.543 and t at the 90% level at (n - k - 1) degrees of freedom is 3.36 from tables, i.e. t3 calculated is less than t tabulated, the coefficient does not differ significantly from zero. Hence, the third variable can be dropped and other combinations tried.

PRECAUTIONS

The following precautions are to be taken in the development of linear regression models:

a) Independent variables should not be intercorrelated.
b) All variables should be measurable and capable of clear interpretation.
c) The size of the regression intercept in relation to the mean of the dependent variable (ȳ) is to be small.
d) Signs must be logical.

4. DEVELOPMENT OF LINEAR REGRESSION MODELS

In the present investigation the water samples were analysed to determine 13 different chemical properties: sodium, calcium, potassium, magnesium, chlorides, bicarbonates including carbonates, sulphates, nitrates, fluorides, hardness, total dissolved solids (TDS), sum of ions and pH. Since TDS is the single parameter which could reflect the influence of all the dissolved constituents, it is desirable to develop a model by means of which TDS could be predicted or computed given all or a few of the twelve independent chemical constituents. In other words, it is hypothesised that

$$TDS = f(Na, Ca, K, Mg, Cl, (HCO_3 + CO_3), SO_4, NO_3, F)$$

There could be several types of functional forms, but multiple linear regression is the most powerful and easily explainable model available in the literature. For successful development of such a model the causal variables must be truly independent of each other, as explained earlier. For this purpose factor analysis was performed on the test battery and it was found that there are four basic dimensions which are truly uncorrelated. This is shown in Table 2. It is now proposed to utilise the result in selecting variables which are truly uncorrelated, with a view to developing a number of regression models for possible selection. All the possible combinations were tried in this case. Regression models were developed with the help of a FORTRAN package and the models were examined for R², F, t and intercept statistics. Those models which fail to satisfy any one of these statistics are considered poor and hence rejected. It was interesting to note that none of the models which contained nitrates, sulphates or pH was found to be satisfactory. As a result of this experience it was decided to repeat the programme with a new set of independent variables. The tables containing these programmes, and the tables exhibiting the regression coefficients and the various statistics for interpretation purposes, are lengthy and are not included in view of the space restrictions in the paper.

Table 2: Variables recognisable to be associated with each of the factors

Factor   June 1981      October 1981      December 1981   February 1982   May 1982
I        2,6,11,12,13   2,6,10,11,12,13   2,6,11,12,13    2,6,11,12,13    2,6,10,11,12,13
II       3,4,5          7,9               1,7             4,5             1
III      9,10           4,5               3,4             9,10            8

Factor IV (entries in source order; season columns not recoverable after June 1981): 1,8; 1,8; 8; 3,4,5
Factor V (entries in source order): 7; 3,7; remaining entries illegible

The recommended models based on the final multiple regression analysis are as follows.

June 1981 (premonsoon and summer):
1. TDS = 4.54 Na + 406, say 4.5 Na + 400
2. TDS = 3.-3 Cl + 379, say 3 Cl + 380
3. TDS = 4.56 Na + 35.3 K + 203, say 4.5 Na + 35 K + 200
4. TDS = 4.52 Na + 21.6 Mg + 7.8, say 4.5 Na + 22 Mg + 8
5. TDS = 4.4 Na + 6.28 Ca + 105, say 4.4 Na + 6.25 Ca + 100
6. TDS = 2.97 Cl + 1.5 HCO3 - 72.8, say 3 Cl + 1.5 HCO3 - 75

October 1981 (postmonsoon):
1. TDS = 1.86 Na + 17.86 Mg + 416, say 1.80 Na + 18 Mg + 400

December 1981 (winter):
1. TDS = 3.28 Na + 414, say 3.3 Na + 400
2. TDS = 2.15 Cl + 457.7, say 2.15 Cl + 460
3. TDS = 3.04 Na + 4.31 Ca + 243, say 3 Na + 4.3 Ca + 250
4. TDS = 2.06 Cl + 1.2 HCO3 + 160, say 2 Cl + 1.2 HCO3 + 160
5. TDS = 3.22 Na + 14.22 Mg + 185, say 3.2 Na + 14 Mg + 190

February 1982 (late winter):
1. TDS = 3.9 Na + 280
2. TDS = 2.4 Cl + 380
3. TDS = 2.16 Hardness + 300
4. TDS = 3.75 Ca + 17.5 Mg + 40
5. TDS = 3.5 Na + 5.5 Ca

May 1982 (summer):
1. TDS = 4.150 Na + 380
2. TDS = 2.5 Cl + 435
3. TDS = 3.75 Na + 5.5 Ca + 7
4. TDS = 3.75 Na + 5.5 Ca + 7
5. TDS = 2.5 Na + 1.6 HCO3 - 60

In order to predict water quality throughout the year, regression analysis was also conducted on the data from the 290 samples collected over the five seasons, and the following models have been found satisfactory.

TOTAL ANNUAL DATA:
1. TDS = 3.25 Na + 480
2. TDS = 2.25 Cl + 480
3. TDS = 3.2 Na + 18.2 K + 340
4. TDS = 3.00 Na + 5.75 Ca + 140
5. TDS = 6.40 Mg + 2.20 Cl + 380
6. TDS = 2.20 Cl + 0.35 HCO3 + 380
7. TDS = 3.15 Na + 18.5 Mg + 175
8. TDS = 6 Mg + 21.5 Cl + 0.334 HCO3 + 295
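The screening that each candidate model had to pass (calculated F and t compared with tabulated values) can be sketched for a single-predictor model as follows. This is a minimal illustration under stated assumptions, not the authors' FORTRAN package: the function name `screen_model`, the chloride and TDS values, and the choice of the 95% level are all hypothetical. The tabulated values are passed in by the caller, as they would be when read from printed F and t tables.

```python
import math


def screen_model(x, y, f_table, t_table):
    """Compute R2, F and t for y = b0 + b1*x and compare F and t with table values."""
    n, k = len(x), 1                      # k = number of predictors
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    sst = sum((yi - y_mean) ** 2 for yi in y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = sst - sse
    mse = sse / (n - k - 1)               # error mean square
    f_calc = (ssr / k) / mse              # F = (SSR/k) / (SSE/(n-k-1))
    t_calc = b1 / math.sqrt(mse / sxx)    # t for the regression coefficient
    accepted = f_calc > f_table and abs(t_calc) > t_table
    return {"b0": b0, "b1": b1, "R2": ssr / sst,
            "F": f_calc, "t": t_calc, "accepted": accepted}


# Hypothetical chloride and TDS values, for illustration only
chloride = [50.0, 120.0, 180.0, 260.0, 340.0, 420.0]
tds = [600.0, 740.0, 890.0, 1060.0, 1250.0, 1400.0]

# Tabulated values for 1 and 4 degrees of freedom at the 95% level (two-sided t)
result = screen_model(chloride, tds, f_table=7.71, t_table=2.776)
print(result)
```

With a single predictor the F ratio equals the square of the t-statistic, so the two tests agree; with several predictors each coefficient has its own t value and a variable that fails the t test is dropped, as described in section 3.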

On the whole, it appears that sodium, either independently or in association with K, Mg, Ca and bicarbonates, is the causal variable for the determination of TDS. Alternatively, chlorides, either independently or in association with bicarbonates and Mg, also appear to be important variables contributing to TDS in the water. With the help of these models it is now possible to predict TDS in any water given one predictor value. One important application of this analysis is that the entire water quality can be predicted through a single simple test such as specific conductance, which in turn clearly indicates the TDS.
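Applying one of the annual models is plain arithmetic, sketched below. The coefficients are transcribed from annual models 1, 2 and 4 listed above; the dictionary name `ANNUAL_MODELS`, the function `predict_tds` and the sample concentrations are hypothetical, and the sketch assumes the concentrations are expressed in the same units as the data used to fit the models (the units are not stated in this text).

```python
# Annual models 1, 2 and 4 as (coefficients by ion, intercept)
ANNUAL_MODELS = {
    "Na": ({"Na": 3.25}, 480.0),                   # TDS = 3.25 Na + 480
    "Cl": ({"Cl": 2.25}, 480.0),                   # TDS = 2.25 Cl + 480
    "Na+Ca": ({"Na": 3.00, "Ca": 5.75}, 140.0),    # TDS = 3.00 Na + 5.75 Ca + 140
}


def predict_tds(model_name, concentrations):
    """Evaluate the chosen linear model for one water sample."""
    coefficients, intercept = ANNUAL_MODELS[model_name]
    return intercept + sum(coef * concentrations[ion]
                           for ion, coef in coefficients.items())


# Hypothetical sample, for illustration only
sample = {"Na": 100.0, "Cl": 120.0, "Ca": 40.0}
for name in ANNUAL_MODELS:
    print(name, predict_tds(name, sample))
```

Comparing the predictions of several models for the same sample gives a rough consistency check before relying on any single predictor.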