Regression

Bivariate linear regression: estimation of the linear function (straight line) describing the linear component of the joint relationship between two variables, x and y. Generally, y is described as a function of x. Covariance and correlation measure the strength of the relationship between x and y: covariance depends on the units of x and y, while correlation is dimensionless. Thus the regression line and the correlation (or covariance) provide complementary information about the joint distribution.
Linear components of relationships: represented by linear equations (polynomials of order 1). Two parameters (pieces of information) are needed to uniquely specify a line: two points, or one point and the slope. Two comparable formulations:

Geometric: y = mx + b
Statistical: y = b0 + b1 x
(1) Geometric formulation: y = mx + b

m = slope of the line (change in y relative to change in x). It can be determined from any two points (x_i, y_i) and (x_j, y_j) lying on the line:

    m = (y_j - y_i) / (x_j - x_i)

b = y-intercept, the value of y when x = 0. If the slope m is known, the intercept can be found by substituting m and the coordinates of any point on the line into the equation and solving for b:

    b = y_i - m x_i
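The two-point construction above can be sketched in a few lines of Python (a minimal illustration; the function name line_through is ours, not from the source):

```python
def line_through(p1, p2):
    """Slope and intercept of the line y = m*x + b through two points."""
    (xi, yi), (xj, yj) = p1, p2
    m = (yj - yi) / (xj - xi)   # slope: change in y relative to change in x
    b = yi - m * xi             # intercept: solve yi = m*xi + b for b
    return m, b

m, b = line_through((1, 3), (3, 7))   # m = 2.0, b = 1.0
```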
(2) Statistical formulation: y = b0 + b1 x

b0 = y-intercept; b1 = slope. This form extrapolates to polynomials of higher order:

    y = b0 + b1 x + b2 x^2 + ... + bk x^k

The statistics b0 and b1 are sample estimates of the population parameters β0 and β1. They are estimated with error, from a sample of individuals that don't lie exactly on a straight line. Express individual i's deviation from the line as residual variation, ε_i: the part that doesn't fit the linear model. It comprises two sources of variation: (a) individual biological variation, and (b) measurement error.
[Figure: scatterplot of the response (dependent) variable y against the predictor (independent) variable x, showing the true linear relationship (true intercept β0, true slope β1), the mean response at each x, and residual variation around the line (mean residual = 0).]
There are an infinite number of different regression models, corresponding to differing assumptions about the direction of residual variation. The two most common models are: (1) predictive (type I) regression, and (2) major-axis (type II) regression.
Predictive regression

(1a) Predictive regression of y on x: the true population relationship is assumed to be

    y_i = β0 + β1 x_i + ε_i

All residual variation is assumed to be concentrated in y, and none in x (vertical residuals). This model arises from analysis of variance (ANOVA). Purpose: test whether the means of two or more groups (k groups) differ significantly. Groups (x) are assumed to be known, with no error. Within-group residual variation (in y) is assumed to be normally distributed, and all groups are assumed to have the same amount of variation.
Step from ANOVA to regression: assume that groups can be represented by the (exact) value of a predictor (independent) variable, x, and ask whether there is a linear trend in the response (y) among groups. This is appropriate in experimental studies in which the predictor variable is set by the investigator, with no error.

[Figure: frequency distributions of the response within each group, arranged along the predictor axis.]
Thus x and y have different roles in the regression analysis. x is the predictor or independent variable: assigned values independently of the model, with no error. y is the response or dependent variable: assumed to be dependent on the value of x, with some individual (residual) variation in response.

Estimation:

    b1 = s_xy / s_x²,    b0 = ȳ − b1 x̄

[Figure: two scatterplots with the same fitted slope b1 but different correlations r, illustrating that slope and correlation convey different information.]
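The estimation formulas can be sketched in plain Python (a minimal sketch; the helper names mean, cov, and predictive_yx and the sample data are ours; cov uses the sample denominator n − 1):

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance (denominator n - 1); cov(v, v) is the variance."""
    ub, vb = mean(u), mean(v)
    return sum((a - ub) * (c - vb) for a, c in zip(u, v)) / (len(u) - 1)

def predictive_yx(x, y):
    """Least-squares fit of y = b0 + b1*x (all residual variation in y)."""
    b1 = cov(x, y) / cov(x, x)    # b1 = s_xy / s_x^2
    b0 = mean(y) - b1 * mean(x)   # the line passes through (x-bar, y-bar)
    return b0, b1

b0, b1 = predictive_yx([1, 2, 3, 4, 5], [4, 6, 10, 8, 12])  # ≈ (2.6, 1.8)
```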
(1b) Predictive regression of x on y: reverse the roles of x and y. The true population relationship is then

    x_i = γ0 + γ1 y_i + δ_i

All residual variation is assumed to be concentrated in x, and none in y (horizontal residuals).

Estimation:

    d1 = s_xy / s_y²,    d0 = x̄ − d1 ȳ
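The reversed fit is the same computation with x and y swapped (again a sketch; names and data are illustrative):

```python
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance (denominator n - 1); cov(v, v) is the variance."""
    ub, vb = mean(u), mean(v)
    return sum((a - ub) * (c - vb) for a, c in zip(u, v)) / (len(u) - 1)

def predictive_xy(x, y):
    """Least-squares fit of x = d0 + d1*y (all residual variation in x)."""
    d1 = cov(x, y) / cov(y, y)    # d1 = s_xy / s_y^2
    d0 = mean(x) - d1 * mean(y)
    return d0, d1

d0, d1 = predictive_xy([1, 2, 3, 4, 5], [4, 6, 10, 8, 12])  # ≈ (-0.6, 0.45)
```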
Major-axis regression

(2) Major-axis regression: the two predictive-regression models above are extreme cases. Predictive regression of y on x assumes that var(δ_i) = 0, i.e., that all residual variation is in y. Predictive regression of x on y assumes that var(ε_i) = 0, i.e., that all residual variation is in x. Instead of assuming that all residual variation ("error") is in either x or y, assume that both variables have some residual variation. If the amounts of residual variation were known, their relative magnitudes could be expressed as a ratio:

    λ = var(ε_i) / var(δ_i)
So, for any value of λ = var(ε_i) / var(δ_i), we can estimate the slope of a unique corresponding regression line:

    m = [ s_y² − λ s_x² + √( (s_y² − λ s_x²)² + 4 λ s_xy² ) ] / (2 s_xy)

with intercept b = ȳ − m x̄ as before. This general case is called functional regression.
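The general slope can be sketched as a function of the error-variance ratio λ (a minimal Python sketch; function names and data are ours). As λ → ∞ the formula recovers the predictive slope of y on x, and at λ = 0 it recovers 1/d1, the x-on-y relationship expressed as change in y per unit x:

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance (denominator n - 1); cov(v, v) is the variance."""
    ub, vb = mean(u), mean(v)
    return sum((a - ub) * (c - vb) for a, c in zip(u, v)) / (len(u) - 1)

def functional_slope(x, y, lam):
    """Functional-regression slope for error-variance ratio
    lam = var(residual in y) / var(residual in x)."""
    sxx, syy, sxy = cov(x, x), cov(y, y), cov(x, y)
    d = syy - lam * sxx
    return (d + math.sqrt(d * d + 4 * lam * sxy * sxy)) / (2 * sxy)

x = [1, 2, 3, 4, 5]
y = [4, 6, 10, 8, 12]
# lam = 0 gives s_y^2/s_xy = 1/d1; very large lam approaches s_xy/s_x^2 = b1;
# lam = 1 gives the major-axis slope.
```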
There are an infinite number of functional-regression lines, corresponding to different values of λ. In general, we don't know how much residual variation is in y versus x. If x and y are being treated equivalently, a reasonable assumption is that they have approximately equal amounts of residual variation:

    var(ε_i) ≈ var(δ_i), so λ = var(ε_i) / var(δ_i) = 1

The slope then reduces to:

    m = [ s_y² − s_x² + √( (s_y² − s_x²)² + 4 s_xy² ) ] / (2 s_xy)

This special case is major-axis regression: the regression line is equivalent to the major axis of the confidence ellipse for the data.
The major-axis regression line is always intermediate in slope between the predictive regression of y on x and that of x on y (d1 is converted to its reciprocal, 1/d1, so that all three slopes describe change in y relative to change in x). The geometric mean of the two predictive slopes, √(b1 · 1/d1) = s_y/s_x, gives the slope of the closely related reduced major-axis (geometric-mean) regression, which is likewise intermediate between them.
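A quick numeric check of these relationships on an arbitrary small sample (a sketch; the data are made up for illustration):

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance (denominator n - 1); cov(v, v) is the variance."""
    ub, vb = mean(u), mean(v)
    return sum((a - ub) * (c - vb) for a, c in zip(u, v)) / (len(u) - 1)

x = [1, 2, 3, 4, 5]
y = [4, 6, 10, 8, 12]
sxx, syy, sxy = cov(x, x), cov(y, y), cov(x, y)

b1 = sxy / sxx            # predictive slope of y on x
d1 = sxy / syy            # predictive slope of x on y
ma = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)

# The major-axis slope lies between the two predictive slopes...
assert min(b1, 1 / d1) <= ma <= max(b1, 1 / d1)
# ...while their geometric mean equals s_y/s_x (the reduced-major-axis slope).
assert math.isclose(math.sqrt(b1 / d1), math.sqrt(syy / sxx))
```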
Predictive and major-axis regressions

[Figures: scatterplots showing the predictive regression of y on x, the major-axis (MA) regression, and the predictive regression of x on y fitted to samples with high and low correlations r, illustrating how the three lines diverge as r decreases.]
Major-axis regression is the bivariate case of principal component analysis. It corresponds to the case in which residuals are assumed to be orthogonal (perpendicular) to the regression line, and it can be extended to more than two variables.
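The connection to principal component analysis can be sketched directly: the major-axis slope is the direction of the leading eigenvector of the 2×2 sample covariance matrix (a pure-Python sketch, using the closed-form eigendecomposition of a symmetric 2×2 matrix; function names and data are ours):

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance (denominator n - 1); cov(v, v) is the variance."""
    ub, vb = mean(u), mean(v)
    return sum((a - ub) * (c - vb) for a, c in zip(u, v)) / (len(u) - 1)

def major_axis_slope(x, y):
    """Slope of the first principal component (major axis) of the
    2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]]."""
    sxx, syy, sxy = cov(x, x), cov(y, y), cov(x, y)
    # larger eigenvalue of the symmetric 2x2 covariance matrix
    lam1 = 0.5 * (sxx + syy + math.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2))
    # its eigenvector (sxy, lam1 - sxx) points along the major axis
    return (lam1 - sxx) / sxy
```

For the toy sample x = [1, 2, 3, 4, 5], y = [4, 6, 10, 8, 12] this returns the same value as the major-axis slope formula above, ≈ 2.135.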