Forestry 430 Advanced Biometrics and FRST 533 Problems in Statistical Methods Course Materials 2010
Forestry 430 Advanced Biometrics and FRST 533 Problems in Statistical Methods Course Materials 2010

Instructor: Dr. Valerie LeMay, Forest Sciences 039

Course Objectives and Overview:

The objectives of this course are:
1. To be able to use simple linear and multiple linear regression to fit models using sample data;
2. To be able to design and analyze lab and field experiments;
3. To be able to interpret results of model fitting and experimental analysis; and
4. To be aware of other analysis methods not explicitly covered in this course.

In order to meet these objectives, background theory and examples will be used. A statistical package called SAS will be used in examples, and used to help in analyzing data in exercises. Texts are also important, both to increase understanding while taking the course, and as a reference for future applied and research work.

Course Content Materials:

These cover most of the course materials. However, changes will be made from year to year, including additional examples. Any additional course materials will be given as in-class handouts. NOTE: Items given in italics are only described briefly in this course. These course materials will be presented in class and are essential for the course. These materials are not published and should not be used as citations for papers. Recommendations for some published reference materials, including the textbook for the course, will be listed in the course outline handed out in class.

I. Short Review of Probability and Statistics (pp. 9-37)
- Descriptive statistics
- Inferential statistics using known probability distributions: normal, t, F, Chi-square, binomial, Poisson

II. Fitting Equations (pp. )
- Dependent variable (y) and predictor variables (x's)
- Purpose: prediction and examination
- General examples
- Simple linear, multiple linear, and nonlinear regression
- Objectives in fitting: least squared error or maximum likelihood

Simple Linear Regression (SLR) (pp. 4-96)
- Definition, notation, and example uses
  - dependent variable (y) and predictor variable (x)
  - intercept, slope, and error
- Least squares solution to finding an estimated intercept and slope
  - Derivation
  - Normal equations
  - Examples
- Assumptions of simple linear regression and properties when assumptions are met
  - Residual plots to visually check the assumptions that:
    1. Relationship is linear (MOST IMPORTANT!!)
    2. Equal variance of y around x (equal spread of errors around the line)
    3. Observations are independent (not correlated in space nor time)
  - Normality plots to check the assumption that:
    4. Normal distribution of y around x (normal distribution of errors around the line)
  - Sampling and measurement assumptions:
    5. x values are fixed
    6. random sampling of y occurs for every x
- Transformations and other measures to meet assumptions
  - Common transformations for nonlinear trends, unequal variances, percents; rank transformation
  - Outliers: unusual observations
  - Other methods: nonlinear least squares, weighted least squares, general least squares, general linear models
- Measures of goodness-of-fit
  - Graphs
  - Coefficient of determination (r²) [and Fit Index, I²]
  - Standard error of the estimate (SEE) [and SEE']
- Estimated variances, confidence intervals and hypothesis tests
  - For the equation
  - For the intercept and slope
  - For the mean of the dependent variable given a value for x
  - For a single or group of values of the predicted dependent variable given a value for x
- Selecting among alternative models
- Process to fit an equation using least squares regression
  - Meeting assumptions
  - Measures of goodness-of-fit: graphs, coefficient of determination (r²) or I², and standard error of the estimate (SEE) or SEE'
  - Significance of the regression
  - Biological or logical basis and cost

Multiple Linear Regression (pp. )
- Definition, notation, and example uses
  - dependent variable (y) and predictor variables (x's)
  - intercept, slopes, and error
- Least squares solution to finding an estimated intercept and slopes
  - Least squares and comparison to maximum likelihood estimation
  - Derivation
  - Linear algebra to obtain normal equations; matrix algebra
  - Examples: calculations and SAS outputs
- Assumptions of multiple linear regression
  - Residual plots to visually check the assumptions that:
    1. Relationship is linear (with ALL x's, not each x, necessarily); MOST IMPORTANT!!
    2. Equal variance of y around the x's (equal spread of errors around the surface)
    3. Observations are independent (not correlated in space nor time)
  - Normality plots to check the assumption that:
    4. Normal distribution of y around the x's (normal distribution of errors around the surface)
  - Sampling and measurement assumptions:
    5. x values are fixed
    6. random sampling of y occurs for every combination of x values
  - Properties when all assumptions are met versus some are not met
- Transformations and other measures to meet assumptions: same as for SLR, but more difficult to select correct transformations
- Measures of goodness-of-fit
  - Graphs
  - Coefficient of multiple determination (R²) [and Fit Index, I²]
  - Standard error of the estimate (SEE) [and SEE']
- Estimated variances, confidence intervals and hypothesis tests: calculations and SAS outputs
  - For the regression surface
  - For the intercept and slopes
  - For the mean of the dependent variable given a particular value for each of the x variables
  - For a single or group of values of the predicted dependent variable given a particular value for each of the x variables
- Adding class variables as predictors
  - Dummy variables to represent a class variable
  - Interactions to change slopes for different classes
  - Comparing two regressions for different class levels
  - More than one class variable (class variables as the dependent variable are covered in FRST 530, under generalized linear models)
- Methods to aid in selecting predictor (x) variables
  - All possible regressions; R² criterion in SAS
  - Stepwise methods
- Selecting and comparing alternative models
  - Meeting assumptions
  - Parsimony and cost
  - Biological nature of the system modeled
  - Measures of goodness-of-fit: graphs, coefficient of determination (R²) [or Fit Index, I²], and standard error of the estimate (SEE) [or SEE']
  - Comparing models when some models have a transformed dependent variable
  - Other methods using maximum likelihood criteria

III. Experimental Design and Analysis (pp. 74-9)
- Sampling versus experiments
- Definitions of terms: experimental unit, response variable, factors, treatments, replications, crossed factors, randomization, sum of squares, degrees of freedom, confounding
- Variations in designs: number of factors, fixed versus random effects, blocking, split-plot, nested factors, subsampling, covariates
- Designs in use
- Main questions in experiments

Completely Randomized Design (CRD) (pp. )
- Definition: no blocking and no splitting of experimental units

One Factor Experiment, Fixed Effects (pp. )
- Main questions of interest
- Notation and example: observed response, overall (grand) mean, treatment effect, treatment means
- Data organization and preliminary calculations: means and sums of squares
- Test for differences among treatment means: error variance, treatment effect, mean squares, F-test
- Assumptions regarding the error term: independence, equal variance, normality; expected values under the assumptions
- Differences among particular treatment means
- Confidence intervals for treatment means
- Power of the test
- Transformations if assumptions are not met
- SAS code

Two Factor Experiment, Fixed Effects (pp. )
- Introduction: separating treatment effects into factor 1, factor 2, and the interaction between these
- Example layout
- Notation, means and sums of squares calculations
- Assumptions, and transformations
- Test for interactions and main effects: ANOVA table, expected mean squares, hypotheses and tests, interpretation
- Differences among particular treatment means
- Confidence intervals for treatment means
- SAS analysis for example

One Factor Experiment, Random Effects
- Definition and example
- Notation and assumptions
- Least squares versus maximum likelihood solution

Two Factor Experiment, One Fixed and One Random Effect (pp. )
- Introduction
- Example layout
- Notation, means and sums of squares calculations
- Assumptions, and transformations
- Test for interactions and main effects: ANOVA table, expected mean squares, hypotheses and tests, interpretation
- SAS code
- Orthogonal polynomials (not covered)
Restrictions on Randomization (pp. )

Randomized Block Design (RCB) with one fixed factor (pp. )
- Introduction, example layout, data organization, and main questions
- Notation, means and sums of squares calculations
- Assumptions, and transformations
- Differences among treatments: ANOVA table, expected mean squares, hypotheses and tests, interpretation
- Differences among particular treatment means
- Confidence intervals for treatment means
- SAS code

Randomized Block Design with other experiments (pp. )
- RCB with replicates in each block
- Two fixed factors
- One fixed, one random factor

Incomplete Block Design
- Definition
- Examples

Latin Square Design: restrictions in two directions (pp. )
- Definition and examples
- Notation and assumptions
- Expected mean squares
- Hypotheses and confidence intervals for main questions if assumptions are met

Split Plot and Split-Split Plot Design (pp. )
- Definition and examples
- Notation and assumptions
- Expected mean squares
- Hypotheses and confidence intervals for main questions if assumptions are met

Nested and hierarchical designs (pp. )
- CRD: Two Factor Experiment, Both Fixed Effects, with Second Factor Nested in the First Factor (pp. )
  - Introduction using an example
  - Notation
  - Analysis methods: averages, least squares, maximum likelihood
  - Data organization and preliminary calculations: means and sums of squares
  - Example using SAS
- CRD: One Factor Experiment, Fixed Effects, with sub-sampling (pp. )
  - Introduction using an example
  - Notation
  - Analysis methods: averages, least squares, maximum likelihood
  - Data organization and preliminary calculations: means and sums of squares
  - Example using SAS
- RCB: One Factor Experiment, Fixed Effects, with sub-sampling (pp. )
  - Introduction using an example
  - Example using SAS

Adding Covariates (continuous variables) (pp. )
- Analysis of covariance
  - Definition and examples
  - Notation and assumptions
  - Expected mean squares
  - Hypotheses and confidence intervals for main questions if assumptions are met
- Allowing for inequality of slopes

Expected Mean Squares: Method to Calculate These (pp. )
- Method and examples

Power Analysis (pp. )
- Concept and an example

Use of Linear Mixed Models for Experimental Design (pp. )
- Concept and examples

Summary (pp. )
Probability and Statistics Review

Population vs. sample:
- N = number of observations in the population
- n = number of observations in the sample

Experimental vs. observational studies: In experiments, we manipulate the conditions, whereas in observational studies we simply measure what is already there. Therefore, in experiments, we try to assign cause and effect.

Variable of interest / dependent variable / response variable / outcome: y
Auxiliary variables / explanatory variables / predictor variables / independent variables / covariates: x

Observations: measure y's and x's for a census (all N) or on a sample (n out of the N).

x and y can be: 1) continuous (ratio or interval scale); or 2) discrete (nominal or ordinal scale).

Descriptive statistics: summarize the sample data as means, variances, ranges, etc.
Inferential statistics: use the sample statistics to estimate the parameters of the population.
Parameters for populations:

1. Mean, μ_y. E.g., for N = 4 and y1 = 5, y2 = 6, y3 = 7, y4 = 6: μ_y = 6.
2. Range: maximum value minus minimum value.
3. Standard deviation, σ_y, and variance, σ_y²:
   σ_y² = Σ_{i=1..N} (y_i − μ_y)² / N
4. Covariance between x and y:
   σ_xy = Σ_{i=1..N} (y_i − μ_y)(x_i − μ_x) / N
5. Correlation (Pearson's) between two variables, y and x:
   ρ_xy = σ_xy / (σ_x σ_y)
   Ranges from −1 to +1, with strong negative correlations near −1 and strong positive correlations near +1.
6. Distribution for y: frequency of each value of y or x (may be divided into classes).
7. Probability distribution of y or x: the probability associated with each value.
8. Mode: most common value of y or x.
9. Median: y-value or x-value which divides the distribution (50% of the N observations are above and 50% are below).
Example: 250 aspen trees of Alberta

Descriptive statistics for age (N = 250 trees):
- Mean: 71 years
- Median: 73 years
- 25% percentile: 55
- 75% percentile: 81
- Minimum: 14
- Maximum: 160
- Variance: 54.7
- Standard deviation: 12.69

Compare mean versus median. Normal distribution?

Pearson correlation of age and dbh for the population of N = 250 trees?
Statistics from the sample:

1. Mean, ȳ. E.g., for n = 3 and y1 = 5, y2 = 6, y3 = 7: ȳ = 6.
2. Range: maximum value minus minimum value.
3. Standard deviation, s_y, and variance, s_y²:
   s_y² = Σ_{i=1..n} (y_i − ȳ)² / (n − 1)
4. Standard deviation of the sample means (also called the standard error, short for standard error of the mean), and its square, the variance of the sample means, are estimated by:
   s_ȳ = s_y / √n  and  s_ȳ² = s_y² / n
5. Coefficient of variation (CV): the standard deviation from the sample, divided by the sample mean. May be multiplied by 100 to get CV in percent.
6. Covariance between x and y:
   s_xy = Σ_{i=1..n} (y_i − ȳ)(x_i − x̄) / (n − 1)
7. Correlation (Pearson's) between two variables, y and x:
   r_xy = s_xy / (s_x s_y)
   Ranges from −1 to +1, with strong negative correlations near −1 and strong positive correlations near +1.
8. Distribution for y: frequency of each value of y or x (may be divided into classes).
9. Estimated probability distribution of y or x: the probability associated with each value, based on the n observations. Example: n = 50.
10. Mode: most common value of y or x.
11. Median: y-value or x-value which divides the estimated probability distribution (50% of the n observations are above and 50% are below).
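The sample statistics listed above can be computed in a few lines. The course's worked examples use SAS; the sketch below is an illustrative Python version with made-up values (not the aspen data), using only the standard library.

```python
import math
from statistics import mean, stdev

# Hypothetical sample (illustrative values, not the course data)
y = [5.0, 6.0, 7.0]
x = [2.0, 4.0, 6.0]
n = len(y)

ybar = mean(y)                      # sample mean
s_y = stdev(y)                      # sample standard deviation (n - 1 divisor)
se_mean = s_y / math.sqrt(n)        # standard error of the mean
cv_pct = 100 * s_y / ybar           # coefficient of variation, in percent

# Sample covariance and Pearson correlation between y and x
xbar = mean(x)
s_xy = sum((yi - ybar) * (xi - xbar) for yi, xi in zip(y, x)) / (n - 1)
r_xy = s_xy / (stdev(x) * s_y)      # x and y are perfectly linear here, so r = 1

print(ybar, s_y, cv_pct, r_xy)
```

Note that `stdev` uses the n − 1 divisor, matching the sample (not population) formulas above.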
For the sample of n = 50 trees:
- Mean: 69 years
- Median: 68 years
- 25% percentile: 48
- 75% percentile: 81
- Minimum: 14
- Maximum: 160
- Variance: 246.2
- Standard deviation: 15.69 years
- Standard error of the mean: 2.2 years

Good estimates of the population values?

Pearson correlation of age and dbh for the sample of n = 50 trees from a population of 250 trees: 0.66. Null and alternative hypothesis for the p-value? What is a p-value?

Sample Statistics to Estimate Population Parameters:

If simple random sampling (every observation has the same chance of being selected) is used to select n from N, then:
- Sample estimates are unbiased estimates of their counterparts (e.g., the sample mean estimates the population mean), meaning that over all possible samples the sample statistics, averaged, would equal the population statistic.
- A particular sample value (e.g., a sample mean) is called a point estimate; it does not necessarily equal the population parameter for a given sample.
- We can calculate an interval where the true population parameter is likely to be, with a certain probability. This is a confidence interval, and can be obtained for any population parameter IF the distribution of the sample statistic is known.
Common continuous distributions:

Normal:
- Symmetric distribution around μ.
- Defined by μ and σ². If we know that a variable has a normal distribution, and we know these parameters, then we know the probability of getting any particular value for the variable.
- Probability tables are for μ = 0 and σ² = 1, and are often called z-tables. Examples: P(−1 < z < +1) ≈ 0.68; P(−1.96 < z < 1.96) = 0.95.
- Notation example: for α = 0.05, z_{1−α/2} = z_{0.975} = 1.96.
- z-scores: scale the values for y by subtracting the mean and dividing by the standard deviation:
  z_i = (y_i − μ) / σ
  E.g., for a mean of 20 and a standard deviation of 10, y = −30 gives z = −5.0 (an extreme value).
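The z-table probabilities and the z-score calculation above can be reproduced with Python's standard-library `statistics.NormalDist`; a small sketch (the mean of 20 and standard deviation of 10 are the illustrative values from the example above):

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)   # the standard normal ("z-table") distribution

# P(-1 < z < +1) and P(-1.96 < z < 1.96) from the cumulative distribution
p1 = z.cdf(1) - z.cdf(-1)         # about 0.68
p2 = z.cdf(1.96) - z.cdf(-1.96)   # about 0.95

def z_score(y, mu, sigma):
    # Scale y by subtracting the mean and dividing by the standard deviation
    return (y - mu) / sigma

print(round(p1, 2), round(p2, 2), z_score(-30, 20, 10))
```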
t-distribution:
- Symmetric distribution, with table values centered at 0.
- The spread varies with the degrees of freedom. As the sample size increases, the df increase, the spread decreases, and the distribution approaches the normal distribution.
- Used for a normally distributed variable whenever the variance of that variable is not known.
- Notation example: t_{n−1, 1−α}, where n − 1 is the degrees of freedom (in this case), and 1 − α is the percentile we are looking for. For example, for n = 5 and α = 0.05, we are looking for t with 4 degrees of freedom at the 0.95 percentile (a value around 2).

χ² distribution:
- Starts at zero, and is not symmetric.
- It is the square of a normally distributed variable; e.g., sample variances have a χ² distribution if the variable is normally distributed.
- Need the degrees of freedom and the percentile, as with the t-distribution.
F-distribution:
- The ratio of variables that each have a χ² distribution; e.g., the ratio of sample variances for variables that are each normally distributed.
- Need the percentile, and two degrees of freedom (one for the numerator and one for the denominator).

Central Limit Theorem: as n increases, the distribution of sample means will approach a normal distribution, even if the distribution of the variable is something else (e.g., could be non-symmetric).

Tables in the textbook: some tables give the values of the probability distribution for the degrees of freedom and for the percentile. Others give this for the degrees of freedom and for the alpha level (or sometimes alpha/2). Be careful in reading probability tables.

Confidence intervals for a single mean:
- Collect data and get point estimates:
  - the sample mean, ȳ, to estimate the population mean μ; will be unbiased;
  - the sample variance, s_y², to estimate the population variance σ_y²; will be unbiased.
- Can calculate interval estimates of each point estimate, e.g., a 95% confidence interval for the true mean, if:
  - the y's are normally distributed, OR
  - the sample size is large enough that the Central Limit Theorem holds, so ȳ will be normally distributed.
Sample variance (computational form), where n items are measured out of the N possible items (N is sometimes infinite; square each value, then sum over all items):

s_y² = [ Σ_{i=1..n} y_i² − (Σ_{i=1..n} y_i)² / n ] / (n − 1)

Variance of the sample mean:
- without replacement: s_ȳ² = (s_y² / n) × (1 − n/N)
- with replacement, or when N is very large: s_ȳ² = s_y² / n

Coefficient of variation (percent): CV = 100 × s_y / ȳ

95% confidence interval for the true mean of the population:

ȳ ± t_{n−1, 1−α/2} × s_ȳ
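As a numeric sketch of the confidence-interval formula (hypothetical plot volumes, not the course data; the t value is read from a t-table):

```python
import math
from statistics import mean, stdev

# Hypothetical sample of n = 4 plot volumes
y = [100.0, 110.0, 105.0, 125.0]
n = len(y)

ybar = mean(y)
s = stdev(y)
se = s / math.sqrt(n)       # with-replacement / large-N form of s_ybar

# 95% CI: ybar +/- t(n-1, 0.975) * se; t with 3 df is 3.182 (t-table)
t_crit = 3.182
half_width = t_crit * se

print(ybar, round(half_width, 2))
```

With only 4 observations the t value (3.182) is much larger than the normal 1.96, which is why using z in place of t understates the interval for small samples.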
Example: n = 4 plots, with volume per ha and average dbh recorded for each plot (worked in class):
- mean, variance, std. dev., and std. dev. of the mean: calculated from the 4 plot volumes
- t should be: t_{3, 0.975} = 3.182
- Actual 95% CI (+/−): t × std. dev. of the mean
- NOTE: the 95% (+/−) value EXCEL computes uses z rather than t: not correct!!!

Hypothesis Tests:
- We can hypothesize what the true value of any population parameter might be, and state this as the null hypothesis (H0:).
- We also state an alternate hypothesis (H1: or Ha:) that it is a) not equal to this value; b) greater than this value; or c) less than this value.
- Collect sample data to test this hypothesis.
- From the sample data, we calculate a sample statistic as a point estimate of this population parameter, and an estimated variance of the sample statistic.
- We calculate a test statistic using the sample estimates.
- Under H0, this test statistic will follow a known distribution. If the test statistic is very unusual compared to the tabular values for the known distribution, then H0 is very unlikely and we conclude H1.
Example for a single mean: We believe that the average weight of ravens in the Yukon is 1 kg.

H0: μ = 1 kg
H1: μ ≠ 1 kg

A sample of 20 birds is taken (HOW??) and each bird is weighed and released. The average bird weight is 0.8 kg, and the standard deviation was 0.20 kg. Assuming the bird weights follow a normal distribution, we can use a t-test (why not a z-test?).

Test statistic:  t = (ȳ − μ0) / s_ȳ,  where the standard error of the mean is s_ȳ = s_y / √n.

Aside: what is the CV?

Under H0, this will follow a t-distribution with df = n − 1. Find the value from the t-table and compare.

Mean? Variance? Conclude?
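The one-mean t-test above can be sketched numerically. The sample standard deviation is garbled in the source, so 0.20 kg below is an assumed value for illustration only:

```python
import math

# Raven example in the spirit of the notes
n = 20
ybar = 0.8     # sample mean weight (kg)
s = 0.20       # ASSUMED sample standard deviation (kg); value garbled in source
mu0 = 1.0      # hypothesized mean under H0 (kg)

se = s / math.sqrt(n)          # standard error of the mean
t_stat = (ybar - mu0) / se     # t statistic with n - 1 = 19 df

# Compare |t| against t(19, 0.975) = 2.093 from a t-table
print(round(t_stat, 2))
```

Here |t| is well beyond 2.093, so under these assumed numbers H0 would be rejected.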
The p-value: the probability of getting a test statistic as unusual as, or more unusual than, the one from the sample. NOTE: in EXCEL use: TDIST(x, df, tails).

Example: comparing two means. We believe that the average weight of male ravens differs from that of female ravens.

H0: μ1 = μ2 (or μ1 − μ2 = 0)
H1: μ1 ≠ μ2 (or μ1 − μ2 ≠ 0)

A sample of 20 birds is taken and each bird is weighed and released. 12 birds were males, with an average weight of 1.2 kg and a standard deviation of 0.20 kg. 8 birds were females, with an average weight of 0.8 kg and a standard deviation of 0.20 kg.

Means? Sample variances?
Test statistic (pooled variances):

t = (ȳ1 − ȳ2 − 0) / s_{ȳ1−ȳ2}

s_{ȳ1−ȳ2} = √[ s_p² (1/n1 + 1/n2) ],  where  s_p² = [ (n1 − 1) s1² + (n2 − 1) s2² ] / (n1 + n2 − 2)

Under H0, this will follow a t-distribution with df = n1 + n2 − 2. Find the t-value from the tables and compare, or use the p-value. Conclude?

Errors for Hypothesis Tests:

            H0 True             H0 False
  Accept    1 − α               β (Type II error)
  Reject    α (Type I error)    1 − β

- Type I error: reject H0 when it is true. The probability of this happening is α.
- Type II error: accept H0 when it is false. The probability of this happening is β.
- Power of the test: reject H0 when it is false. The probability of this is 1 − β.
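The pooled two-sample t statistic above can be computed directly. The group standard deviations are garbled in the source, so 0.20 kg is an assumed value for both groups here:

```python
import math

# Two-sample (pooled) t-test sketch for the raven example
n1, ybar1, s1 = 12, 1.2, 0.20   # males; SD is an ASSUMED value
n2, ybar2, s2 = 8, 0.8, 0.20    # females; SD is an ASSUMED value

# Pooled variance: ((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2)
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

# t statistic and its degrees of freedom
t_stat = (ybar1 - ybar2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(round(t_stat, 2), df)
```

Compare the result against t_{18, 0.975} = 2.101 from a t-table (or compute the p-value) to decide between H0 and H1.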
What increases power?
- Increased sample sizes, resulting in lower standard errors.
- A larger difference between the mean under H0 and the mean under H1.
- Increased alpha; this will decrease beta.

Fitting Equations

REF:

The idea is:
- y: the variable of interest (dependent variable), y_i; hard to measure.
- Easy-to-measure variables (predictor/independent) that are related to the variable of interest, labeled x_1i, x_2i, ..., x_mi.
- Measure y_i, x_1i, ..., x_mi for a sample of n items.
- Use this sample to estimate an equation that relates y_i (dependent variable) to x_1i, ..., x_mi (independent or predictor variables).
- Once the equation is fitted, one can then just measure the x's and get an estimate of y without measuring it. One can also examine relationships between variables.
Examples:
1. Percent decay_i; x_1i = log10(dbh)
2. log10(volume)_i; x_1i = log10(dbh), x_2i = log10(height)
3. Branch length_i; x_1i = relative height above ground, x_2i = dbh, x_3i = height

Objective: find estimates of β0, β1, β2, ..., βm such that the sum of squared differences between the measured y_i and the predicted y_i (usually labeled ŷ_i; values on the line or surface) is the smallest (minimize the sum of squared errors, called least squared error). OR: find estimates of β0, β1, β2, ..., βm such that the likelihood (probability) of getting these y values is the largest (maximize the likelihood). Finding the minimum of the sum of squared errors is often easier. In some cases, the two lead to the same estimates of the parameters.

Types of Equations:
- Simple linear equation: y_i = β0 + β1 x_i + ε_i
- Multiple linear equation: y_i = β0 + β1 x_1i + β2 x_2i + ... + βm x_mi + ε_i
- Nonlinear equation: takes many forms, for example: y_i = β0 + β1 x_1i^β2 x_2i^β3 + ε_i
Simple Linear Regression (SLR)

Population: y_i = β0 + β1 x_i + ε_i, with μ_{y|x} = β0 + β1 x
Sample: y_i = b0 + b1 x_i + e_i, with ŷ_i = b0 + b1 x_i and e_i = y_i − ŷ_i

- b0 is an estimate of β0 [intercept]
- b1 is an estimate of β1 [slope]
- ŷ_i is the predicted y; an estimate of the average of y for a particular x value
- e_i is an estimate of ε_i, called the error or the residual; it represents the variation in the dependent variable (the y) which is not accounted for by the predictor variable (the x)

Find b0 (the intercept; y_i when x_i = 0) and b1 (the slope) so that SSE = Σ e_i² (the sum of squared errors over all n sample observations) is the smallest (least squares solution).

The variables do not have to be in the same units. Coefficients will change with different units of measure.

Given estimates of b0 and b1, we can get an estimate of the dependent variable (the y) for ANY value of the x, within the range of x's represented in the original data.

Example: Tree height (m): hard to measure. Dbh (diameter at 1.3 m above ground, in cm): easy to measure. Use dbh squared for a linear equation, plotting height (y) against dbh squared (x).

For each observation, the difference between the measured y and the mean of y can be split into two parts:

(y_i − ȳ) = (y_i − ŷ_i) + (ŷ_i − ȳ)

- y_i − ŷ_i: difference between the measured and predicted y
- ŷ_i − ȳ: difference between the predicted y and the mean of y
Least Squares Solution: Finding the Set of Coefficients that Minimizes the Sum of Squared Errors

To find the estimated coefficients that minimize SSE for a particular set of sample data and a particular equation (form and variables):
1. Define the sum of squared errors (SSE) in terms of the measured minus the predicted y's (the errors);
2. Take partial derivatives of the SSE equation with respect to each coefficient;
3. Set these equal to zero (for the minimum) and solve all of the equations (using algebra or linear algebra).

For linear models (simple or multiple linear), there will be one solution. We can mathematically solve the set of partial derivative equations. The fitted line WILL ALWAYS GO THROUGH THE POINT DEFINED BY (x̄, ȳ), and will always result in Σ e_i = 0.

For nonlinear models, this is not possible and we must search to find a solution (covered in FRST 530). If we used the criterion of finding the maximum likelihood (probability) rather than the minimum SSE, we would need to search for a solution even for linear models (covered in FRST 530).
Least Squares Solution for SLR: find the set of estimated parameters (coefficients) that minimize the sum of squared errors:

min SSE = min Σ_{i=1..n} e_i² = min Σ_{i=1..n} [ y_i − (b0 + b1 x_i) ]²

Take partial derivatives with respect to b0 and b1, set them equal to zero, and solve:

∂SSE/∂b0 = −2 Σ (y_i − b0 − b1 x_i) = 0
  →  Σ y_i = n b0 + b1 Σ x_i
  →  b0 = ȳ − b1 x̄

∂SSE/∂b1 = −2 Σ x_i (y_i − b0 − b1 x_i) = 0
  →  Σ x_i y_i = b0 Σ x_i + b1 Σ x_i²

With some further manipulations:

b1 = Σ_{i=1..n} (y_i − ȳ)(x_i − x̄) / Σ_{i=1..n} (x_i − x̄)²  =  SPxy / SSx  =  s_xy / s_x²

where SPxy refers to the corrected sum of cross products for x and y, and SSx refers to the corrected sum of squares for x. [Class example]
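The normal-equation solution above can be checked numerically; here is a minimal Python sketch with made-up data (the course examples use SAS). It also verifies two properties derived above: the residuals sum to zero, and b0 = ȳ − b1 x̄ forces the line through (x̄, ȳ).

```python
# Least squares SLR via the corrected sums SPxy and SSx (illustrative data)
x = [2.0, 4.0, 6.0, 8.0]
y = [3.1, 4.9, 7.0, 9.0]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

SPxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
SSx = sum((xi - xbar) ** 2 for xi in x)

b1 = SPxy / SSx          # slope
b0 = ybar - b1 * xbar    # intercept

# Residuals e_i = y_i - (b0 + b1 x_i); their sum is zero for least squares
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(round(b0, 3), round(b1, 3))
```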
Properties of b0 and b1

b0 and b1 are least squares estimates of β0 and β1. Under the assumptions concerning the error term and the sampling/measurements, these are:
- Unbiased estimates: given many estimates of the slope and intercept for all possible samples, the average of the sample estimates will equal the true values.
- The variability of these estimates from sample to sample can be estimated from the single sample; these estimated variances will be unbiased estimates of the true variances (and standard errors).
- The estimated intercept and slope will be the most precise (most efficient, with the lowest variances) estimates possible (called "Best").
- These will also be the maximum likelihood estimates of the intercept and slope.

Assumptions of SLR

Once coefficients are obtained, we must check the assumptions of SLR. Assumptions must be met to:
- obtain the desired characteristics;
- assess goodness of fit (i.e., how well the regression line fits the sample data);
- test significance of the regression and other hypotheses;
- calculate confidence intervals and test hypotheses for the true coefficients (population);
- calculate confidence intervals for the mean predicted value given a set of x values (i.e., for the predicted y given a particular value of the x).

We need good estimates (unbiased or at least consistent) of the standard errors of the coefficients, and a known probability distribution, to test hypotheses and calculate confidence intervals.
Checking Assumptions Using Residual Plots

The assumptions of:
1. a linear relationship between the y and the x;
2. equal variance of errors; and
3. independence of errors (independent observations)
can be visually checked using RESIDUAL PLOTS.

A residual plot shows the residual (i.e., e_i = y_i − ŷ_i) as the y-axis and the predicted value (ŷ_i) as the x-axis. In a residual plot that meets the assumptions of a linear relationship and equal variance of the observations, the data points are evenly distributed about zero and there are no outliers (very unusual points that may be a measurement or entry error).

Residual plots can also indicate unusual points (outliers) that may be measurement errors, transcription errors, etc.
Examples of Residual Plots Indicating Failures to Meet Assumptions:

1. The relationship between the x's and y must be linear. If this is not met, the residual plot and the plot of y vs. x will show a curved line.

[Figure: plot of height (ht) against dbh squared showing a curved trend, and the corresponding residual plot of residuals against predicted height showing a curved band.]

Result: if this assumption is not met, the regression line does not fit the data well; biased estimates of the coefficients and of goodness of fit will occur.

2. The variance of the y values must be the same for every one of the x values. If this is not met, the spread around the line will not be even.

Result: if this assumption is not met, the estimated coefficients (slopes and intercept) will be unbiased, but the estimates of the standard deviations of these coefficients will be biased, so we cannot calculate CIs nor test the significance of the x variable. However, estimates of the coefficients of the regression line and of goodness of fit are still unbiased.
3. Each observation (i.e., x_i and y_i) must be independent of all other observations. In this case, we produce a different residual plot, where the residuals are on the y-axis as before, but the x-axis is the variable that is thought to produce the dependencies (e.g., time). If the assumption is not met, this revised residual plot will show a trend, indicating the residuals are not independent.

Result: if this assumption is not met, we cannot calculate CIs nor test the significance of the x variable. However, estimates of the coefficients of the regression line and of goodness of fit are still unbiased.

Normality Histogram or Plot

A fourth assumption of SLR is:

4. The y values must be normally distributed for each of the x values. A histogram of the errors and/or a normality plot can be used to check this, as well as tests of normality.

[Figure: histogram and boxplot of the residuals, roughly symmetric and centered on zero.]

Tests for normality:
H0: data are normal
H1: data are not normal
[SAS output: normality tests with their statistics and p-values: Shapiro-Wilk (W, Pr < W), Kolmogorov-Smirnov (D, Pr > D), Cramer-von Mises (W-Sq, Pr > W-Sq), and Anderson-Darling (A-Sq, Pr > A-Sq).]

[Figure: normal probability plot of the residuals.]

Result: if this assumption is not met, we cannot calculate CIs nor test the significance of the x variable, since we do not know what probabilities to use. Also, the estimated coefficients are no longer equal to the maximum likelihood solution.

Example: [Figure: four-panel residual diagnostics for volume versus dbh: normal plot of residuals, histogram of residuals, I chart of residuals, and residuals vs. fits.]
Measurements and Sampling Assumptions

The remaining assumptions are based on the measurements and the collection of the sample data.

5. The x values are measured without error (i.e., the x values are fixed). This can only be known if the process of collecting the data is known. For example, if tree diameters are very precisely measured, there will be little error. If this assumption is not met, the estimated coefficients (slopes and intercept) and their variances will be biased, since the x values are varying.

6. The y values are randomly selected for each value of the x variables (i.e., for each x value, a list of all possible y values is made, and some are randomly selected). For many biological problems, the observations will be gathered using simple random sampling or systematic sampling (a grid across the land area). This does not strictly meet this assumption. Also, for more complex sampling designs, such as multistage sampling (sampling large units and sampling smaller units within the large units), this assumption is not met. If the form of the equation is correct, then this does not cause problems. If not, the estimated equation will be biased.
Transformations

Common transformations:
- Powers: x³, x^0.5, etc., for relationships that look nonlinear.
- log10, loge: also for relationships that look nonlinear, or when the variances of y are not equal around the line.
- Sin⁻¹ [arcsine]: when the dependent variable is a proportion.
- Rank transformation, for non-normal data:
  - Sort the y variable
  - Assign a rank to each variable from 1 to n
  - Transform the rank to normal (e.g., Blom transformation)

PROBLEM: we lose some of the information in the original data. Try to transform x first and leave y as the variable of interest; however, this is not always possible. Use graphs to help choose transformations.

Outliers: Unusual Points

Check for points that are quite different from the others on:
- the graph of y versus x
- the residual plot

Do not simply delete the point, as it MAY BE VALID! Check:
- Is this a measurement error? E.g., a tree height of 100 m is very unlikely.
- Is it a transcription error? E.g., for an adult person, a weight of 10 lbs was entered rather than 100 lbs.
- Is there something very unusual about this point? E.g., a bird has a short beak because it was damaged.

Try to fix the observation. If it is very different from the others, or you know there is a measurement error that cannot be fixed, then delete it and indicate this in your research report. On the residual plot, an outlier CAN occur when the model is not correct: a transformation of the variable(s) may be needed, or an important variable is missing.
31 Other methods than SLR (and Multiple Linear Regression), for when transformations do not work (some covered in FRST 530):
Nonlinear least squares: a least squares solution for nonlinear models; uses a search algorithm to find the estimated coefficients; has good properties for large datasets; still assumes normality, equal variances, and independent observations
Weighted least squares: for unequal variances. Estimate the variances and use these in weighting the least squares fit of the regression; assumes normality and independent observations
Generalized linear model: used for distributions other than normal (e.g., binomial, Poisson, etc.), but with no correlation between observations; uses maximum likelihood
Generalized least squares and mixed models: use maximum likelihood for fitting models with unequal variances, correlations over space, or correlations over time, but normally distributed errors
Generalized linear mixed models: allows for unequal variances, correlations over space and/or time, and non-normal distributions; uses maximum likelihood

Measures of Goodness of Fit
How well does the regression fit the sample data? For simple linear regression, a graph of the original data with the fitted line marked on the graph indicates how well the line fits the data [not possible with MLR]. Two measures are commonly used: the coefficient of determination (r²) and the standard error of the estimate (SE_E). 61 62
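Of the alternatives listed above, weighted least squares has a simple closed form for one x. The sketch below is a Python illustration (the course itself uses SAS); the data are made up, and the weights 1/x stand in for estimated variances that grow with x.

```python
def wls_fit(xs, ys, ws):
    """Weighted least squares for y = b0 + b1*x: each squared error is
    weighted by w_i (commonly w_i = 1 / estimated variance_i)."""
    sw = sum(ws)
    # weighted means replace the ordinary means
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    b1 = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys)) / \
         sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    b0 = ybar - b1 * xbar
    return b0, b1

# made-up data where the spread of y appears to grow with x
xs = [1, 2, 3, 4, 5]
ys = [2.2, 3.9, 6.1, 8.4, 9.4]
ws = [1 / x for x in xs]   # hypothetical weights: downweight large-x points
b0, b1 = wls_fit(xs, ys, ws)
print(round(b0, 3), round(b1, 3))
```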
32 To calculate r² and SE_E, first calculate the SSE (this is what was minimized):

SSE = Σ ei² = Σ (yi − ŷi)² = Σ (yi − (b0 + b1·xi))²   (sums over i = 1 to n)

The sum of squared differences between the measured and estimated y's.

Calculate the sum of squares for y:

SSy = Σ (yi − ȳ)² = Σ yi² − (Σ yi)²/n = sy² (n − 1)

The sum of squared differences between the measured y's and the mean of the y-measures. NOTE: In some texts, this is called the sum of squares total.

Calculate the sum of squares regression:

SSreg = Σ (ŷi − ȳ)² = b1·SPxy = SSy − SSE

The sum of squared differences between the mean of the y-measures and the predicted y's from the fitted equation. Also, SSy is the sum of the sum of squares regression and the sum of squared errors: SSy = SSreg + SSE.

Then:  r² = SSreg / SSy = 1 − SSE / SSy
 o r² is the coefficient of determination: the proportion of the variance of y accounted for by the regression using x
 o it is the square of the correlation between x and y
 o ranges from 0 (very poor: a horizontal surface representing no relationship between y and the x's) to 1 (perfect fit: the surface passes through the data)
 o SSE and SSy are based on the y's used in the equation, so r² will not be in original units if y was transformed

And:  SE_E = sqrt( SSE / (n − 2) )
 o SE_E is the standard error of the estimate, in the same units as y
 o SSE is based on the y's used in the equation, so SE_E will not be in original units if y was transformed
 o Under normality of the errors:
   - ŷ ± 1 × SE_E covers about 68% of sample observations
   - ŷ ± 2 × SE_E covers about 95% of sample observations
 o Want a low SE_E. 63 64
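The calculations above can be collected into one short sketch. Python is used for illustration (the course itself uses SAS), and the data are made up.

```python
import math

def slr_fit_stats(xs, ys):
    """Fit y = b0 + b1*x by least squares and return the quantities above:
    SSE, SSy, SSreg, r^2, and SE_E."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - xbar) ** 2 for x in xs)
    spxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = spxy / ssx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ssy = sum((y - ybar) ** 2 for y in ys)
    ssreg = ssy - sse                 # also equals b1 * SPxy
    r2 = ssreg / ssy                  # = 1 - SSE/SSy
    se_e = math.sqrt(sse / (n - 2))   # standard error of the estimate
    return b0, b1, sse, ssy, ssreg, r2, se_e

# made-up data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, sse, ssy, ssreg, r2, se_e = slr_fit_stats(xs, ys)
print(round(b0, 3), round(b1, 3), round(r2, 4))
```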
33 If the y-variable was transformed: we can calculate estimates of these for the original y-variable units, called I² (Fit Index) and the estimated standard error of the estimate (SE_E′), in order to compare with the r² and SE_E of other equations where y was not transformed.

I² = 1 − SSE/SSY, where SSE and SSY are in original units. NOTE: you must back-transform the predicted y's to calculate the SSE in original units.

I² does not have the same properties as r², however:
 o it can be less than 0
 o it is not the square of the correlation between the y (in original units) and the x used in the equation.

Estimated standard error of the estimate (SE_E′), when the dependent variable, y, has been transformed:

SE_E′ = sqrt( SSE(original units) / (n − 2) )

SE_E′ is the standard error of the estimate, in the same units as the original units of the dependent variable; want a low SE_E′. [Class example]

Estimated Variances, Confidence Intervals and Hypothesis Tests

Testing Whether the Regression is Significant
Does knowledge of x improve the estimate of the mean of y? Or is it a flat surface, meaning we should just use the mean of y as the estimate of the mean of y for any x?

SSE/(n − 2): called the Mean Squared Error (MSE), as it would be the average of the squared errors if we divided by n. Instead, we divide by n − 2. Why? The degrees of freedom are n − 2: n observations with two statistics, b0 and b1, estimated from them. Under the assumptions of SLR, MSE is an unbiased estimate of the true variance of the error terms (the error variance).

SSreg/1: called the Mean Square Regression. Degrees of freedom: 1 per x-variable. Under the assumptions of SLR, this is an estimate of the error variance PLUS a term of variance explained by the regression using x.
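A sketch of the fit index computation above, assuming (hypothetically) that y was modelled as ln(y) = b0 + b1·x. Python is used for illustration (the course itself uses SAS); the measured values and log-scale predictions are made up.

```python
import math

def fit_index(ys, preds_transformed, back):
    """Fit index I^2 = 1 - SSE/SSY computed in ORIGINAL y units after a
    transformation of y.  `back` back-transforms a predicted value
    (e.g. math.exp if the model was fit to ln(y))."""
    n = len(ys)
    ybar = sum(ys) / n
    preds = [back(p) for p in preds_transformed]   # back to original units
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ssy = sum((y - ybar) ** 2 for y in ys)
    i2 = 1 - sse / ssy
    se_e_prime = math.sqrt(sse / (n - 2))          # SE_E' in original units
    return i2, se_e_prime

# made-up measured y's and predicted ln(y)'s from a hypothetical log fit
ys = [2.0, 4.5, 9.1, 20.3, 41.0]
log_preds = [0.75, 1.52, 2.29, 3.06, 3.83]
i2, se_e_prime = fit_index(ys, log_preds, math.exp)
print(round(i2, 3), round(se_e_prime, 3))
```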
34 H0: Regression is not significant
H1: Regression is significant

Same as:
H0: β1 = 0 [the true slope is zero, meaning no relationship with x]
H1: β1 ≠ 0 [the slope is positive or negative, not zero]

This can be tested using an F-test, as it is the ratio of two variances, or with a t-test, since we are only testing one coefficient (more on this later).

Information for the F-test is often shown as an Analysis of Variance Table:

Source      df     SS      MS                F              p-value
Regression  1      SSreg   MSreg = SSreg/1   MSreg/MSE      Prob F > F(1, n−2, 1−α)
Residual    n−2    SSE     MSE = SSE/(n−2)
Total       n−1    SSy

[Class example and explanation of the p-value]

Using an F test statistic:

F = [SSreg/1] / [SSE/(n − 2)] = MSreg / MSE

Under H0, this follows an F distribution for the 1 − α percentile with 1 and n − 2 degrees of freedom. If the F for the fitted equation is larger than the F from the table, we reject H0 (not likely true). The regression is significant, in that the true slope is likely not equal to zero.
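The F statistic from the table above can be computed directly. Python is used for illustration (the course itself uses SAS); the data are made up and the critical value comes from an F table.

```python
def slr_f_test(xs, ys):
    """ANOVA quantities for SLR: SSreg (df = 1), SSE (df = n - 2),
    and the test statistic F = MSreg / MSE."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - xbar) ** 2 for x in xs)
    spxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = spxy / ssx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ssy = sum((y - ybar) ** 2 for y in ys)
    msreg = (ssy - sse) / 1           # 1 df for the single x-variable
    mse = sse / (n - 2)
    return msreg / mse

# made-up data; n = 5, so the error df is 3
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
f = slr_f_test(xs, ys)
# from an F table, F(1, 3, 0.95) = 10.13; f far exceeds it,
# so H0 (true slope = 0) is rejected: the regression is significant
print(round(f, 1))
```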
35 Estimated Standard Errors for the Slope and Intercept
Under the assumptions, we can obtain unbiased estimates of the standard errors for the slope and for the intercept [a measure of how these would vary among different sample sets], using the one set of sample data:

sb0² = MSE × (1/n + x̄²/SSx)        sb1² = MSE / SSx

where SSx = Σ (xi − x̄)²

Confidence Intervals for the True Slope and Intercept
Under the assumptions, confidence intervals can be calculated as:

For β0:  b0 ± t(1−α/2, n−2) × sb0
For β1:  b1 ± t(1−α/2, n−2) × sb1

[class example]

Hypothesis Tests for the True Slope and Intercept
H0: β1 = c [the true slope is equal to the constant, c]
H1: β1 ≠ c [the true slope differs from the constant c]

Test statistic:  t = (b1 − c) / sb1

Under H0, this is distributed as a t value of tc = t(n−2, 1−α/2). Reject H0 if |t| > tc.

The procedure is similar for testing the true intercept against a particular value.

It is possible to do one-sided hypotheses also, where the alternative is that the true parameter (slope or intercept) is greater than (or less than) a specified constant c. MUST be careful with the tc, as this is different. [class example] 69 70
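A sketch of the standard errors, confidence intervals, and the t-test of H0: β1 = 0 (c = 0) described above. Python is used for illustration (the course itself uses SAS); the data are made up, and the t table value is supplied by hand.

```python
import math

def slr_inference(xs, ys, t_crit):
    """Estimated standard errors and confidence intervals for b0 and b1.
    t_crit is t(1 - alpha/2, n - 2), looked up from a t table."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - xbar) ** 2 for x in xs)
    spxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = spxy / ssx
    b0 = ybar - b1 * xbar
    mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    se_b1 = math.sqrt(mse / ssx)
    se_b0 = math.sqrt(mse * (1 / n + xbar ** 2 / ssx))
    ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
    ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return b0, b1, se_b0, se_b1, ci_b0, ci_b1

# made-up data; t(0.975, df = 3) = 3.182 from a t table
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, se_b0, se_b1, ci_b0, ci_b1 = slr_inference(xs, ys, 3.182)
t_stat = b1 / se_b1    # test of H0: beta1 = 0 (c = 0)
print(round(t_stat, 1), [round(v, 2) for v in ci_b1])
```

Note that for this test, t² equals the F statistic from the ANOVA table, which is why the F-test and the t-test of the slope agree in SLR.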
36 Confidence Interval for the True Mean of y given a particular x value
For the mean of all possible y-values given a particular value of x (μ of y at xh):

ŷh ± t(n−2, 1−α/2) × s(ŷh)

where
ŷh = b0 + b1·xh
s²(ŷh) = MSE × (1/n + (xh − x̄)² / SSx)

Confidence Bands
Plot of the confidence intervals for the mean of y for several x-values. Will appear as: [figure: fitted line with confidence bands that widen away from the mean of x]
37 Confidence Interval for 1 or more new y-values given a particular x value
For one possible new y-value given a particular value of x:

ŷh(new) ± t(n−2, 1−α/2) × s(ŷh(new))

where
ŷh(new) = b0 + b1·xh
s²(ŷh(new)) = MSE × (1 + 1/n + (xh − x̄)² / SSx)

For the average of g new possible y-values given a particular value of x:

ŷh(new g) ± t(n−2, 1−α/2) × s(ŷh(new g))

where
ŷh(new g) = b0 + b1·xh
s²(ŷh(new g)) = MSE × (1/g + 1/n + (xh − x̄)² / SSx)

[class example]

Selecting Among Alternative Models

Process to Fit an Equation using Least Squares
Steps:
1. Sample data are needed, on which the dependent variable and all explanatory (independent) variables are measured.
2. Make any transformations that are needed to meet the most critical assumption: the relationship between y and x is linear. Example: volume = β0 + β1·dbh² may be linear, whereas volume versus dbh is not. Use yi = volume, xi = dbh².
3. Fit the equation to minimize the sum of squared errors.
4. Check the assumptions. If they are not met, go back to Step 2.
5. If the assumptions are met, then interpret the results. Is the regression significant? What is the r²? What is the SE_E? Plot the fitted equation over the plot of y versus x. 73 74
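The two interval formulas above (mean of y, and one or more new y-values) differ only in the extra 1/g term, so they can be computed by one routine. Python is used for illustration (the course itself uses SAS); the data, the xh value, and the t table value are made up or supplied by hand.

```python
import math

def interval_at(xs, ys, xh, t_crit, g=None):
    """Interval estimates at x = xh from a simple linear regression:
    g is None -> CI for the MEAN of y given xh;
    g = 1     -> interval for one new y;
    g > 1     -> interval for the average of g new y's."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ssx
    b0 = ybar - b1 * xbar
    mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    extra = 0.0 if g is None else 1.0 / g     # the only difference
    var = mse * (extra + 1.0 / n + (xh - xbar) ** 2 / ssx)
    yhat = b0 + b1 * xh
    half = t_crit * math.sqrt(var)
    return yhat - half, yhat + half

# made-up data; t(0.975, df = 3) = 3.182 from a t table
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
ci = interval_at(xs, ys, 3.5, 3.182)          # mean of y at x = 3.5
pi = interval_at(xs, ys, 3.5, 3.182, g=1)     # one new y at x = 3.5
# the interval for a new y is always wider than the CI for the mean
print([round(v, 2) for v in ci], [round(v, 2) for v in pi])
```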
38 For a number of models, select based on:
1. Meeting assumptions: if an equation does not meet the assumption of a linear relationship, it is not a candidate model.
2. Compare the fit statistics: select higher r² (or I²) and lower SE_E (or SE_E′).
3. Reject any model where the regression is not significant, since this model is no better than just using the mean of y as the predicted value.
4. Select a model that is biologically tractable. A simpler model is generally preferred, unless there are practical/biological reasons to select the more complex model.
5. Consider the cost of using the model.

[class example]

Simple Linear Regression Example
[Data table: observations of temperature (x) and weight (y), columns Observation, temp, weight; et cetera. Scatterplots of weight (y) versus temperature (x).] 75 76
39 [Scatterplot: weight versus temperature]

Obs.  temp  weight  x-diff  x-diff. sq.  …  Et cetera
[Worked table rows; the column totals give the means, SSX, SSY, and SPXY]

b1 = SPxy / SSx
b0 = ȳ − b1 · x̄

NOTE: calculate b1 first, since this is needed to calculate b0.
40 From these, the residuals (errors) for the equation, and the sum of squared errors (SSE), were calculated:

Obs.  weight  y-pred  residual  residual sq.  …  Et cetera

SSE: …  And SSR = SSY − SSE

F = … with p < .0001 (very small). In Excel, use FDIST(x, df1, df2) to obtain a p-value.
r²: 0.97
Root MSE, or SE_E: .57

BUT: Before interpreting the ANOVA table, are the assumptions met?

ANOVA
Source   df    SS    MS
Model    …
Error    …
Total    …
41 [Residual plot: residuals (errors) versus predicted weight]

Linear? Equal variance? Independent observations?

Normality plot:
Obs.  sorted resids  stand. resids  rel. freq.  Prob. z-dist.  …  Etc. 81 82
42 [Probability plot: cumulative probability versus z-value, with the relative frequency and the Prob. z-dist. overlaid]

Questions:
1. Are the assumptions of simple linear regression met? Evidence?
2. If so, interpret whether this is a good equation, based on goodness-of-fit measures. 83 84
43 3. Is the regression significant?

For 95% confidence intervals for b0 and b1, we would also need the estimated standard errors:

sb0² = MSE × (1/n + x̄²/SSx)        sb1² = MSE / SSx

The t-value for 26 degrees of freedom and the 0.975 percentile is 2.056 (TINV(0.05,26) in Excel). 85 86
44 For β0:  b0 ± t(1−α/2, n−2) × sb0 = 5.85 ± …
For β1:  b1 ± t(1−α/2, n−2) × sb1 = … ± …

       Est. Coeff   St. Error   CI lower   CI upper
b0:    …            …           …          …
b1:    …            …           …          …
(using t(0.975, 26) = 2.056)

Question: Could the real intercept be equal to 0?

Given a temperature of …, what is the estimated average weight (predicted value), and a 95% confidence interval for this estimate?

ŷh = b0 + b1·xh
s²(ŷh) = MSE × (1/n + (xh − 37.5)² / SSx)
ŷh ± t(n−2, 1−α/2) × s(ŷh) 87 88
45 Given a temperature of …, what is the estimated weight for one new observation, and a 95% confidence interval for this estimate?

ŷh(new) = b0 + b1·xh
s²(ŷh(new)) = MSE × (1 + 1/n + (xh − 37.5)² / SSx)
ŷh(new) ± t(n−2, 1−α/2) × s(ŷh(new))

If the assumptions were not met, we would have to make some transformations and start over again!
46 SAS code:

* wttemp.sas ;
options ls=70 ps=50;
run;
DATA regdata;
  input temp weight;
cards;
[data lines]
run;
DATA regdata; set regdata;
  tempsq=temp**2;
  tempcub=temp**3;
  logtemp=log(temp);
run;
Proc plot data=regdata;
  plot weight*(temp tempsq logtemp)='*';
run;
* ;
PROC REG data=regdata simple;
  model weight=temp;
  output out=out p=yhat r=resid;
run;
* ;
PROC PLOT DATA=out;
  plot resid*yhat;
run;
* ;
PROC univariate data=out plot normal;
  Var resid;
Run; 91 92
47 SAS outputs:
1) Graphs: which appears more linear?
2) How many observations were there?
3) What is the mean weight?

The REG Procedure
Model: MODEL1
Dependent Variable: weight
Number of Observations Read  28
Number of Observations Used  28

Analysis of Variance
Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              1    …                 …              …          <.0001
Error             26    …                 …
Corrected Total   27    …

Root MSE        .5760    R-Square    …
Dependent Mean  7.…      Adj R-Sq    0.97
Coeff Var       …

Parameter Estimates
Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1    …                     …                 5.4        <.0001
temp         1    …                     …                 3.98       <.0001 93 94
48 Plot of resid*yhat. Legend: A = 1 obs, B = 2 obs, etc.
[ASCII plot: residuals versus the predicted value of weight]

The UNIVARIATE Procedure
Variable: resid (Residual)

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W      …           Pr < W      …
Kolmogorov-Smirnov    D      …           Pr > D      >0.1500
Cramer-von Mises      W-Sq   …           Pr > W-Sq   >0.2500
Anderson-Darling      A-Sq   …           Pr > A-Sq   >0.2500

[Normal probability plot of the residuals] 95 96
49 Multiple Linear Regression (MLR)

For example:
Population:  yi = β0 + β1·x1i + β2·x2i + … + βm·xmi + εi
Sample:      yi = b0 + b1·x1i + b2·x2i + … + bm·xmi + ei
             ŷi = b0 + b1·x1i + b2·x2i + … + bm·xmi ;  ei = yi − ŷi

β0 is the intercept parameter
β1, β2, β3, …, βm are slope parameters
x1i, x2i, x3i, …, xmi are the independent variables
εi is the error term or residual: the variation in the dependent variable (the y) which is not accounted for by the independent variables (the x's).

NOTE: In the text by Kutner et al., p = m + 1. This is not to be confused with the p-value indicating significance in hypothesis tests.

For any fitted equation (we have the estimated parameters), we can get the estimated average for the dependent variable for any set of x's. This will be the predicted value ŷi, which is the estimated average of y given the particular values for the x variables.

Predicted log10(vol) = b0 + b1 × log10(dbh) + b2 × log10(height), where b0 (= −4.…), b1, and b2 were estimated by finding the least squared error solution.

Using this equation for dbh = 30 cm, height = 28 m: log10(dbh) = 1.48, log10(height) = 1.45; these give log10(vol), and from it the volume (m³). This represents the estimated average volume for trees with dbh = 30 cm and height = 28 m.

Note: This equation is originally a nonlinear equation:

vol = a · dbh^b · ht^c · ε

which was transformed to a linear equation using logarithms:

log10(vol) = log10(a) + b·log10(dbh) + c·log10(ht) + log10(ε)

and this was fitted using multiple linear regression. 97 98
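A sketch of how the least-squares solution for MLR can be obtained by solving the normal equations (X′X)b = X′y. Python is used for illustration (in the course, SAS PROC REG does this for you); the data are made up and generated exactly from known coefficients so the fit recovers them.

```python
def mlr_fit(rows, ys):
    """Least-squares coefficients for y = b0 + b1*x1 + ... + bm*xm,
    found by solving the normal equations (X'X) b = X'y with
    Gaussian elimination and partial pivoting."""
    n = len(rows)
    X = [[1.0] + list(r) for r in rows]    # prepend the intercept column
    p = len(X[0])                          # number of coefficients, m + 1
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(X[i][j] * ys[i] for i in range(n)) for j in range(p)]
    # forward elimination
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, p):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, p):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # back substitution
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (xty[r] - sum(xtx[r][c] * b[c] for c in range(r + 1, p))) / xtx[r][r]
    return b

# made-up data generated exactly from y = 1 + 2*x1 + 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3), (4, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
b = mlr_fit(rows, ys)
print([round(v, 6) for v in b])
```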
50 For the observations in the sample data used to fit the regression, we can also get an estimate of the error (we have the measured volume). If the measured volume for this tree was … m³, then in log10 units:

error: ei = yi − ŷi

for the fitted equation using log10 units. In original units, the estimated error is … (the measured volume minus the back-transformed predicted volume). NOTE: This is not simply the antilog of the error in log10 units.

Finding the Set of Coefficients that Minimizes the Sum of Squared Errors

Same process as for SLR: find the set of coefficients that results in the minimum SSE, just that there are more parameters, and therefore more partial derivative equations and more equations to solve.
 o E.g., with 3 x-variables, there will be 4 coefficients (an intercept plus 3 slopes), so four equations.
For linear models, there will be one unique mathematical solution. For nonlinear models, this is not possible and we must search to find a solution.
Using the criterion of finding the maximum likelihood (probability) rather than the minimum SSE, we would need to search for a solution even for linear models (covered in other courses, e.g., FRST 530). 99 100
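The NOTE above, that the error in original units is not simply the antilog of the log10-scale error, can be checked numerically. The volume and prediction below are made up for illustration.

```python
import math

# hypothetical values: predicted log10(volume) and measured volume (m^3)
log_yhat = 0.95
yhat = 10 ** log_yhat     # predicted volume back-transformed to m^3
y = 10.2                  # measured volume in m^3

error_original = y - yhat                    # error in original units (m^3)
error_log = math.log10(y) - log_yhat         # error in log10 units
antilog_of_log_error = 10 ** error_log       # NOT the same quantity:
                                             # it equals the RATIO y / yhat
print(round(error_original, 3), round(antilog_of_log_error, 3))
```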
More informationStatistical tables are attached Two Hours UNIVERSITY OF MANCHESTER. May 2007 Final Draft
Statistical tables are attached wo Hours UNIVERSIY OF MNHESER Ma 7 Final Draft Medical Statistics M377 Electronic calculators ma be used provided that the cannot store tet nswer LL si questions in SEION
More information36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression
36-309/749 Experimental Design for Behavioral and Social Sciences Sep. 22, 2015 Lecture 4: Linear Regression TCELL Simple Regression Example Male black wheatear birds carry stones to the nest as a form
More informationPubH 7405: REGRESSION ANALYSIS SLR: DIAGNOSTICS & REMEDIES
PubH 7405: REGRESSION ANALYSIS SLR: DIAGNOSTICS & REMEDIES Normal Error RegressionModel : Y = β 0 + β ε N(0,σ 2 1 x ) + ε The Model has several parts: Normal Distribution, Linear Mean, Constant Variance,
More informationCorrelation and Simple Linear Regression
Correlation and Simple Linear Regression Sasivimol Rattanasiri, Ph.D Section for Clinical Epidemiology and Biostatistics Ramathibodi Hospital, Mahidol University E-mail: sasivimol.rat@mahidol.ac.th 1 Outline
More informationST505/S697R: Fall Homework 2 Solution.
ST505/S69R: Fall 2012. Homework 2 Solution. 1. 1a; problem 1.22 Below is the summary information (edited) from the regression (using R output); code at end of solution as is code and output for SAS. a)
More informationOdor attraction CRD Page 1
Odor attraction CRD Page 1 dm'log;clear;output;clear'; options ps=512 ls=99 nocenter nodate nonumber nolabel FORMCHAR=" ---- + ---+= -/\*"; ODS LISTING; *** Table 23.2 ********************************************;
More informationCorrelation & Simple Regression
Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.
More informationSTAT 3A03 Applied Regression Analysis With SAS Fall 2017
STAT 3A03 Applied Regression Analysis With SAS Fall 2017 Assignment 5 Solution Set Q. 1 a The code that I used and the output is as follows PROC GLM DataS3A3.Wool plotsnone; Class Amp Len Load; Model CyclesAmp
More informationIntroduction to Regression
Introduction to Regression Using Mult Lin Regression Derived variables Many alternative models Which model to choose? Model Criticism Modelling Objective Model Details Data and Residuals Assumptions 1
More informationECON3150/4150 Spring 2015
ECON3150/4150 Spring 2015 Lecture 3&4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo January 29, 2015 1 / 67 Chapter 4 in S&W Section 17.1 in S&W (extended OLS assumptions) 2
More information1 A Review of Correlation and Regression
1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then
More informationBIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression
BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression Introduction to Correlation and Regression The procedures discussed in the previous ANOVA labs are most useful in cases where we are interested
More informationBE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club
BE640 Intermediate Biostatistics 2. Regression and Correlation Simple Linear Regression Software: SAS Emergency Calls to the New York Auto Club Source: Chatterjee, S; Handcock MS and Simonoff JS A Casebook
More informationLecture 15 Multiple regression I Chapter 6 Set 2 Least Square Estimation The quadratic form to be minimized is
Lecture 15 Multiple regression I Chapter 6 Set 2 Least Square Estimation The quadratic form to be minimized is Q = (Y i β 0 β 1 X i1 β 2 X i2 β p 1 X i.p 1 ) 2, which in matrix notation is Q = (Y Xβ) (Y
More informationSTATISTICS 479 Exam II (100 points)
Name STATISTICS 79 Exam II (1 points) 1. A SAS data set was created using the following input statement: Answer parts(a) to (e) below. input State $ City $ Pop199 Income Housing Electric; (a) () Give the
More informationInference for Regression Inference about the Regression Model and Using the Regression Line
Inference for Regression Inference about the Regression Model and Using the Regression Line PBS Chapter 10.1 and 10.2 2009 W.H. Freeman and Company Objectives (PBS Chapter 10.1 and 10.2) Inference about
More informationAnalysis of variance and regression. April 17, Contents Comparison of several groups One-way ANOVA. Two-way ANOVA Interaction Model checking
Analysis of variance and regression Contents Comparison of several groups One-way ANOVA April 7, 008 Two-way ANOVA Interaction Model checking ANOVA, April 008 Comparison of or more groups Julie Lyng Forman,
More informationSimple Linear Regression
Chapter 2 Simple Linear Regression Linear Regression with One Independent Variable 2.1 Introduction In Chapter 1 we introduced the linear model as an alternative for making inferences on means of one or
More informationPredict y from (possibly) many predictors x. Model Criticism Study the importance of columns
Lecture Week Multiple Linear Regression Predict y from (possibly) many predictors x Including extra derived variables Model Criticism Study the importance of columns Draw on Scientific framework Experiment;
More informationSTA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3
STA 303 H1S / 1002 HS Winter 2011 Test March 7, 2011 LAST NAME: FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 303 STA 1002 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator. Some formulae
More informationTopic 12. The Split-plot Design and its Relatives (continued) Repeated Measures
12.1 Topic 12. The Split-plot Design and its Relatives (continued) Repeated Measures 12.9 Repeated measures analysis Sometimes researchers make multiple measurements on the same experimental unit. We have
More informationChapter 16. Simple Linear Regression and Correlation
Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7
MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is
More informationFRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE
FRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE Course Title: Probability and Statistics (MATH 80) Recommended Textbook(s): Number & Type of Questions: Probability and Statistics for Engineers
More informationFinal Exam - Solutions
Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your
More informationConfidence Intervals, Testing and ANOVA Summary
Confidence Intervals, Testing and ANOVA Summary 1 One Sample Tests 1.1 One Sample z test: Mean (σ known) Let X 1,, X n a r.s. from N(µ, σ) or n > 30. Let The test statistic is H 0 : µ = µ 0. z = x µ 0
More information5.3 Three-Stage Nested Design Example
5.3 Three-Stage Nested Design Example A researcher designs an experiment to study the of a metal alloy. A three-stage nested design was conducted that included Two alloy chemistry compositions. Three ovens
More informationLinear Regression. Simple linear regression model determines the relationship between one dependent variable (y) and one independent variable (x).
Linear Regression Simple linear regression model determines the relationship between one dependent variable (y) and one independent variable (x). A dependent variable is a random variable whose variation
More informationA Re-Introduction to General Linear Models (GLM)
A Re-Introduction to General Linear Models (GLM) Today s Class: You do know the GLM Estimation (where the numbers in the output come from): From least squares to restricted maximum likelihood (REML) Reviewing
More informationTHE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE
THE ROYAL STATISTICAL SOCIETY 004 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE PAPER II STATISTICAL METHODS The Society provides these solutions to assist candidates preparing for the examinations in future
More information