Forestry 430 Advanced Biometrics and FRST 533 Problems in Statistical Methods Course Materials 2010
Forestry 430 Advanced Biometrics and FRST 533 Problems in Statistical Methods Course Materials 2010

Instructor: Dr. Valerie LeMay, Forest Sciences 039

Course Objectives and Overview:

The objectives of this course are:
1. To be able to use simple linear and multiple linear regression to fit models using sample data;
2. To be able to design and analyze lab and field experiments;
3. To be able to interpret results of model fitting and experimental analysis; and
4. To be aware of other analysis methods not explicitly covered in this course.

In order to meet these objectives, background theory and examples will be used. A statistical package called SAS will be used in examples, and used to help in analyzing data in exercises. Texts are also important, both to increase understanding while taking the course, and as a reference for future applied and research work.

Course Content Materials:

These cover most of the course materials. However, changes will be made from year to year, including additional examples. Any additional course materials will be given as in-class handouts. NOTE: Items given in italics are only described briefly in this course. These course materials will be presented in class and are essential for the course. These materials are not published and should not be used as citations for papers. Recommendations for some published reference materials, including the textbook for the course, will be listed in the course outline handed out in class.

I. Short Review of Probability and Statistics (pp. 9-37)
- Descriptive statistics
- Inferential statistics using known probability distributions: normal, t, F, Chi-square, binomial, Poisson

II. Fitting Equations (pp. )
- Dependent variable (y) and predictor variables (x's)
- Purpose: prediction and examination
- General examples
- Simple linear, multiple linear, and nonlinear regression
- Objectives in fitting: least squared error or maximum likelihood

Simple Linear Regression (SLR) (pp. 4-96)
- Definition, notation, and example uses
  - dependent variable (y) and predictor variable (x)
  - intercept, slope, and error
- Least squares solution to finding an estimated intercept and slope
  - Derivation
  - Normal equations
  - Examples
- Assumptions of simple linear regression and properties when assumptions are met
  - Residual plots to visually check the assumptions that:
    1. Relationship is linear (MOST IMPORTANT!!)
    2. Equal variance of y around x (equal spread of errors around the line)
    3. Observations are independent (not correlated in space nor time)
  - Normality plots to check the assumption that:
    4. Normal distribution of y around x (normal distribution of errors around the line)
  - Sampling and measurement assumptions:
    5. x values are fixed
    6. random sampling of y occurs for every x
- Transformations and other measures to meet assumptions
  - Common transformations for nonlinear trends, unequal variances, percents; rank transformation
  - Outliers: unusual observations
  - Other methods: nonlinear least squares, weighted least squares, general least squares, general linear models
- Measures of goodness-of-fit
  - Graphs
  - Coefficient of determination (r²) [and Fit Index, I²]
  - Standard error of the estimate (SEE) [and SEE']
- Estimated variances, confidence intervals and hypothesis tests
  - For the equation
  - For the intercept and slope
  - For the mean of the dependent variable given a value for x
  - For a single or group of values of the predicted dependent variable given a value for x
- Selecting among alternative models
- Process to fit an equation using least squares regression
  - Meeting assumptions
  - Measures of goodness-of-fit: graphs, coefficient of determination (r²) or I², and standard error of the estimate (SEE) or SEE'
  - Significance of the regression
  - Biological or logical basis and cost

Multiple Linear Regression (pp. )
- Definition, notation, and example uses
  - dependent variable (y) and predictor variables (x's)
  - intercept, slopes, and error
- Least squares solution to finding an estimated intercept and slopes
  - Least squares and comparison to maximum likelihood estimation
  - Derivation
  - Linear algebra to obtain normal equations; matrix algebra
  - Examples: calculations and SAS outputs
- Assumptions of multiple linear regression
  - Residual plots to visually check the assumptions that:
    1. Relationship is linear (with ALL x's, not each x, necessarily); MOST IMPORTANT!!
    2. Equal variance of y around the x's (equal spread of errors around the surface)
    3. Observations are independent (not correlated in space nor time)
  - Normality plots to check the assumption that:
    4. Normal distribution of y around the x's (normal distribution of errors around the surface)
  - Sampling and measurement assumptions:
    5. x values are fixed
    6. random sampling of y occurs for every combination of x values
  - Properties when all assumptions are met versus some are not met
- Transformations and other measures to meet assumptions: same as for SLR, but more difficult to select correct transformations
- Measures of goodness-of-fit
  - Graphs
  - Coefficient of multiple determination (R²) [and Fit Index, I²]
  - Standard error of the estimate (SEE) [and SEE']
- Estimated variances, confidence intervals and hypothesis tests: calculations and SAS outputs
  - For the regression surface
  - For the intercept and slopes
  - For the mean of the dependent variable given a particular value for each of the x variables
  - For a single or group of values of the predicted dependent variable given a particular value for each of the x variables
- Adding class variables as predictors
  - Dummy variables to represent a class variable
  - Interactions to change slopes for different classes
  - Comparing two regressions for different class levels
  - More than one class variable (class variables as the dependent variable are covered in FRST 530, under generalized linear models)
- Methods to aid in selecting predictor (x) variables
  - All possible regressions; R² criterion in SAS
  - Stepwise methods
- Selecting and comparing alternative models
  - Meeting assumptions
  - Parsimony and cost
  - Biological nature of the system modeled
  - Measures of goodness-of-fit: graphs, coefficient of determination (R²) [or Fit Index, I²], and standard error of the estimate (SEE) [or SEE']
  - Comparing models when some models have a transformed dependent variable
  - Other methods using maximum likelihood criteria

III. Experimental Design and Analysis (pp. 74-9)
- Sampling versus experiments
- Definitions of terms: experimental unit, response variable, factors, treatments, replications, crossed factors, randomization, sum of squares, degrees of freedom, confounding
- Variations in designs: number of factors, fixed versus random effects, blocking, split-plot, nested factors, subsampling, covariates
- Designs in use
- Main questions in experiments

Completely Randomized Design (CRD) (pp. )
- Definition: no blocking and no splitting of experimental units

One Factor Experiment, Fixed Effects (pp. )
- Main questions of interest
- Notation and example: observed response, overall (grand) mean, treatment effect, treatment means
- Data organization and preliminary calculations: means and sums of squares
- Test for differences among treatment means: error variance, treatment effect, mean squares, F-test
- Assumptions regarding the error term: independence, equal variance, normality; expected values under the assumptions
- Differences among particular treatment means
- Confidence intervals for treatment means
- Power of the test
- Transformations if assumptions are not met
- SAS code

Two Factor Experiment, Fixed Effects (pp. )
- Introduction: separating treatment effects into factor 1, factor 2, and the interaction between these
- Example layout
- Notation, means and sums of squares calculations
- Assumptions, and transformations
- Test for interactions and main effects: ANOVA table, expected mean squares, hypotheses and tests, interpretation
- Differences among particular treatment means
- Confidence intervals for treatment means
- SAS analysis for example

One Factor Experiment, Random Effects
- Definition and example
- Notation and assumptions
- Least squares versus maximum likelihood solution

Two Factor Experiment, One Fixed and One Random Effect (pp. )
- Introduction
- Example layout
- Notation, means and sums of squares calculations
- Assumptions, and transformations
- Test for interactions and main effects: ANOVA table, expected mean squares, hypotheses and tests, interpretation
- SAS code
- Orthogonal polynomials (not covered)
Restrictions on Randomization (pp. )

Randomized Block Design (RCB) with one fixed factor (pp. )
- Introduction, example layout, data organization, and main questions
- Notation, means and sums of squares calculations
- Assumptions, and transformations
- Differences among treatments: ANOVA table, expected mean squares, hypotheses and tests, interpretation
- Differences among particular treatment means
- Confidence intervals for treatment means
- SAS code

Randomized Block Design with other experiments (pp. )
- RCB with replicates in each block
- Two fixed factors
- One fixed, one random factor

Incomplete Block Design
- Definition
- Examples

Latin Square Design: restrictions in two directions (pp. )
- Definition and examples
- Notation and assumptions
- Expected mean squares
- Hypotheses and confidence intervals for main questions if assumptions are met

Split Plot and Split-Split Plot Design (pp. )
- Definition and examples
- Notation and assumptions
- Expected mean squares
- Hypotheses and confidence intervals for main questions if assumptions are met

Nested and hierarchical designs (pp. )
- CRD: Two Factor Experiment, Both Fixed Effects, with Second Factor Nested in the First Factor (pp. )
  - Introduction using an example
  - Notation
  - Analysis methods: averages, least squares, maximum likelihood
  - Data organization and preliminary calculations: means and sums of squares
  - Example using SAS
- CRD: One Factor Experiment, Fixed Effects, with sub-sampling (pp. )
  - Introduction using an example
  - Notation
  - Analysis methods: averages, least squares, maximum likelihood
  - Data organization and preliminary calculations: means and sums of squares
  - Example using SAS
- RCB: One Factor Experiment, Fixed Effects, with sub-sampling (pp. )
  - Introduction using an example
  - Example using SAS

Adding Covariates (continuous variables) (pp. )
- Analysis of covariance
  - Definition and examples
  - Notation and assumptions
  - Expected mean squares
  - Hypotheses and confidence intervals for main questions if assumptions are met
- Allowing for inequality of slopes

Expected Mean Squares: Method to Calculate These (pp. )
- Method and examples

Power Analysis (pp. )
- Concept and an example

Use of Linear Mixed Models for Experimental Design (pp. )
- Concept and examples

Summary (pp. )
Probability and Statistics Review

Population vs. sample:
- N = number of observations in the population
- n = number of observations in the sample

Experimental vs. observational studies: In experiments, we manipulate the conditions, whereas in observational studies we simply measure what is already there. Therefore, in experiments, we try to assign cause and effect.

Variable of interest / dependent variable / response variable / outcome: y
Auxiliary variables / explanatory variables / predictor variables / independent variables / covariates: x

Observations: measure y's and x's for a census (all N) or on a sample (n out of the N).

x and y can be: 1) continuous (ratio or interval scale); or 2) discrete (nominal or ordinal scale).

Descriptive statistics: summarize the sample data as means, variances, ranges, etc.
Inferential statistics: use the sample statistics to estimate the parameters of the population.
Parameters for populations:

1. Mean, μ_y. E.g., for N = 4 and y1 = 5, y2 = 6, y3 = 7, y4 = 6: μ_y = 6.
2. Range: maximum value minus minimum value.
3. Standard deviation, σ_y, and variance, σ_y²:
   σ_y² = Σ_{i=1..N} (y_i − μ_y)² / N
4. Covariance between x and y:
   σ_xy = Σ_{i=1..N} (y_i − μ_y)(x_i − μ_x) / N
5. Correlation (Pearson's) between two variables, y and x:
   ρ_xy = σ_xy / (σ_x σ_y)
   Ranges from −1 to +1, with strong negative correlations near −1 and strong positive correlations near +1.
6. Distribution for y: frequency of each value of y or x (may be divided into classes).
7. Probability distribution of y or x: the probability associated with each value.
8. Mode: most common value of y or x.
9. Median: y-value or x-value which divides the distribution (50% of the N observations are above and 50% are below).
Example: 250 aspen trees of Alberta

Descriptive statistics for age (N = 250 trees):
- Mean: 71 years
- Median: 73 years
- 25% percentile: 55
- 75% percentile: 81
- Minimum: 14
- Maximum: 160
- Variance: 54.7
- Standard deviation: 12.69

Compare mean versus median. Normal distribution?

Pearson correlation of age and dbh for the population of N = 250 trees?
Statistics from the sample:

1. Mean, ȳ. E.g., for n = 3 and y1 = 5, y2 = 6, y3 = 7: ȳ = 6.
2. Range: maximum value minus minimum value.
3. Standard deviation, s_y, and variance, s_y²:
   s_y² = Σ_{i=1..n} (y_i − ȳ)² / (n − 1)
4. Standard deviation of the sample means (also called the standard error, short for standard error of the mean), and its square, the variance of the sample means, are estimated by:
   s_ȳ = s_y / √n  and  s_ȳ² = s_y² / n
5. Coefficient of variation (CV): the standard deviation from the sample, divided by the sample mean. May be multiplied by 100 to get CV in percent.
6. Covariance between x and y:
   s_xy = Σ_{i=1..n} (y_i − ȳ)(x_i − x̄) / (n − 1)
7. Correlation (Pearson's) between two variables, y and x:
   r_xy = s_xy / (s_x s_y)
   Ranges from −1 to +1, with strong negative correlations near −1 and strong positive correlations near +1.
8. Distribution for y: frequency of each value of y or x (may be divided into classes).
9. Estimated probability distribution of y or x: the probability associated with each value, based on the n observations. Example: n = 50.
10. Mode: most common value of y or x.
11. Median: y-value or x-value which divides the estimated probability distribution (50% of the n observations are above and 50% are below).
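The sample statistics listed above can be computed in a few lines. The course's worked examples use SAS; the sketch below is an illustrative Python version with made-up values (not the aspen data), using only the standard library.

```python
import math
from statistics import mean, stdev

# Hypothetical sample (illustrative values, not the course data)
y = [5.0, 6.0, 7.0]
x = [2.0, 4.0, 6.0]
n = len(y)

ybar = mean(y)                      # sample mean
s_y = stdev(y)                      # sample standard deviation (n - 1 divisor)
se_mean = s_y / math.sqrt(n)        # standard error of the mean
cv_pct = 100 * s_y / ybar           # coefficient of variation, in percent

# Sample covariance and Pearson correlation between y and x
xbar = mean(x)
s_xy = sum((yi - ybar) * (xi - xbar) for yi, xi in zip(y, x)) / (n - 1)
r_xy = s_xy / (stdev(x) * s_y)      # x and y are perfectly linear here, so r = 1

print(ybar, s_y, cv_pct, r_xy)
```

Note that `stdev` uses the n − 1 divisor, matching the sample (not population) formulas above.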
For the sample of n = 50 trees:
- Mean: 69 years
- Median: 68 years
- 25% percentile: 48
- 75% percentile: 81
- Minimum: 14
- Maximum: 160
- Variance: 246.2
- Standard deviation: 15.69 years
- Standard error of the mean: 2.2 years

Good estimates of the population values?

Pearson correlation of age and dbh for the sample of n = 50 trees from a population of 250 trees: 0.66. Null and alternative hypothesis for the p-value? What is a p-value?

Sample Statistics to Estimate Population Parameters:

If simple random sampling (every observation has the same chance of being selected) is used to select n from N, then:
- Sample estimates are unbiased estimates of their counterparts (e.g., the sample mean estimates the population mean), meaning that over all possible samples the sample statistics, averaged, would equal the population statistic.
- A particular sample value (e.g., a sample mean) is called a point estimate; it does not necessarily equal the population parameter for a given sample.
- We can calculate an interval where the true population parameter is likely to be, with a certain probability. This is a confidence interval, and can be obtained for any population parameter IF the distribution of the sample statistic is known.
Common continuous distributions:

Normal:
- Symmetric distribution around μ.
- Defined by μ and σ². If we know that a variable has a normal distribution, and we know these parameters, then we know the probability of getting any particular value for the variable.
- Probability tables are for μ = 0 and σ² = 1, and are often called z-tables. Examples: P(−1 < z < +1) ≈ 0.68; P(−1.96 < z < 1.96) = 0.95.
- Notation example: for α = 0.05, z_{1−α/2} = z_{0.975} = 1.96.
- z-scores: scale the values for y by subtracting the mean and dividing by the standard deviation:
  z_i = (y_i − μ) / σ
  E.g., for a mean of 20 and a standard deviation of 10, y = −30 gives z = −5.0 (an extreme value).
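The z-table probabilities and the z-score calculation above can be reproduced with Python's standard-library `statistics.NormalDist`; a small sketch (the mean of 20 and standard deviation of 10 are the illustrative values from the example above):

```python
from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)   # the standard normal ("z-table") distribution

# P(-1 < z < +1) and P(-1.96 < z < 1.96) from the cumulative distribution
p1 = z.cdf(1) - z.cdf(-1)         # about 0.68
p2 = z.cdf(1.96) - z.cdf(-1.96)   # about 0.95

def z_score(y, mu, sigma):
    # Scale y by subtracting the mean and dividing by the standard deviation
    return (y - mu) / sigma

print(round(p1, 2), round(p2, 2), z_score(-30, 20, 10))
```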
t-distribution:
- Symmetric distribution, with table values centered at 0.
- The spread varies with the degrees of freedom. As the sample size increases, the df increase, the spread decreases, and the distribution approaches the normal distribution.
- Used for a normally distributed variable whenever the variance of that variable is not known.
- Notation example: t_{n−1, 1−α}, where n − 1 is the degrees of freedom (in this case), and 1 − α is the percentile we are looking for. For example, for n = 5 and α = 0.05, we are looking for t with 4 degrees of freedom at the 0.95 percentile (a value around 2).

χ² distribution:
- Starts at zero, and is not symmetric.
- It is the square of a normally distributed variable; e.g., sample variances have a χ² distribution if the variable is normally distributed.
- Need the degrees of freedom and the percentile, as with the t-distribution.
F-distribution:
- The ratio of variables that each have a χ² distribution; e.g., the ratio of sample variances for variables that are each normally distributed.
- Need the percentile, and two degrees of freedom (one for the numerator and one for the denominator).

Central Limit Theorem: as n increases, the distribution of sample means will approach a normal distribution, even if the distribution of the variable is something else (e.g., could be non-symmetric).

Tables in the textbook: some tables give the values of the probability distribution for the degrees of freedom and for the percentile. Others give this for the degrees of freedom and for the alpha level (or sometimes alpha/2). Be careful in reading probability tables.

Confidence intervals for a single mean:
- Collect data and get point estimates:
  - the sample mean, ȳ, to estimate the population mean μ; will be unbiased;
  - the sample variance, s_y², to estimate the population variance σ_y²; will be unbiased.
- Can calculate interval estimates of each point estimate, e.g., a 95% confidence interval for the true mean, if:
  - the y's are normally distributed, OR
  - the sample size is large enough that the Central Limit Theorem holds, so ȳ will be normally distributed.
Sample variance (computational form), where n items are measured out of the N possible items (N is sometimes infinite; square each value, then sum over all items):

s_y² = [ Σ_{i=1..n} y_i² − (Σ_{i=1..n} y_i)² / n ] / (n − 1)

Variance of the sample mean:
- without replacement: s_ȳ² = (s_y² / n) × (1 − n/N)
- with replacement, or when N is very large: s_ȳ² = s_y² / n

Coefficient of variation (percent): CV = 100 × s_y / ȳ

95% confidence interval for the true mean of the population:

ȳ ± t_{n−1, 1−α/2} × s_ȳ
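As a numeric sketch of the confidence-interval formula (hypothetical plot volumes, not the course data; the t value is read from a t-table):

```python
import math
from statistics import mean, stdev

# Hypothetical sample of n = 4 plot volumes
y = [100.0, 110.0, 105.0, 125.0]
n = len(y)

ybar = mean(y)
s = stdev(y)
se = s / math.sqrt(n)       # with-replacement / large-N form of s_ybar

# 95% CI: ybar +/- t(n-1, 0.975) * se; t with 3 df is 3.182 (t-table)
t_crit = 3.182
half_width = t_crit * se

print(ybar, round(half_width, 2))
```

With only 4 observations the t value (3.182) is much larger than the normal 1.96, which is why using z in place of t understates the interval for small samples.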
Example: n = 4 plots, with volume per ha and average dbh recorded for each plot (worked in class):
- mean, variance, std. dev., and std. dev. of the mean: calculated from the 4 plot volumes
- t should be: t_{3, 0.975} = 3.182
- Actual 95% CI (+/−): t × std. dev. of the mean
- NOTE: the 95% (+/−) value EXCEL computes uses z rather than t: not correct!!!

Hypothesis Tests:
- We can hypothesize what the true value of any population parameter might be, and state this as the null hypothesis (H0:).
- We also state an alternate hypothesis (H1: or Ha:) that it is a) not equal to this value; b) greater than this value; or c) less than this value.
- Collect sample data to test this hypothesis.
- From the sample data, we calculate a sample statistic as a point estimate of this population parameter, and an estimated variance of the sample statistic.
- We calculate a test statistic using the sample estimates.
- Under H0, this test statistic will follow a known distribution. If the test statistic is very unusual compared to the tabular values for the known distribution, then H0 is very unlikely and we conclude H1.
Example for a single mean: We believe that the average weight of ravens in the Yukon is 1 kg.

H0: μ = 1 kg
H1: μ ≠ 1 kg

A sample of 20 birds is taken (HOW??) and each bird is weighed and released. The average bird weight is 0.8 kg, and the standard deviation was 0.20 kg. Assuming the bird weights follow a normal distribution, we can use a t-test (why not a z-test?).

Test statistic:  t = (ȳ − μ0) / s_ȳ,  where the standard error of the mean is s_ȳ = s_y / √n.

Aside: what is the CV?

Under H0, this will follow a t-distribution with df = n − 1. Find the value from the t-table and compare.

Mean? Variance? Conclude?
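The one-mean t-test above can be sketched numerically. The sample standard deviation is garbled in the source, so 0.20 kg below is an assumed value for illustration only:

```python
import math

# Raven example in the spirit of the notes
n = 20
ybar = 0.8     # sample mean weight (kg)
s = 0.20       # ASSUMED sample standard deviation (kg); value garbled in source
mu0 = 1.0      # hypothesized mean under H0 (kg)

se = s / math.sqrt(n)          # standard error of the mean
t_stat = (ybar - mu0) / se     # t statistic with n - 1 = 19 df

# Compare |t| against t(19, 0.975) = 2.093 from a t-table
print(round(t_stat, 2))
```

Here |t| is well beyond 2.093, so under these assumed numbers H0 would be rejected.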
The p-value: the probability of getting a test statistic as unusual as, or more unusual than, the one from the sample. NOTE: in EXCEL use: TDIST(x, df, tails).

Example: comparing two means. We believe that the average weight of male ravens differs from that of female ravens.

H0: μ1 = μ2 (or μ1 − μ2 = 0)
H1: μ1 ≠ μ2 (or μ1 − μ2 ≠ 0)

A sample of 20 birds is taken and each bird is weighed and released. 12 birds were males, with an average weight of 1.2 kg and a standard deviation of 0.20 kg. 8 birds were females, with an average weight of 0.8 kg and a standard deviation of 0.20 kg.

Means? Sample variances?
Test statistic (pooled variances):

t = (ȳ1 − ȳ2 − 0) / s_{ȳ1−ȳ2}

s_{ȳ1−ȳ2} = √[ s_p² (1/n1 + 1/n2) ],  where  s_p² = [ (n1 − 1) s1² + (n2 − 1) s2² ] / (n1 + n2 − 2)

Under H0, this will follow a t-distribution with df = n1 + n2 − 2. Find the t-value from the tables and compare, or use the p-value. Conclude?

Errors for Hypothesis Tests:

            H0 True             H0 False
  Accept    1 − α               β (Type II error)
  Reject    α (Type I error)    1 − β

- Type I error: reject H0 when it is true. The probability of this happening is α.
- Type II error: accept H0 when it is false. The probability of this happening is β.
- Power of the test: reject H0 when it is false. The probability of this is 1 − β.
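The pooled two-sample t statistic above can be computed directly. The group standard deviations are garbled in the source, so 0.20 kg is an assumed value for both groups here:

```python
import math

# Two-sample (pooled) t-test sketch for the raven example
n1, ybar1, s1 = 12, 1.2, 0.20   # males; SD is an ASSUMED value
n2, ybar2, s2 = 8, 0.8, 0.20    # females; SD is an ASSUMED value

# Pooled variance: ((n1-1)s1^2 + (n2-1)s2^2) / (n1 + n2 - 2)
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

# t statistic and its degrees of freedom
t_stat = (ybar1 - ybar2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2

print(round(t_stat, 2), df)
```

Compare the result against t_{18, 0.975} = 2.101 from a t-table (or compute the p-value) to decide between H0 and H1.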
What increases power?
- Increased sample sizes, resulting in lower standard errors.
- A larger difference between the mean under H0 and the mean under H1.
- Increased alpha; this will decrease beta.

Fitting Equations

REF:

The idea is:
- y: the variable of interest (dependent variable), y_i; hard to measure.
- Easy-to-measure variables (predictor/independent) that are related to the variable of interest, labeled x_1i, x_2i, ..., x_mi.
- Measure y_i, x_1i, ..., x_mi for a sample of n items.
- Use this sample to estimate an equation that relates y_i (dependent variable) to x_1i, ..., x_mi (independent or predictor variables).
- Once the equation is fitted, one can then just measure the x's and get an estimate of y without measuring it. One can also examine relationships between variables.
Examples:
1. Percent decay_i; x_1i = log10(dbh)
2. log10(volume)_i; x_1i = log10(dbh), x_2i = log10(height)
3. Branch length_i; x_1i = relative height above ground, x_2i = dbh, x_3i = height

Objective: find estimates of β0, β1, β2, ..., βm such that the sum of squared differences between the measured y_i and the predicted y_i (usually labeled ŷ_i; values on the line or surface) is the smallest (minimize the sum of squared errors, called least squared error). OR: find estimates of β0, β1, β2, ..., βm such that the likelihood (probability) of getting these y values is the largest (maximize the likelihood). Finding the minimum of the sum of squared errors is often easier. In some cases, the two lead to the same estimates of the parameters.

Types of Equations:
- Simple linear equation: y_i = β0 + β1 x_i + ε_i
- Multiple linear equation: y_i = β0 + β1 x_1i + β2 x_2i + ... + βm x_mi + ε_i
- Nonlinear equation: takes many forms, for example: y_i = β0 + β1 x_1i^β2 x_2i^β3 + ε_i
Simple Linear Regression (SLR)

Population: y_i = β0 + β1 x_i + ε_i, with μ_{y|x} = β0 + β1 x
Sample: y_i = b0 + b1 x_i + e_i, with ŷ_i = b0 + b1 x_i and e_i = y_i − ŷ_i

- b0 is an estimate of β0 [intercept]
- b1 is an estimate of β1 [slope]
- ŷ_i is the predicted y; an estimate of the average of y for a particular x value
- e_i is an estimate of ε_i, called the error or the residual; it represents the variation in the dependent variable (the y) which is not accounted for by the predictor variable (the x)

Find b0 (the intercept; y_i when x_i = 0) and b1 (the slope) so that SSE = Σ e_i² (the sum of squared errors over all n sample observations) is the smallest (least squares solution).

The variables do not have to be in the same units. Coefficients will change with different units of measure.

Given estimates of b0 and b1, we can get an estimate of the dependent variable (the y) for ANY value of the x, within the range of x's represented in the original data.

Example: Tree height (m): hard to measure. Dbh (diameter at 1.3 m above ground, in cm): easy to measure. Use dbh squared for a linear equation, plotting height (y) against dbh squared (x).

For each observation, the difference between the measured y and the mean of y can be split into two parts:

(y_i − ȳ) = (y_i − ŷ_i) + (ŷ_i − ȳ)

- y_i − ŷ_i: difference between the measured and predicted y
- ŷ_i − ȳ: difference between the predicted y and the mean of y
Least Squares Solution: Finding the Set of Coefficients that Minimizes the Sum of Squared Errors

To find the estimated coefficients that minimize SSE for a particular set of sample data and a particular equation (form and variables):
1. Define the sum of squared errors (SSE) in terms of the measured minus the predicted y's (the errors);
2. Take partial derivatives of the SSE equation with respect to each coefficient;
3. Set these equal to zero (for the minimum) and solve all of the equations (using algebra or linear algebra).

For linear models (simple or multiple linear), there will be one solution. We can mathematically solve the set of partial derivative equations. The fitted line WILL ALWAYS GO THROUGH THE POINT DEFINED BY (x̄, ȳ), and will always result in Σ e_i = 0.

For nonlinear models, this is not possible and we must search to find a solution (covered in FRST 530). If we used the criterion of finding the maximum likelihood (probability) rather than the minimum SSE, we would need to search for a solution even for linear models (covered in FRST 530).
Least Squares Solution for SLR: find the set of estimated parameters (coefficients) that minimize the sum of squared errors:

min SSE = min Σ_{i=1..n} e_i² = min Σ_{i=1..n} [ y_i − (b0 + b1 x_i) ]²

Take partial derivatives with respect to b0 and b1, set them equal to zero, and solve:

∂SSE/∂b0 = −2 Σ (y_i − b0 − b1 x_i) = 0
  →  Σ y_i = n b0 + b1 Σ x_i
  →  b0 = ȳ − b1 x̄

∂SSE/∂b1 = −2 Σ x_i (y_i − b0 − b1 x_i) = 0
  →  Σ x_i y_i = b0 Σ x_i + b1 Σ x_i²

With some further manipulations:

b1 = Σ_{i=1..n} (y_i − ȳ)(x_i − x̄) / Σ_{i=1..n} (x_i − x̄)²  =  SPxy / SSx  =  s_xy / s_x²

where SPxy refers to the corrected sum of cross products for x and y, and SSx refers to the corrected sum of squares for x. [Class example]
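The normal-equation solution above can be checked numerically; here is a minimal Python sketch with made-up data (the course examples use SAS). It also verifies two properties derived above: the residuals sum to zero, and b0 = ȳ − b1 x̄ forces the line through (x̄, ȳ).

```python
# Least squares SLR via the corrected sums SPxy and SSx (illustrative data)
x = [2.0, 4.0, 6.0, 8.0]
y = [3.1, 4.9, 7.0, 9.0]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

SPxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
SSx = sum((xi - xbar) ** 2 for xi in x)

b1 = SPxy / SSx          # slope
b0 = ybar - b1 * xbar    # intercept

# Residuals e_i = y_i - (b0 + b1 x_i); their sum is zero for least squares
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(round(b0, 3), round(b1, 3))
```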
Properties of b0 and b1

b0 and b1 are least squares estimates of β0 and β1. Under the assumptions concerning the error term and the sampling/measurements, these are:
- Unbiased estimates: given many estimates of the slope and intercept for all possible samples, the average of the sample estimates will equal the true values.
- The variability of these estimates from sample to sample can be estimated from the single sample; these estimated variances will be unbiased estimates of the true variances (and standard errors).
- The estimated intercept and slope will be the most precise (most efficient, with the lowest variances) estimates possible (called "Best").
- These will also be the maximum likelihood estimates of the intercept and slope.

Assumptions of SLR

Once coefficients are obtained, we must check the assumptions of SLR. Assumptions must be met to:
- obtain the desired characteristics;
- assess goodness of fit (i.e., how well the regression line fits the sample data);
- test significance of the regression and other hypotheses;
- calculate confidence intervals and test hypotheses for the true coefficients (population);
- calculate confidence intervals for the mean predicted value given a set of x values (i.e., for the predicted y given a particular value of the x).

We need good estimates (unbiased or at least consistent) of the standard errors of the coefficients, and a known probability distribution, to test hypotheses and calculate confidence intervals.
Checking Assumptions Using Residual Plots

The assumptions of:
1. a linear relationship between the y and the x;
2. equal variance of errors; and
3. independence of errors (independent observations)
can be visually checked using RESIDUAL PLOTS.

A residual plot shows the residual (i.e., e_i = y_i − ŷ_i) as the y-axis and the predicted value (ŷ_i) as the x-axis. In a residual plot that meets the assumptions of a linear relationship and equal variance of the observations, the data points are evenly distributed about zero and there are no outliers (very unusual points that may be a measurement or entry error).

Residual plots can also indicate unusual points (outliers) that may be measurement errors, transcription errors, etc.
Examples of Residual Plots Indicating Failures to Meet Assumptions:

1. The relationship between the x's and y must be linear. If this is not met, the residual plot and the plot of y vs. x will show a curved line.

[Figure: plot of height (ht) against dbh squared showing a curved trend, and the corresponding residual plot of residuals against predicted height showing a curved band.]

Result: if this assumption is not met, the regression line does not fit the data well; biased estimates of the coefficients and of goodness of fit will occur.

2. The variance of the y values must be the same for every one of the x values. If this is not met, the spread around the line will not be even.

Result: if this assumption is not met, the estimated coefficients (slopes and intercept) will be unbiased, but the estimates of the standard deviations of these coefficients will be biased, so we cannot calculate CIs nor test the significance of the x variable. However, estimates of the coefficients of the regression line and of goodness of fit are still unbiased.
3. Each observation (i.e., x_i and y_i) must be independent of all other observations. In this case, we produce a different residual plot, where the residuals are on the y-axis as before, but the x-axis is the variable that is thought to produce the dependencies (e.g., time). If the assumption is not met, this revised residual plot will show a trend, indicating the residuals are not independent.

Result: if this assumption is not met, we cannot calculate CIs nor test the significance of the x variable. However, estimates of the coefficients of the regression line and of goodness of fit are still unbiased.

Normality Histogram or Plot

A fourth assumption of SLR is:

4. The y values must be normally distributed for each of the x values. A histogram of the errors and/or a normality plot can be used to check this, as well as tests of normality.

[Figure: histogram and boxplot of the residuals, roughly symmetric and centered on zero.]

Tests for normality:
H0: data are normal
H1: data are not normal
[SAS output: normality tests with their statistics and p-values: Shapiro-Wilk (W, Pr < W), Kolmogorov-Smirnov (D, Pr > D), Cramer-von Mises (W-Sq, Pr > W-Sq), and Anderson-Darling (A-Sq, Pr > A-Sq).]

[Figure: normal probability plot of the residuals.]

Result: if this assumption is not met, we cannot calculate CIs nor test the significance of the x variable, since we do not know what probabilities to use. Also, the estimated coefficients are no longer equal to the maximum likelihood solution.

Example: [Figure: four-panel residual diagnostics for volume versus dbh: normal plot of residuals, histogram of residuals, I chart of residuals, and residuals vs. fits.]
Measurements and Sampling Assumptions

The remaining assumptions are based on the measurements and the collection of the sample data.

5. The x values are measured without error (i.e., the x values are fixed). This can only be known if the process of collecting the data is known. For example, if tree diameters are very precisely measured, there will be little error. If this assumption is not met, the estimated coefficients (slopes and intercept) and their variances will be biased, since the x values are varying.

6. The y values are randomly selected for each value of the x variables (i.e., for each x value, a list of all possible y values is made, and some are randomly selected). For many biological problems, the observations will be gathered using simple random sampling or systematic sampling (a grid across the land area). This does not strictly meet this assumption. Also, for more complex sampling designs, such as multistage sampling (sampling large units and sampling smaller units within the large units), this assumption is not met. If the form of the equation is correct, then this does not cause problems. If not, the estimated equation will be biased.
Transformations

Common transformations:
- Powers: x³, x^0.5, etc., for relationships that look nonlinear.
- log10, loge: also for relationships that look nonlinear, or when the variances of y are not equal around the line.
- Sin⁻¹ [arcsine]: when the dependent variable is a proportion.
- Rank transformation, for non-normal data:
  - Sort the y variable
  - Assign a rank to each variable from 1 to n
  - Transform the rank to normal (e.g., Blom transformation)

PROBLEM: we lose some of the information in the original data. Try to transform x first and leave y as the variable of interest; however, this is not always possible. Use graphs to help choose transformations.

Outliers: Unusual Points

Check for points that are quite different from the others on:
- the graph of y versus x
- the residual plot

Do not simply delete the point, as it MAY BE VALID! Check:
- Is this a measurement error? E.g., a tree height of 100 m is very unlikely.
- Is it a transcription error? E.g., for an adult person, a weight of 10 lbs was entered rather than 100 lbs.
- Is there something very unusual about this point? E.g., a bird has a short beak because it was damaged.

Try to fix the observation. If it is very different from the others, or you know there is a measurement error that cannot be fixed, then delete it and indicate this in your research report. On the residual plot, an outlier CAN occur when the model is not correct: a transformation of the variable(s) may be needed, or an important variable is missing.
31 Other methods than SLR (and Multiple Linear Regression), for when transformations do not work (some covered in FRST 530):
Nonlinear least squares: a least squares solution for nonlinear models; uses a search algorithm to find the estimated coefficients; has good properties for large datasets; still assumes normality, equal variances, and independent observations
Weighted least squares: for unequal variances. Estimate the variances and use these in weighting the least squares fit of the regression; assumes normality and independent observations
Generalized linear model: used for distributions other than normal (e.g., binomial, Poisson, etc.), but with no correlation between observations; uses maximum likelihood
Generalized least squares and mixed models: use maximum likelihood for fitting models with unequal variances, correlations over space, or correlations over time, but normally distributed errors
Generalized linear mixed models: allows for unequal variances, correlations over space and/or time, and non-normal distributions; uses maximum likelihood

Measures of Goodness of Fit
How well does the regression fit the sample data? For simple linear regression, a graph of the original data with the fitted line marked on the graph indicates how well the line fits the data [not possible with MLR]. Two measures are commonly used: the coefficient of determination (r²) and the standard error of the estimate (SE_E). 61 62
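Of the alternatives listed above, weighted least squares has a simple closed form for one x. The sketch below is a Python illustration (the course itself uses SAS); the data are made up, and the weights 1/x stand in for estimated variances that grow with x.

```python
def wls_fit(xs, ys, ws):
    """Weighted least squares for y = b0 + b1*x: each squared error is
    weighted by w_i (commonly w_i = 1 / estimated variance_i)."""
    sw = sum(ws)
    # weighted means replace the ordinary means
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    b1 = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys)) / \
         sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    b0 = ybar - b1 * xbar
    return b0, b1

# made-up data where the spread of y appears to grow with x
xs = [1, 2, 3, 4, 5]
ys = [2.2, 3.9, 6.1, 8.4, 9.4]
ws = [1 / x for x in xs]   # hypothetical weights: downweight large-x points
b0, b1 = wls_fit(xs, ys, ws)
print(round(b0, 3), round(b1, 3))
```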
32 To calculate r² and SE_E, first calculate the SSE (this is what was minimized):

SSE = Σ ei² = Σ (yi − ŷi)² = Σ (yi − (b0 + b1·xi))²   (sums over i = 1 to n)

The sum of squared differences between the measured and estimated y's.

Calculate the sum of squares for y:

SSy = Σ (yi − ȳ)² = Σ yi² − (Σ yi)²/n = sy² (n − 1)

The sum of squared differences between the measured y's and the mean of the y-measures. NOTE: In some texts, this is called the sum of squares total.

Calculate the sum of squares regression:

SSreg = Σ (ŷi − ȳ)² = b1·SPxy = SSy − SSE

The sum of squared differences between the mean of the y-measures and the predicted y's from the fitted equation. Also, SSy is the sum of the sum of squares regression and the sum of squared errors: SSy = SSreg + SSE.

Then:  r² = SSreg / SSy = 1 − SSE / SSy
 o r² is the coefficient of determination: the proportion of the variance of y accounted for by the regression using x
 o it is the square of the correlation between x and y
 o ranges from 0 (very poor: a horizontal surface representing no relationship between y and the x's) to 1 (perfect fit: the surface passes through the data)
 o SSE and SSy are based on the y's used in the equation, so r² will not be in original units if y was transformed

And:  SE_E = sqrt( SSE / (n − 2) )
 o SE_E is the standard error of the estimate, in the same units as y
 o SSE is based on the y's used in the equation, so SE_E will not be in original units if y was transformed
 o Under normality of the errors:
   - ŷ ± 1 × SE_E covers about 68% of sample observations
   - ŷ ± 2 × SE_E covers about 95% of sample observations
 o Want a low SE_E. 63 64
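The calculations above can be collected into one short sketch. Python is used for illustration (the course itself uses SAS), and the data are made up.

```python
import math

def slr_fit_stats(xs, ys):
    """Fit y = b0 + b1*x by least squares and return the quantities above:
    SSE, SSy, SSreg, r^2, and SE_E."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - xbar) ** 2 for x in xs)
    spxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = spxy / ssx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ssy = sum((y - ybar) ** 2 for y in ys)
    ssreg = ssy - sse                 # also equals b1 * SPxy
    r2 = ssreg / ssy                  # = 1 - SSE/SSy
    se_e = math.sqrt(sse / (n - 2))   # standard error of the estimate
    return b0, b1, sse, ssy, ssreg, r2, se_e

# made-up data
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, sse, ssy, ssreg, r2, se_e = slr_fit_stats(xs, ys)
print(round(b0, 3), round(b1, 3), round(r2, 4))
```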
33 If the y-variable was transformed: we can calculate estimates of these for the original y-variable units, called I² (Fit Index) and the estimated standard error of the estimate (SE_E′), in order to compare with the r² and SE_E of other equations where y was not transformed.

I² = 1 − SSE/SSY, where SSE and SSY are in original units. NOTE: you must back-transform the predicted y's to calculate the SSE in original units.

I² does not have the same properties as r², however:
 o it can be less than 0
 o it is not the square of the correlation between the y (in original units) and the x used in the equation.

Estimated standard error of the estimate (SE_E′), when the dependent variable, y, has been transformed:

SE_E′ = sqrt( SSE(original units) / (n − 2) )

SE_E′ is the standard error of the estimate, in the same units as the original units of the dependent variable; want a low SE_E′. [Class example]

Estimated Variances, Confidence Intervals and Hypothesis Tests

Testing Whether the Regression is Significant
Does knowledge of x improve the estimate of the mean of y? Or is it a flat surface, meaning we should just use the mean of y as the estimate of the mean of y for any x?

SSE/(n − 2): called the Mean Squared Error (MSE), as it would be the average of the squared errors if we divided by n. Instead, we divide by n − 2. Why? The degrees of freedom are n − 2: n observations with two statistics, b0 and b1, estimated from them. Under the assumptions of SLR, MSE is an unbiased estimate of the true variance of the error terms (the error variance).

SSreg/1: called the Mean Square Regression. Degrees of freedom: 1 per x-variable. Under the assumptions of SLR, this is an estimate of the error variance PLUS a term of variance explained by the regression using x.
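A sketch of the fit index computation above, assuming (hypothetically) that y was modelled as ln(y) = b0 + b1·x. Python is used for illustration (the course itself uses SAS); the measured values and log-scale predictions are made up.

```python
import math

def fit_index(ys, preds_transformed, back):
    """Fit index I^2 = 1 - SSE/SSY computed in ORIGINAL y units after a
    transformation of y.  `back` back-transforms a predicted value
    (e.g. math.exp if the model was fit to ln(y))."""
    n = len(ys)
    ybar = sum(ys) / n
    preds = [back(p) for p in preds_transformed]   # back to original units
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ssy = sum((y - ybar) ** 2 for y in ys)
    i2 = 1 - sse / ssy
    se_e_prime = math.sqrt(sse / (n - 2))          # SE_E' in original units
    return i2, se_e_prime

# made-up measured y's and predicted ln(y)'s from a hypothetical log fit
ys = [2.0, 4.5, 9.1, 20.3, 41.0]
log_preds = [0.75, 1.52, 2.29, 3.06, 3.83]
i2, se_e_prime = fit_index(ys, log_preds, math.exp)
print(round(i2, 3), round(se_e_prime, 3))
```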
34 H0: Regression is not significant
H1: Regression is significant

Same as:
H0: β1 = 0 [the true slope is zero, meaning no relationship with x]
H1: β1 ≠ 0 [the slope is positive or negative, not zero]

This can be tested using an F-test, as it is the ratio of two variances, or with a t-test, since we are only testing one coefficient (more on this later).

Information for the F-test is often shown as an Analysis of Variance Table:

Source      df     SS      MS                F              p-value
Regression  1      SSreg   MSreg = SSreg/1   MSreg/MSE      Prob F > F(1, n−2, 1−α)
Residual    n−2    SSE     MSE = SSE/(n−2)
Total       n−1    SSy

[Class example and explanation of the p-value]

Using an F test statistic:

F = [SSreg/1] / [SSE/(n − 2)] = MSreg / MSE

Under H0, this follows an F distribution for the 1 − α percentile with 1 and n − 2 degrees of freedom. If the F for the fitted equation is larger than the F from the table, we reject H0 (not likely true). The regression is significant, in that the true slope is likely not equal to zero.
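The F statistic from the table above can be computed directly. Python is used for illustration (the course itself uses SAS); the data are made up and the critical value comes from an F table.

```python
def slr_f_test(xs, ys):
    """ANOVA quantities for SLR: SSreg (df = 1), SSE (df = n - 2),
    and the test statistic F = MSreg / MSE."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - xbar) ** 2 for x in xs)
    spxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = spxy / ssx
    b0 = ybar - b1 * xbar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ssy = sum((y - ybar) ** 2 for y in ys)
    msreg = (ssy - sse) / 1           # 1 df for the single x-variable
    mse = sse / (n - 2)
    return msreg / mse

# made-up data; n = 5, so the error df is 3
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
f = slr_f_test(xs, ys)
# from an F table, F(1, 3, 0.95) = 10.13; f far exceeds it,
# so H0 (true slope = 0) is rejected: the regression is significant
print(round(f, 1))
```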
35 Estimated Standard Errors for the Slope and Intercept
Under the assumptions, we can obtain unbiased estimates of the standard errors for the slope and for the intercept [a measure of how these would vary among different sample sets], using the one set of sample data:

sb0² = MSE × (1/n + x̄²/SSx)        sb1² = MSE / SSx

where SSx = Σ (xi − x̄)²

Confidence Intervals for the True Slope and Intercept
Under the assumptions, confidence intervals can be calculated as:

For β0:  b0 ± t(1−α/2, n−2) × sb0
For β1:  b1 ± t(1−α/2, n−2) × sb1

[class example]

Hypothesis Tests for the True Slope and Intercept
H0: β1 = c [the true slope is equal to the constant, c]
H1: β1 ≠ c [the true slope differs from the constant c]

Test statistic:  t = (b1 − c) / sb1

Under H0, this is distributed as a t value of tc = t(n−2, 1−α/2). Reject H0 if |t| > tc.

The procedure is similar for testing the true intercept against a particular value.

It is possible to do one-sided hypotheses also, where the alternative is that the true parameter (slope or intercept) is greater than (or less than) a specified constant c. MUST be careful with the tc, as this is different. [class example] 69 70
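A sketch of the standard errors, confidence intervals, and the t-test of H0: β1 = 0 (c = 0) described above. Python is used for illustration (the course itself uses SAS); the data are made up, and the t table value is supplied by hand.

```python
import math

def slr_inference(xs, ys, t_crit):
    """Estimated standard errors and confidence intervals for b0 and b1.
    t_crit is t(1 - alpha/2, n - 2), looked up from a t table."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - xbar) ** 2 for x in xs)
    spxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = spxy / ssx
    b0 = ybar - b1 * xbar
    mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    se_b1 = math.sqrt(mse / ssx)
    se_b0 = math.sqrt(mse * (1 / n + xbar ** 2 / ssx))
    ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
    ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return b0, b1, se_b0, se_b1, ci_b0, ci_b1

# made-up data; t(0.975, df = 3) = 3.182 from a t table
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, se_b0, se_b1, ci_b0, ci_b1 = slr_inference(xs, ys, 3.182)
t_stat = b1 / se_b1    # test of H0: beta1 = 0 (c = 0)
print(round(t_stat, 1), [round(v, 2) for v in ci_b1])
```

Note that for this test, t² equals the F statistic from the ANOVA table, which is why the F-test and the t-test of the slope agree in SLR.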
36 Confidence Interval for the True Mean of y given a particular x value
For the mean of all possible y-values given a particular value of x (μ of y at xh):

ŷh ± t(n−2, 1−α/2) × s(ŷh)

where
ŷh = b0 + b1·xh
s²(ŷh) = MSE × (1/n + (xh − x̄)² / SSx)

Confidence Bands
Plot of the confidence intervals for the mean of y for several x-values. Will appear as: [figure: fitted line with confidence bands that widen away from the mean of x]
37 Confidence Interval for 1 or more new y-values given a particular x value
For one possible new y-value given a particular value of x:

ŷh(new) ± t(n−2, 1−α/2) × s(ŷh(new))

where
ŷh(new) = b0 + b1·xh
s²(ŷh(new)) = MSE × (1 + 1/n + (xh − x̄)² / SSx)

For the average of g new possible y-values given a particular value of x:

ŷh(new g) ± t(n−2, 1−α/2) × s(ŷh(new g))

where
ŷh(new g) = b0 + b1·xh
s²(ŷh(new g)) = MSE × (1/g + 1/n + (xh − x̄)² / SSx)

[class example]

Selecting Among Alternative Models

Process to Fit an Equation using Least Squares
Steps:
1. Sample data are needed, on which the dependent variable and all explanatory (independent) variables are measured.
2. Make any transformations that are needed to meet the most critical assumption: the relationship between y and x is linear. Example: volume = β0 + β1·dbh² may be linear, whereas volume versus dbh is not. Use yi = volume, xi = dbh².
3. Fit the equation to minimize the sum of squared errors.
4. Check the assumptions. If they are not met, go back to Step 2.
5. If the assumptions are met, then interpret the results. Is the regression significant? What is the r²? What is the SE_E? Plot the fitted equation over the plot of y versus x. 73 74
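The two interval formulas above (mean of y, and one or more new y-values) differ only in the extra 1/g term, so they can be computed by one routine. Python is used for illustration (the course itself uses SAS); the data, the xh value, and the t table value are made up or supplied by hand.

```python
import math

def interval_at(xs, ys, xh, t_crit, g=None):
    """Interval estimates at x = xh from a simple linear regression:
    g is None -> CI for the MEAN of y given xh;
    g = 1     -> interval for one new y;
    g > 1     -> interval for the average of g new y's."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ssx = sum((x - xbar) ** 2 for x in xs)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / ssx
    b0 = ybar - b1 * xbar
    mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    extra = 0.0 if g is None else 1.0 / g     # the only difference
    var = mse * (extra + 1.0 / n + (xh - xbar) ** 2 / ssx)
    yhat = b0 + b1 * xh
    half = t_crit * math.sqrt(var)
    return yhat - half, yhat + half

# made-up data; t(0.975, df = 3) = 3.182 from a t table
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
ci = interval_at(xs, ys, 3.5, 3.182)          # mean of y at x = 3.5
pi = interval_at(xs, ys, 3.5, 3.182, g=1)     # one new y at x = 3.5
# the interval for a new y is always wider than the CI for the mean
print([round(v, 2) for v in ci], [round(v, 2) for v in pi])
```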
38 For a number of models, select based on:
1. Meeting assumptions: if an equation does not meet the assumption of a linear relationship, it is not a candidate model.
2. Compare the fit statistics: select higher r² (or I²) and lower SE_E (or SE_E′).
3. Reject any model where the regression is not significant, since this model is no better than just using the mean of y as the predicted value.
4. Select a model that is biologically tractable. A simpler model is generally preferred, unless there are practical/biological reasons to select the more complex model.
5. Consider the cost of using the model.

[class example]

Simple Linear Regression Example
[Data table: observations of temperature (x) and weight (y), columns Observation, temp, weight; et cetera. Scatterplots of weight (y) versus temperature (x).] 75 76
39 [Scatterplot: weight versus temperature]

Obs.  temp  weight  x-diff  x-diff. sq.  …  Et cetera
[Worked table rows; the column totals give the means, SSX, SSY, and SPXY]

b1 = SPxy / SSx
b0 = ȳ − b1 · x̄

NOTE: calculate b1 first, since this is needed to calculate b0.
40 From these, the residuals (errors) for the equation, and the sum of squared errors (SSE), were calculated:

Obs.  weight  y-pred  residual  residual sq.  …  Et cetera

SSE: …  And SSR = SSY − SSE

F = … with p < .0001 (very small). In Excel, use FDIST(x, df1, df2) to obtain a p-value.
r²: 0.97
Root MSE, or SE_E: .57

BUT: Before interpreting the ANOVA table, are the assumptions met?

ANOVA
Source   df    SS    MS
Model    …
Error    …
Total    …
41 [Residual plot: residuals (errors) versus predicted weight]

Linear? Equal variance? Independent observations?

Normality plot:
Obs.  sorted resids  stand. resids  rel. freq.  Prob. z-dist.  …  Etc. 81 82
42 [Probability plot: cumulative probability versus z-value, with the relative frequency and the Prob. z-dist. overlaid]

Questions:
1. Are the assumptions of simple linear regression met? Evidence?
2. If so, interpret whether this is a good equation, based on goodness-of-fit measures. 83 84
43 3. Is the regression significant?

For 95% confidence intervals for b0 and b1, we would also need the estimated standard errors:

sb0² = MSE × (1/n + x̄²/SSx)        sb1² = MSE / SSx

The t-value for 26 degrees of freedom and the 0.975 percentile is 2.056 (TINV(0.05,26) in Excel). 85 86
44 For β0:  b0 ± t(1−α/2, n−2) × sb0 = 5.85 ± …
For β1:  b1 ± t(1−α/2, n−2) × sb1 = … ± …

       Est. Coeff   St. Error   CI lower   CI upper
b0:    …            …           …          …
b1:    …            …           …          …
(using t(0.975, 26) = 2.056)

Question: Could the real intercept be equal to 0?

Given a temperature of …, what is the estimated average weight (predicted value), and a 95% confidence interval for this estimate?

ŷh = b0 + b1·xh
s²(ŷh) = MSE × (1/n + (xh − 37.5)² / SSx)
ŷh ± t(n−2, 1−α/2) × s(ŷh) 87 88
45 Given a temperature of …, what is the estimated weight for one new observation, and a 95% confidence interval for this estimate?

ŷh(new) = b0 + b1·xh
s²(ŷh(new)) = MSE × (1 + 1/n + (xh − 37.5)² / SSx)
ŷh(new) ± t(n−2, 1−α/2) × s(ŷh(new))

If the assumptions were not met, we would have to make some transformations and start over again!
46 SAS code:

* wttemp.sas ;
options ls=70 ps=50;
run;
DATA regdata;
  input temp weight;
cards;
[data lines]
run;
DATA regdata; set regdata;
  tempsq=temp**2;
  tempcub=temp**3;
  logtemp=log(temp);
run;
Proc plot data=regdata;
  plot weight*(temp tempsq logtemp)='*';
run;
* ;
PROC REG data=regdata simple;
  model weight=temp;
  output out=out p=yhat r=resid;
run;
* ;
PROC PLOT DATA=out;
  plot resid*yhat;
run;
* ;
PROC univariate data=out plot normal;
  Var resid;
Run; 91 92
47 SAS outputs:
1) Graphs: which appears more linear?
2) How many observations were there?
3) What is the mean weight?

The REG Procedure
Model: MODEL1
Dependent Variable: weight
Number of Observations Read  28
Number of Observations Used  28

Analysis of Variance
Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
Model              1    …                 …              …          <.0001
Error             26    …                 …
Corrected Total   27    …

Root MSE        .5760    R-Square    …
Dependent Mean  7.…      Adj R-Sq    0.97
Coeff Var       …

Parameter Estimates
Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1    …                     …                 5.4        <.0001
temp         1    …                     …                 3.98       <.0001 93 94
48 Plot of resid*yhat. Legend: A = 1 obs, B = 2 obs, etc.
[ASCII plot: residuals versus the predicted value of weight]

The UNIVARIATE Procedure
Variable: resid (Residual)

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W      …           Pr < W      …
Kolmogorov-Smirnov    D      …           Pr > D      >0.1500
Cramer-von Mises      W-Sq   …           Pr > W-Sq   >0.2500
Anderson-Darling      A-Sq   …           Pr > A-Sq   >0.2500

[Normal probability plot of the residuals] 95 96
49 Multiple Linear Regression (MLR)

For example:
Population:  yi = β0 + β1·x1i + β2·x2i + … + βm·xmi + εi
Sample:      yi = b0 + b1·x1i + b2·x2i + … + bm·xmi + ei
             ŷi = b0 + b1·x1i + b2·x2i + … + bm·xmi ;  ei = yi − ŷi

β0 is the intercept parameter
β1, β2, β3, …, βm are slope parameters
x1i, x2i, x3i, …, xmi are the independent variables
εi is the error term or residual: the variation in the dependent variable (the y) which is not accounted for by the independent variables (the x's).

NOTE: In the text by Kutner et al., p = m + 1. This is not to be confused with the p-value indicating significance in hypothesis tests.

For any fitted equation (we have the estimated parameters), we can get the estimated average for the dependent variable for any set of x's. This will be the predicted value ŷi, which is the estimated average of y given the particular values for the x variables.

Predicted log10(vol) = b0 + b1 × log10(dbh) + b2 × log10(height), where b0 (= −4.…), b1, and b2 were estimated by finding the least squared error solution.

Using this equation for dbh = 30 cm, height = 28 m: log10(dbh) = 1.48, log10(height) = 1.45; these give log10(vol), and from it the volume (m³). This represents the estimated average volume for trees with dbh = 30 cm and height = 28 m.

Note: This equation is originally a nonlinear equation:

vol = a · dbh^b · ht^c · ε

which was transformed to a linear equation using logarithms:

log10(vol) = log10(a) + b·log10(dbh) + c·log10(ht) + log10(ε)

and this was fitted using multiple linear regression. 97 98
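A sketch of how the least-squares solution for MLR can be obtained by solving the normal equations (X′X)b = X′y. Python is used for illustration (in the course, SAS PROC REG does this for you); the data are made up and generated exactly from known coefficients so the fit recovers them.

```python
def mlr_fit(rows, ys):
    """Least-squares coefficients for y = b0 + b1*x1 + ... + bm*xm,
    found by solving the normal equations (X'X) b = X'y with
    Gaussian elimination and partial pivoting."""
    n = len(rows)
    X = [[1.0] + list(r) for r in rows]    # prepend the intercept column
    p = len(X[0])                          # number of coefficients, m + 1
    xtx = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    xty = [sum(X[i][j] * ys[i] for i in range(n)) for j in range(p)]
    # forward elimination
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, p):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, p):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # back substitution
    b = [0.0] * p
    for r in range(p - 1, -1, -1):
        b[r] = (xty[r] - sum(xtx[r][c] * b[c] for c in range(r + 1, p))) / xtx[r][r]
    return b

# made-up data generated exactly from y = 1 + 2*x1 + 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 3), (4, 1)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
b = mlr_fit(rows, ys)
print([round(v, 6) for v in b])
```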
50 For the observations in the sample data used to fit the regression, we can also get an estimate of the error (we have the measured volume). If the measured volume for this tree was … m³, then in log10 units:

error: ei = yi − ŷi

for the fitted equation using log10 units. In original units, the estimated error is … (the measured volume minus the back-transformed predicted volume). NOTE: This is not simply the antilog of the error in log10 units.

Finding the Set of Coefficients that Minimizes the Sum of Squared Errors

Same process as for SLR: find the set of coefficients that results in the minimum SSE, just that there are more parameters, and therefore more partial derivative equations and more equations to solve.
 o E.g., with 3 x-variables, there will be 4 coefficients (an intercept plus 3 slopes), so four equations.
For linear models, there will be one unique mathematical solution. For nonlinear models, this is not possible and we must search to find a solution.
Using the criterion of finding the maximum likelihood (probability) rather than the minimum SSE, we would need to search for a solution even for linear models (covered in other courses, e.g., FRST 530). 99 100
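The NOTE above, that the error in original units is not simply the antilog of the log10-scale error, can be checked numerically. The volume and prediction below are made up for illustration.

```python
import math

# hypothetical values: predicted log10(volume) and measured volume (m^3)
log_yhat = 0.95
yhat = 10 ** log_yhat     # predicted volume back-transformed to m^3
y = 10.2                  # measured volume in m^3

error_original = y - yhat                    # error in original units (m^3)
error_log = math.log10(y) - log_yhat         # error in log10 units
antilog_of_log_error = 10 ** error_log       # NOT the same quantity:
                                             # it equals the RATIO y / yhat
print(round(error_original, 3), round(antilog_of_log_error, 3))
```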
More informationStatistical tables are attached Two Hours UNIVERSITY OF MANCHESTER. May 2007 Final Draft
Statistical tables are attached wo Hours UNIVERSIY OF MNHESER Ma 7 Final Draft Medical Statistics M377 Electronic calculators ma be used provided that the cannot store tet nswer LL si questions in SEION
More information36-309/749 Experimental Design for Behavioral and Social Sciences. Sep. 22, 2015 Lecture 4: Linear Regression
36-309/749 Experimental Design for Behavioral and Social Sciences Sep. 22, 2015 Lecture 4: Linear Regression TCELL Simple Regression Example Male black wheatear birds carry stones to the nest as a form
More informationPubH 7405: REGRESSION ANALYSIS SLR: DIAGNOSTICS & REMEDIES
PubH 7405: REGRESSION ANALYSIS SLR: DIAGNOSTICS & REMEDIES Normal Error RegressionModel : Y = β 0 + β ε N(0,σ 2 1 x ) + ε The Model has several parts: Normal Distribution, Linear Mean, Constant Variance,
More informationCorrelation and Simple Linear Regression
Correlation and Simple Linear Regression Sasivimol Rattanasiri, Ph.D Section for Clinical Epidemiology and Biostatistics Ramathibodi Hospital, Mahidol University E-mail: sasivimol.rat@mahidol.ac.th 1 Outline
More informationST505/S697R: Fall Homework 2 Solution.
ST505/S69R: Fall 2012. Homework 2 Solution. 1. 1a; problem 1.22 Below is the summary information (edited) from the regression (using R output); code at end of solution as is code and output for SAS. a)
More informationOdor attraction CRD Page 1
Odor attraction CRD Page 1 dm'log;clear;output;clear'; options ps=512 ls=99 nocenter nodate nonumber nolabel FORMCHAR=" ---- + ---+= -/\*"; ODS LISTING; *** Table 23.2 ********************************************;
More informationCorrelation & Simple Regression
Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.
More informationSTAT 3A03 Applied Regression Analysis With SAS Fall 2017
STAT 3A03 Applied Regression Analysis With SAS Fall 2017 Assignment 5 Solution Set Q. 1 a The code that I used and the output is as follows PROC GLM DataS3A3.Wool plotsnone; Class Amp Len Load; Model CyclesAmp
More informationIntroduction to Regression
Introduction to Regression Using Mult Lin Regression Derived variables Many alternative models Which model to choose? Model Criticism Modelling Objective Model Details Data and Residuals Assumptions 1
More informationECON3150/4150 Spring 2015
ECON3150/4150 Spring 2015 Lecture 3&4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo January 29, 2015 1 / 67 Chapter 4 in S&W Section 17.1 in S&W (extended OLS assumptions) 2
More information1 A Review of Correlation and Regression
1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then
More informationBIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression
BIOL 458 BIOMETRY Lab 9 - Correlation and Bivariate Regression Introduction to Correlation and Regression The procedures discussed in the previous ANOVA labs are most useful in cases where we are interested
More informationBE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club
BE640 Intermediate Biostatistics 2. Regression and Correlation Simple Linear Regression Software: SAS Emergency Calls to the New York Auto Club Source: Chatterjee, S; Handcock MS and Simonoff JS A Casebook
More informationLecture 15 Multiple regression I Chapter 6 Set 2 Least Square Estimation The quadratic form to be minimized is
Lecture 15 Multiple regression I Chapter 6 Set 2 Least Square Estimation The quadratic form to be minimized is Q = (Y i β 0 β 1 X i1 β 2 X i2 β p 1 X i.p 1 ) 2, which in matrix notation is Q = (Y Xβ) (Y
More informationSTATISTICS 479 Exam II (100 points)
Name STATISTICS 79 Exam II (1 points) 1. A SAS data set was created using the following input statement: Answer parts(a) to (e) below. input State $ City $ Pop199 Income Housing Electric; (a) () Give the
More informationInference for Regression Inference about the Regression Model and Using the Regression Line
Inference for Regression Inference about the Regression Model and Using the Regression Line PBS Chapter 10.1 and 10.2 2009 W.H. Freeman and Company Objectives (PBS Chapter 10.1 and 10.2) Inference about
More informationAnalysis of variance and regression. April 17, Contents Comparison of several groups One-way ANOVA. Two-way ANOVA Interaction Model checking
Analysis of variance and regression Contents Comparison of several groups One-way ANOVA April 7, 008 Two-way ANOVA Interaction Model checking ANOVA, April 008 Comparison of or more groups Julie Lyng Forman,
More informationSimple Linear Regression
Chapter 2 Simple Linear Regression Linear Regression with One Independent Variable 2.1 Introduction In Chapter 1 we introduced the linear model as an alternative for making inferences on means of one or
More informationPredict y from (possibly) many predictors x. Model Criticism Study the importance of columns
Lecture Week Multiple Linear Regression Predict y from (possibly) many predictors x Including extra derived variables Model Criticism Study the importance of columns Draw on Scientific framework Experiment;
More informationSTA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3
STA 303 H1S / 1002 HS Winter 2011 Test March 7, 2011 LAST NAME: FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 303 STA 1002 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator. Some formulae
More informationTopic 12. The Split-plot Design and its Relatives (continued) Repeated Measures
12.1 Topic 12. The Split-plot Design and its Relatives (continued) Repeated Measures 12.9 Repeated measures analysis Sometimes researchers make multiple measurements on the same experimental unit. We have
More informationChapter 16. Simple Linear Regression and Correlation
Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7
MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is
More informationFRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE
FRANKLIN UNIVERSITY PROFICIENCY EXAM (FUPE) STUDY GUIDE Course Title: Probability and Statistics (MATH 80) Recommended Textbook(s): Number & Type of Questions: Probability and Statistics for Engineers
More informationFinal Exam - Solutions
Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your
More informationConfidence Intervals, Testing and ANOVA Summary
Confidence Intervals, Testing and ANOVA Summary 1 One Sample Tests 1.1 One Sample z test: Mean (σ known) Let X 1,, X n a r.s. from N(µ, σ) or n > 30. Let The test statistic is H 0 : µ = µ 0. z = x µ 0
More information5.3 Three-Stage Nested Design Example
5.3 Three-Stage Nested Design Example A researcher designs an experiment to study the of a metal alloy. A three-stage nested design was conducted that included Two alloy chemistry compositions. Three ovens
More informationLinear Regression. Simple linear regression model determines the relationship between one dependent variable (y) and one independent variable (x).
Linear Regression Simple linear regression model determines the relationship between one dependent variable (y) and one independent variable (x). A dependent variable is a random variable whose variation
More informationA Re-Introduction to General Linear Models (GLM)
A Re-Introduction to General Linear Models (GLM) Today s Class: You do know the GLM Estimation (where the numbers in the output come from): From least squares to restricted maximum likelihood (REML) Reviewing
More informationTHE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE
THE ROYAL STATISTICAL SOCIETY 004 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE PAPER II STATISTICAL METHODS The Society provides these solutions to assist candidates preparing for the examinations in future
More information